Anikait Singh

I'm a second-year Ph.D. student at Stanford AI, advised by Chelsea Finn. I closely collaborate with Aviral Kumar and Tatsu Hashimoto. Previously, I was at UC Berkeley, advised by Sergey Levine. I was also a student researcher at Google DeepMind Robotics and at the Toyota Research Institute. My research is supported by the NSF Graduate Research Fellowship.

Email  /  CV  /  Scholar  /  Twitter  /  Github

Research

My research focuses on advancing decision-making methodologies, particularly in reinforcement learning, and scaling these approaches to broader applications. I am especially interested in multi-step reasoning in complex domains such as code and mathematics, as well as in designing alignment algorithms and personalization techniques. My ultimate goal is to develop foundation models for decision-making that leverage diverse, large-scale datasets to achieve robust generalization, rapid and efficient learning, and adaptability to individual needs.

Few-Shot Preference Optimization of Synthetic Preferences Elicits Effective Personalization to Real Users
Anikait Singh*, Sheryl Hsu*, Kyle Hsu, Eric Mitchell, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
ACL, 2025

As language models increasingly interact with a diverse user base, it becomes important for models to generate responses that align with individual user preferences. Few-Shot Preference Optimization (FSPO) is a meta-learning framework that leverages the strong in-context learning capabilities of an LLM to capture the diversity of human preferences.

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
CoLM, 2025

This study demonstrates that a language model's intrinsic reasoning behaviors—such as verification, backtracking, subgoal setting, and backward chaining—are key to its ability to self-improve under reinforcement learning.

Improving Test-Time Search in LLMs with Backtracking Against In-Context Value Verifiers
Anikait Singh, Kushal Arora, Sedrick Keh, Jean Mercat, Tatsunori Hashimoto, Chelsea Finn, Aviral Kumar
ICLR, 2025, Workshop on Reasoning and Planning for LLMs (Runner-Up Best Paper)

This paper introduces a novel approach combining process verifiers with preemptive backtracking to efficiently identify and resample problematic steps, significantly reducing computation through focused revisions and in-context process supervision.

Test-Time Alignment via Hypothesis Reweighting
Yoonho Lee, Jonathan Williams, Henrik Marklund, Archit Sharma, Eric Mitchell, Anikait Singh, Chelsea Finn
Preprint, 2024

We introduce HyRe, a test-time adaptation framework that trains a single neural network with multiple prediction heads—each encoding a different behavior consistent with the training data—and dynamically reweights them using a small set of labeled examples from the target distribution. With only five preference pairs per distribution, HyRe scales to large pretrained models at roughly the cost of fine-tuning one model and outperforms prior state-of-the-art 2B-parameter reward models across 18 personalization and distribution-shift scenarios.
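
As a rough illustration of the reweighting idea (a sketch, not the paper's implementation; the function names and the softmax-over-losses rule below are my assumptions), each head is scored on a handful of labeled target examples and the ensemble weights shift toward the heads that fit them best:

```python
import torch
import torch.nn.functional as F

def reweight_heads(per_head_losses: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Turn each head's loss on a few labeled target examples (e.g., ~5
    preference pairs) into ensemble weights: lower loss -> higher weight."""
    return F.softmax(-per_head_losses / temperature, dim=0)

def ensemble_predict(head_outputs: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Weighted combination of per-head predictions (e.g., reward scores)."""
    return torch.einsum("h,h...->...", weights, head_outputs)

# Toy usage: 4 heads evaluated on a small adaptation set, then combined
# to score 10 candidate responses.
losses = torch.tensor([0.21, 0.85, 0.40, 0.33])
weights = reweight_heads(losses, temperature=0.5)
scores = ensemble_predict(torch.randn(4, 10), weights)
```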

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn
Preprint, 2025

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms.

Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, Nick Haber
CoLM, 2025

Big-Math is a dataset of over 250,000 high-quality, verifiable math questions curated for reinforcement learning. It bridges the gap between quality and quantity by rigorously filtering existing datasets and introducing 47,000 newly reformulated questions, offering greater diversity and a wider range of difficulty than commonly used alternatives such as GSM8K and MATH.

Personalized Preference Fine-tuning of Diffusion Models
Meihua Dang*, Anikait Singh*, Linqi Zhou, Stefano Ermon, Jiaming Song
CVPR, 2025

PPD is a multi-reward optimization approach that personalizes text-to-image diffusion models by extracting user preference embeddings from a few examples and incorporating them via cross-attention during DPO fine-tuning, enabling effective generalization to unseen users with as few as four preference examples.

Adaptive Inference-Time Compute: Large Language Models Can Predict if They Can Do Better, Even Mid-Generation
Rohin Manvi, Anikait Singh, Stefano Ermon
Preprint, 2024

This paper proposes a generative self-evaluation scheme enabling large language models (LLMs) to internally predict mid-generation whether restarting would yield improved responses, significantly reducing computational costs by adaptively limiting sample generation. Experiments show this method achieves most of the performance gains of extensive multi-sample generation (e.g., increasing Llama 3.1 8B's win rate against GPT-4 from 21% to 34%) with dramatically fewer samples and minimal overhead.
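
A minimal sketch of the adaptive sampling loop is below. The `generate` and `self_evaluate` callables are placeholders I've introduced for illustration: in the paper the self-evaluation is produced by the model itself mid-generation, whereas here it is abstracted into a separate function.

```python
def adaptive_best_of_n(prompt, generate, self_evaluate,
                       max_samples=8, stop_threshold=0.3):
    """Keep sampling only while the model predicts that another sample is
    likely to beat the current best (illustrative sketch, not the paper's code).

    generate(prompt) -> candidate response (str)
    self_evaluate(prompt, response) -> estimated probability that a fresh
        sample would be better than this response
    """
    best, best_quality = None, -float("inf")
    for _ in range(max_samples):
        response = generate(prompt)
        p_better = self_evaluate(prompt, response)
        quality = 1.0 - p_better              # low chance of improvement => strong response
        if quality > best_quality:
            best, best_quality = response, quality
        if p_better < stop_threshold:         # unlikely that more samples help: stop early
            break
    return best
```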

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
Anikait Singh*, Fahim Tajwar*, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar
ICML, 2024

Learning from preferences is a common paradigm for fine-tuning language models, yet many algorithmic design decisions come into play. This work finds that approaches employing on-policy sampling or negative gradients outperform offline, maximum-likelihood objectives.

D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Rafael Rafailov*, Kyle Beltran Hatch*, Anikait Singh, Aviral Kumar, Laura Smith, Ilya Kostrikov, Philippe Estruch, Victor Kolev, Philip J. Ball, Jiajun Wu, Sergey Levine, Chelsea Finn
RLC, 2024

Offline RL algorithms enable data-driven learning from large pre-collected datasets without the need for costly or dangerous real-world exploration. Evaluating progress, however, requires effective and challenging benchmarks that capture real-world task properties, so D5RL proposes a new benchmark for offline RL based on realistic robotic simulations and diverse data sources that supports both offline RL and online fine-tuning evaluation.

Robotic Offline RL from Internet Videos via Value-Function Pre-Training
Chethan Bhateja*, Derek Guo*, Dibya Ghosh*, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, Aviral Kumar
ICRA, 2024

V-PTR is a framework that combines the benefits of pre-training on Internet video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that are robust and generalizable.

Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Collaboration
CoRL, 2024

Open X-Embodiment is an open-source dataset spanning a large collection of robot embodiments. We study how vision-language models trained on X-Embodiment data enable efficient adaptation to new robots, tasks, and environments.

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Google DeepMind Robotics
CoRL, 2023

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning.

Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints
Anikait Singh*, Aviral Kumar*, Quan Vuong, Yevgen Chebotar, Sergey Levine
NeurIPS, 2023

CQL (ReDS) is an offline RL method that modifies a typical distribution constraint into an approximate support-level constraint via re-weighting to enable efficient learning from heteroskedastic dataset compositions.

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning
Mitsuhiko Nakamoto*, Yuexiang Zhai*, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, Sergey Levine
NeurIPS, 2023

Cal-QL learns a conservative value-function initialization from offline data that underestimates the value of the learned policy while remaining calibrated, in the sense that the learned Q-values are at a reasonable scale. This leads to effective online fine-tuning, retaining the benefits of the offline initialization.
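
To make the calibration idea concrete, here is a hedged PyTorch sketch of the key term; the tensor names are mine, and this paraphrases the core idea (only push policy Q-values down toward a reference value, such as the behavior policy's Monte Carlo return) rather than reproducing the authors' code.

```python
import torch

def calibrated_conservative_penalty(q_policy_actions: torch.Tensor,
                                    q_data_actions: torch.Tensor,
                                    v_reference: torch.Tensor,
                                    alpha: float = 1.0) -> torch.Tensor:
    """Sketch of a calibrated conservative regularizer (illustrative only).

    q_policy_actions: Q(s, a) for actions sampled from the learned policy.
    q_data_actions:   Q(s, a) for actions stored in the offline dataset.
    v_reference:      reference values, e.g. Monte Carlo returns of the
                      behavior policy, acting as a calibration floor.
    Plain conservatism pushes q_policy_actions down indiscriminately; the
    calibrated version only pushes them down toward the reference value,
    keeping Q-values at a reasonable scale for online fine-tuning.
    """
    clipped = torch.maximum(q_policy_actions, v_reference)
    return alpha * (clipped.mean() - q_data_actions.mean())
```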

Pre-Training for Robots: Offline RL Enables Learning from a Handful of Trials
Aviral Kumar*, Anikait Singh*, Frederik Ebert*, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, Sergey Levine
RSS, 2023

PTR is a framework based on offline RL that attempts to effectively learn new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task, with as few as 10 demonstrations.

When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?
Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine
ICLR, 2022

A theoretical paper that characterizes the properties of environments in which offline RL methods can outperform BC, even when provided only with expert data. Additionally, offline RL policies trained on sufficiently noisy, suboptimal data can outperform BC trained on expert data, especially on long-horizon problems.

A Workflow for Offline Model-Free Robotic Reinforcement Learning
Aviral Kumar*, Anikait Singh*, Stephen Tian, Chelsea Finn, Sergey Levine
CoRL, 2021, (Oral Presentation)

Our proposed workflow aims to detect overfitting and underfitting in model-free offline RL, and provides guidelines for addressing these issues via policy selection, regularization, and architecture design.

A Mobile Application for Keyword Search in Real-World Scenes
Shrinivas Pundlik, Anikait Singh, Gautam Baghel, Vilte Baliutaviciute, Gang Luo
IEEE Journal of Translational Engineering in Health and Medicine, 2019

A system that helps visually impaired patients localize words in cluttered environments. It combines OCR with Levenshtein-distance matching, along with specialized audio cues and additional assistive features, to enable efficient, intuitive search in crowded, diverse scenes.
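
The matching step can be pictured with the small sketch below (a plain-Python approximation; the actual system's OCR pipeline, thresholds, and audio-cue logic are not shown): OCR produces candidate words, and fuzzy matching with Levenshtein distance tolerates recognition errors when searching for the target keyword.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_keyword(ocr_words, keyword, max_distance=2):
    """Return OCR'd words within a small edit distance of the search keyword,
    tolerating recognition errors in cluttered scenes."""
    keyword = keyword.lower()
    return [w for w in ocr_words if levenshtein(w.lower(), keyword) <= max_distance]

# Toy usage: in practice the word list would come from an OCR engine run on
# the camera frame.
print(match_keyword(["EXIT", "Ex1t", "SALE", "exlt"], "exit"))  # ['EXIT', 'Ex1t', 'exlt']
```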

Teaching

Graduate Student Instructor, CS224R Spring 2025

Program Coordinator, Mentor, Deep Learning Portal 2024

Undergraduate Student Instructor, CS285 Fall 2022

Undergraduate Student Instructor, CS188 Spring 2022

Undergraduate Student Instructor, CS285 Fall 2021