Anikait Singh
I'm a second-year Ph.D. student at Stanford AI, advised by Chelsea Finn. I closely collaborate with Aviral Kumar and Tatsu Hashimoto. Previously, I was at UC Berkeley, advised by Sergey Levine. I have also been a student researcher at Google DeepMind Robotics and the Toyota Research Institute. My research is supported by the NSF Graduate Research Fellowship.
Email / CV / Scholar / Twitter / Github
Research
My research focuses on advancing decision-making methods, particularly in reinforcement learning, and on scaling these approaches to broader applications. I am especially interested in multi-step reasoning in complex domains such as code and mathematics, as well as in designing alignment algorithms and personalization techniques. My ultimate goal is to develop foundation models for decision-making that leverage diverse, large-scale datasets to achieve robust generalization, rapid and efficient learning, and adaptability to individual needs.
Few-Shot Preference Optimization of Synthetic Preferences Elicits Effective Personalization to Real Users
Anikait Singh*, Sheryl Hsu*, Kyle Hsu, Eric Mitchell, Tatsu Hashimoto, Archit Sharma, Chelsea Finn
ACL, 2025
As language models increasingly interact with a diverse user base, it becomes important for them to generate responses that align with individual user preferences. Few-Shot Preference Optimization (FSPO) is a meta-learning framework that leverages the strong in-context learning capabilities of an LLM to capture the diversity of human preferences.
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
CoLM, 2025
This study demonstrates that a language model's intrinsic reasoning behaviors—such as verification, backtracking, subgoal setting, and backward chaining—are key to its ability to self-improve under reinforcement learning.
Improving Test-Time Search in LLMs with Backtracking Against In-Context Value Verifiers
Anikait Singh, Kushal Arora, Sedrick Keh, Jean Mercat, Tatsunori Hashimoto, Chelsea Finn, Aviral Kumar
ICLR 2025 Workshop on Reasoning and Planning for LLMs (Best Paper Runner-Up)
This paper introduces a novel approach combining process verifiers with preemptive backtracking to efficiently identify and resample problematic steps, significantly reducing computation through focused revisions and in-context process supervision.
Test-Time Alignment via Hypothesis Reweighting
Yoonho Lee, Jonathan Williams, Henrik Marklund, Archit Sharma, Eric Mitchell, Anikait Singh, Chelsea Finn
Preprint, 2024
We introduce HyRe, a test-time adaptation framework that trains a single neural network with multiple prediction heads—each encoding a different behavior consistent with the training data—and dynamically reweights them using a small set of labeled examples from the target distribution. With only five preference pairs per distribution, HyRe scales to large pretrained models at roughly the cost of fine-tuning one model and outperforms prior state-of-the-art 2B-parameter reward models across 18 personalization and distribution-shift scenarios.
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn
We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms.
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, Nick Haber
CoLM, 2025
Big-Math is a dataset of over 250,000 high-quality, verifiable math questions created for reinforcement learning. It bridges the gap between quality and quantity by rigorously filtering existing datasets and introducing 47,000 newly reformulated questions, offering greater diversity and a wider range of difficulty than commonly used alternatives such as GSM8k and MATH.
Personalized Preference Fine-tuning of Diffusion Models
Meihua Dang*, Anikait Singh*, Linqi Zhou, Stefano Ermon, Jiaming Song
CVPR, 2025
PPD is a multi-reward optimization approach that personalizes text-to-image diffusion models by extracting user preference embeddings from a few examples and incorporating them via cross-attention during DPO fine-tuning. This enables effective generalization to unseen users from as few as four preference examples.
Adaptive Inference-Time Compute: Large Language Models Can Predict if They Can Do Better, Even Mid-Generation
Rohin Manvi, Anikait Singh, Stefano Ermon
Preprint, 2024
This paper proposes a generative self-evaluation scheme enabling large language models (LLMs) to internally predict mid-generation whether restarting would yield improved responses, significantly reducing computational costs by adaptively limiting sample generation. Experiments show this method achieves most of the performance gains of extensive multi-sample generation (e.g., increasing Llama 3.1 8B's win rate against GPT-4 from 21% to 34%) with dramatically fewer samples and minimal overhead.
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
Anikait Singh*, Fahim Tajwar*, Archit Sharma, Rafael Rafailov,
Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar
ICML, 2024
Learning from preferences is a common paradigm for fine-tuning language models, yet many algorithmic design decisions come into play. This work finds that approaches employing on-policy sampling or negative gradients outperform offline, maximum-likelihood objectives.
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Rafael Rafailov*, Kyle Beltran Hatch*, Anikait Singh, Aviral Kumar,
Laura Smith, Ilya Kostrikov, Philippe Estruch, Victor Kolev,
Philip J. Ball, Jiajun Wu, Sergey Levine, Chelsea Finn
RLC, 2024
Offline RL algorithms enable data-driven methods that leverage large pre-collected datasets without the need for costly or dangerous real-world exploration. However, evaluating progress requires effective and challenging benchmarks that capture the properties of real-world tasks. D5RL is a new benchmark for offline RL based on realistic robotic simulations and diverse data sources, supporting evaluation of both offline RL and online fine-tuning.
Robotic Offline RL from Internet Videos via Value-Function Pre-Training
Chethan Bhateja*, Derek Guo*, Dibya Ghosh*, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, Aviral Kumar
ICRA, 2024
V-PTR is a framework that combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that are robust and generalizable.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Collaboration
CoRL, 2024
Open X-Embodiment is an open-source dataset comprising robot learning data from a large collection of robot embodiments. We study how vision-language models trained on X-Embodiment data can enable efficient adaptation to new robots, tasks, and environments.
Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning
Mitsuhiko Nakamoto*, Yuexiang Zhai*, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, Sergey Levine
NeurIPS, 2023
Cal-QL learns a conservative value-function initialization from offline data that underestimates the value of the learned policy while also being calibrated, in the sense that the learned Q-values are at a reasonable scale. This leads to effective online fine-tuning, retaining the benefits of the offline initialization.
Pre-Training for Robots: Offline RL Enables Learning from a Handful of Trials
Aviral Kumar*, Anikait Singh*, Frederik Ebert*, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, Sergey Levine
RSS, 2023
PTR is a framework based on offline RL that learns new tasks effectively by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task, using as few as 10 demonstrations.
When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?
Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine
ICLR, 2022
A theoretical paper that characterizes the properties of environments in which offline RL methods can perform better than BC methods, even when provided only with expert data. Additionally, offline RL policies trained on sufficiently noisy suboptimal data can outperform BC algorithms trained on expert data, especially on long-horizon problems.
A Workflow for Offline Model-Free Robotic Reinforcement Learning
Aviral Kumar*, Anikait Singh*, Stephen Tian, Chelsea Finn, Sergey Levine
CoRL, 2021 (Oral Presentation)
Our proposed workflow aims to detect overfitting and underfitting in model-free offline RL,
and provides guidelines for addressing these issues via policy selection, regularization, and architecture design.
A Mobile Application for Keyword Search in Real-World Scenes
Shrinivas Pundlik, Anikait Singh, Gautam Baghel, Vilte Baliutaviciute, Gang Luo
IEEE Journal of Translational Engineering in Health and Medicine, 2019
A system that helps visually impaired patients locate words in cluttered environments. The system combines OCR with Levenshtein-distance matching, along with specialized audio cues and additional assistive features, to enable efficient and intuitive search in crowded, diverse scenes.