Tong Xie

Los Angeles, CA

tongxie@ucla.edu

hi there! 👋

I am a first-year PhD student in Computer Science at UCLA, fortunate to be advised by Prof. Cho-Jui Hsieh.

I am broadly interested in Post-training of Large Language Models. Currently, I am working on LLM supervised fine-tuning (SFT), reinforcement learning (RL), and reward modeling, to improve reasoning capabilities and encourage stronger generalization.

Feel free to connect with me and explore opportunities, collaborations, and exciting ventures together!

news

Jun 09, 2026	Our new work *A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design* is now available on arXiv. Check out our project page!
May 07, 2026	Our work *When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models* is accepted to `ICML 2026`!
Jun 03, 2024	I am excited to intern with `QSG`, RBC’s buyside quant group, for summer 2024.
Jun 01, 2024	Our work *Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation* is accepted to `TMLR 2024`!
Jun 26, 2023	I am excited to be part of the Summer Undergraduate Research Program (SURP) for summer 2023. Check out our poster.

selected publications

Preprint

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, and Cho-Jui Hsieh

2026

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT, which constructs the training objective directly from the desired target distribution. This method consistently outperforms across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
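To make the two choices concrete, here is a minimal sketch of building a token-level target Q and scoring a model against it. This is an illustration only, not the paper's Target-SFT: the function names, the `alpha` parameter, and both allocation rules (uniform, as in standard label smoothing, and proportional to a model prior) are assumptions chosen for the example.

```python
import numpy as np

def smoothed_target(observed, vocab_size, alpha):
    """Hypothetical Q: mass alpha on the observed token, the rest spread uniformly."""
    q = np.full(vocab_size, (1.0 - alpha) / (vocab_size - 1))
    q[observed] = alpha
    return q

def prior_weighted_target(observed, prior, alpha):
    """Hypothetical Q: mass alpha on the observed token, the rest allocated
    in proportion to the prior probabilities of the alternative tokens."""
    rest = np.array(prior, dtype=float)
    rest[observed] = 0.0
    q = (1.0 - alpha) * rest / rest.sum()
    q[observed] = alpha
    return q

def cross_entropy(q, log_p):
    """Cross-entropy between a target q and model log-probabilities log_p."""
    return -np.sum(q * log_p)

# Toy vocabulary of 5 tokens; token 2 is the observed (demonstrated) token.
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
log_p = np.log(prior)

one_hot = smoothed_target(2, 5, alpha=1.0)       # standard SFT target
uniform = smoothed_target(2, 5, alpha=0.8)       # label-smoothing-style target
weighted = prior_weighted_target(2, prior, 0.8)  # prior-aware allocation

for name, q in [("one-hot", one_hot), ("uniform", uniform), ("prior", weighted)]:
    print(name, q, cross_entropy(q, log_p))
```

Setting `alpha=1.0` recovers the usual one-hot target, so the standard SFT loss is one point in this space of targets.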
ICML 2026

When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, and Cho-Jui Hsieh

International Conference on Machine Learning (ICML), 2026

Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of a chosen and a rejected response. In this work, we analyze the per-sample gradient of BT-loss and show that its norm scales with two distinct components: (1) the difference in predicted rewards between chosen and rejected responses, which reflects the prediction error, and critically, (2) the representation distance between the pair, measured in the output space of the final layer. While the first term captures the intended training signal, we show that the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pairwise normalization scheme that balances representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in integration to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5 percent on the Reasoning category of RewardBench, which contains numerous small-distance pairs. This work reveals a key limitation in the widely used BT objective and provides a simple, effective correction.
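The two-factor structure of the gradient can be checked on a toy case. The sketch below assumes a linear reward head r(h) = w·h on final-layer features h, which is a simplifying assumption made for this example. Under that assumption, the gradient of -log σ(r_c - r_r) with respect to w has norm (1 - σ(r_c - r_r)) · ‖h_c - h_r‖. The function name and the toy vectors are hypothetical, and the code does not implement NormBT.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_grad_norm(w, h_chosen, h_rejected):
    """Gradient norm of the BT loss w.r.t. a linear reward head w (assumed setup).

    Returns (gradient norm, prediction-error factor, representation distance).
    """
    diff = h_chosen - h_rejected
    delta = w @ diff                 # reward gap r_chosen - r_rejected
    error = 1.0 - sigmoid(delta)     # large when the pair is misranked
    dist = np.linalg.norm(diff)      # representation distance of the pair
    return error * dist, error, dist

w = np.array([1.0, 0.0])
h_chosen = np.zeros(2)

# Both pairs are misranked by the same reward gap (-0.5); only the
# component orthogonal to w, and hence the distance, differs.
near = bt_grad_norm(w, h_chosen, np.array([0.5, 0.1]))
far = bt_grad_norm(w, h_chosen, np.array([0.5, 10.0]))

print("near pair: norm, error, dist =", near)
print("far pair:  norm, error, dist =", far)
```

The two pairs share the same prediction error, so any difference in their gradient norms comes entirely from the representation distance.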
TMLR 2024

Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation

Tong Xie, Haoyu Li, Andrew Bai, and Cho-Jui Hsieh

Transactions on Machine Learning Research (TMLR), 2024

Data attribution methods trace model behavior back to its training dataset, offering an effective approach to better understand "black-box" neural networks. While prior research established quantifiable links between model output and training data in diverse settings, interpreting diffusion model outputs in relation to training samples remains underexplored. In particular, diffusion models operate over a sequence of timesteps instead of instantaneous input-output relationships in previous contexts, posing a significant challenge to extending existing frameworks to diffusion models directly. Notably, we present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples’ loss gradient norms are highly dependent on timestep. This trend leads to a prominent bias in influence estimation, and is particularly severe for samples trained on large-norm-inducing timesteps, causing them to be generally influential. To mitigate this bias, we introduce Diffusion-ReTrac as a re-normalized adaptation that retrieves training samples targeted to the test sample of interest, enabling a localized measurement of influence and considerably more intuitive visualization. We demonstrate the efficacy of our approach through various evaluation metrics and auxiliary tasks, outperforming in terms of specificity of attribution by over 60%.

Preprint

Does Few-Shot Learning Help LLM Performance in Code Synthesis?

Derek Xu, Tong Xie, Botao Xia, Haoyu Li, Yunsheng Bai, Yizhou Sun, and Wei Wang

2024

Large language models (LLMs) have made significant strides in code generation through improved model design, training, and chain-of-thought. However, prompt-level optimizations remain an important yet under-explored aspect of LLMs for coding. This work focuses on the few-shot examples present in most code generation prompts, offering a systematic study of whether few-shot examples improve LLMs’ coding capabilities, which few-shot examples have the largest impact, and how to select impactful examples. Our work offers two approaches for selecting few-shot examples: a model-free method, CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The two methods offer a trade-off between improved performance and reliance on training data and interpretability. Both methods significantly improve CodeLlama’s coding ability on the popular HumanEval+ coding benchmark. In summary, our work provides valuable insights into how to pick few-shot examples in code generation prompts to improve LLM code generation capabilities.