A Unifying Lens on SFT Through Target Distribution Design

Tong Xie¹ · Yuanhao Ban¹,² · Yunqi Hong¹ · Sohyun An¹ · Yihang Chen¹ · Cho-Jui Hsieh¹,²

¹University of California, Los Angeles (UCLA) · ²Arena

Overview

Supervised fine-tuning (SFT) is usually studied through its loss function, but each loss implicitly defines a target distribution that the model is pushed toward. We make this view explicit with the Q-target framework, which unifies existing SFT variants under two design choices, and propose Target-SFT to shape both choices adaptively.

1 Framework. We view the SFT target as a mixture $Q_t = \gamma_t \delta_{y_t} + (1-\gamma_t)\tilde\pi_t$, which relaxes imitation and allows alternatives when the label is uncertain.
2 Unify. Many existing SFT variants represent implicit choices of $\gamma_t$ and $\tilde\pi_t$.
3 Target-SFT. We design both branches explicitly: $\gamma_t$ via model probability, and $\tilde\pi_t$ via a teacher-guided distribution. It achieves the best result in all 10 SFT settings evaluated.

Motivation

Loss is a surrogate. What target does it lead to?

Invert the cross-entropy gradient to find out: $Q_k = p_k - g_k$.

Illustration: dataset token → loss gradients → induced target distribution

At each prefix $x_t$, the model outputs logits $z$ and a distribution $p = \text{softmax}(z)$. Now suppose there exists some target distribution $Q$ over the vocabulary that we want the model to match. If we train with cross-entropy towards $Q$, then the gradient with respect to logit $z_k$ is

$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z_k} = p_k - Q_k$$

Then given any differentiable loss $\mathcal{L}$ with logit gradient $g_k = \partial \mathcal{L}/\partial z_k$, we can invert this relationship to find the induced target distribution under this update rule:

$$Q_k := p_k - g_k$$

This reveals what each loss is actually teaching the model, via the probability updates it defines.
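
The inversion can be checked numerically. The sketch below is only an illustration, not the authors' code: the helper names (`logit_gradient`, `induced_target`, `cross_entropy_to`) and the example logits are assumptions, and the logit gradient is estimated by finite differences rather than autodiff. It recovers a known target from a cross-entropy loss, and the one-hot target from the standard SFT loss.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logit_gradient(loss_fn, z, eps=1e-5):
    """Central finite-difference estimate of d loss / d z_k for every logit k."""
    grad = []
    for k in range(len(z)):
        hi = list(z)
        lo = list(z)
        hi[k] += eps
        lo[k] -= eps
        grad.append((loss_fn(hi) - loss_fn(lo)) / (2 * eps))
    return grad

def induced_target(loss_fn, z):
    """Q_k = p_k - g_k, with p = softmax(z) and g the logit gradient."""
    p = softmax(z)
    g = logit_gradient(loss_fn, z)
    return [pk - gk for pk, gk in zip(p, g)]

def cross_entropy_to(q):
    """Cross-entropy loss towards a fixed target distribution q."""
    return lambda z: -sum(qk * math.log(pk) for qk, pk in zip(q, softmax(z)))

z = [2.0, 0.5, -1.0, 0.0]   # illustrative logits (assumed values)
q = [0.1, 0.6, 0.2, 0.1]    # a known target distribution

# Inverting the gradient of CE-towards-q gives back q itself.
recovered = induced_target(cross_entropy_to(q), z)

# The standard SFT loss -log p_y (here y = 1) induces a one-hot target.
sft = induced_target(lambda logits: -math.log(softmax(logits)[1]), z)
```

Because the logit gradient of any loss computed from softmax probabilities sums to zero, the recovered $Q$ sums to one, although nothing forces its entries to be non-negative for an arbitrary loss.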

Beyond the choice of loss, SFT is a choice of target distribution.

Loss is the mechanism; the target is what the model actually learns.

Example

From Loss to Target

Standard SFT learns a one-hot target $\delta_{y_t}$; p-loss interpolates from $p$ to $\delta_{y_t}$.

We illustrate this with two losses: standard SFT and p-loss (a token-level variant that scales SFT loss by the model's current probability $p_y$). Applying $Q_k = p_k - g_k$ to each derives the target.

Step-by-step derivation Loss → Logit gradient → Induced target

Standard SFT

Loss

$\mathcal{L}_\text{SFT} = -\log \pi_\theta(y_t \mid x_t)$

Logit gradient

$g_k = \begin{cases} p_{y_t} - 1 & k = y_t \\ p_k & k \neq y_t \end{cases}$

Apply $Q_k = p_k - g_k$

Induced target

$Q_\text{SFT}(k) = \delta_{y_t} = \begin{cases} 1 & k = y_t \\ 0 & k \neq y_t \end{cases}$

Places all probability mass on the observed token $y_t$.

p-loss

Loss

$\mathcal{L}_\text{p-loss} = -\mathrm{sg}(p_y)\log\pi_\theta(y_t \mid x_t), \quad p_y=\pi_\theta(y_t \mid x_t)$

Logit gradient

$g_k = \begin{cases} -p_y(1 - p_y) & k = y_t \\ p_y\,p_k & k \neq y_t \end{cases}$

Apply $Q_k = p_k - g_k$

Induced target

$Q_\text{p-loss}(k) = \begin{cases} 2p_y - p_y^2 & k = y_t \\ (1-p_y)\,p_k & k \neq y_t \end{cases}$

Target depends on $p_y$: stays close to current $p$ when uncertain, approaches $\delta_{y_t}$ only as $p_y \to 1$.
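
As a concrete instance of the two induced targets, the sketch below evaluates the closed forms above on a four-token vocabulary. The probabilities are made-up illustrative values and the function names are assumptions, not part of the original page.

```python
def sft_target(p, y):
    """One-hot target induced by standard SFT."""
    return [1.0 if k == y else 0.0 for k in range(len(p))]

def p_loss_target(p, y):
    """Target induced by p-loss: 2*p_y - p_y^2 on y, (1 - p_y) * p_k elsewhere."""
    p_y = p[y]
    return [2 * p_y - p_y ** 2 if k == y else (1 - p_y) * p[k]
            for k in range(len(p))]

y = 0  # index of the observed token

# Moderately uncertain model: p_y = 0.30.
p_mid = [0.30, 0.40, 0.20, 0.10]
q_mid = p_loss_target(p_mid, y)   # approximately [0.51, 0.28, 0.14, 0.07]

# Very uncertain model: the target stays close to the current p.
p_low = [0.001, 0.5, 0.3, 0.199]
q_low = p_loss_target(p_low, y)

# Very confident model: the target is close to the one-hot SFT target.
p_high = [0.999, 0.0005, 0.0003, 0.0002]
q_high = p_loss_target(p_high, y)
```

At $p_y = 0.30$ the observed token's target mass is $0.51$, about halfway between the model's current $0.30$ and the one-hot value $1$ that standard SFT would demand.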

Figure (interactive in the original page): the effect of each loss function on the observed token $y_t$ (solid) and on a non-observed token $k \ne y_t$ (dashed). The x-axis is the current model probability on the token of interest: $p_y$ for the observed token, $p_k$ for other tokens. A slider sets $p_y$ from low to high confidence (default $p_y = 0.30$). The plotted gradient curves use the sign-flipped logit gradient $-g_k$, i.e. the direction in which each logit is pushed.

Method	Token	Update direction $-\partial\mathcal{L}/\partial z_k$	Induced Q-target
SFT	Observed $y_t$	$1-p_y$	$1$
SFT	Other tokens $k$	$-p_k$	$0$
p-loss	Observed $y_t$	$p_y(1-p_y)$	$2p_y - p_y^2$
p-loss	Other tokens $k$	$-p_y\,p_k$	$(1-p_y)\,p_k$

SFT

Constant updates

Gradient pulls toward $y_t$ and suppresses all $k \neq y_t$ with no confidence-dependent reweighting: the target is the same no matter how certain the model already is.

Q-target Always $\delta_{y_t}$ (full probability mass on the observed token)

p-loss

Confidence-scaled updates

Gradient scales with $p_y$: near-zero when uncertain ($p_y \approx 0$), approaches standard SFT strength as $p_y \to 1$.

Q-target $p_y \to 0$: $Q \to p$ (no change); $p_y \to 1$: $Q \to \delta_{y_t}$ (SFT)

Standard SFT's implicit assumption: every observed token is ideal and uniquely correct, regardless of noise, ambiguity, or alignment with the model's existing knowledge.

Framework

The Q-Target View

Relax the one-hot target $\delta_{y_t}$ to account for label uncertainty.

But an observed token may be noisy, non-unique, or misaligned with the model. Instead of forcing a rigid one-hot target, we explicitly model this uncertainty. We replace $\delta_{y_t}$ with a mixture distribution $Q_t$ controlled by two design choices:

$$Q_t = \gamma_t \,\delta_{y_t} + (1 - \gamma_t)\,\tilde{\pi}_t$$

Component	Design Question	Effect
$\gamma_t \in [0,1]$	How much should we trust the observed token $y_t$?	Controls imitation strength
$\tilde{\pi}_t \in \Delta^{|\mathcal{V}|}$	Where should the remaining $(1-\gamma_t)$ mass go?	Shapes alternative supervision

The training objective under $Q_t$ decomposes cleanly:

$$\mathcal{L}_Q(\theta) = {\color{#d97706}\gamma_t} \underbrace{\text{CE}(\delta_{y_t},\, \pi_\theta)}_{\text{imitate } y_t} + {\color{#d97706}(1-\gamma_t)} \underbrace{\text{CE}({\color{#2563eb}\tilde\pi_t},\, \pi_\theta)}_{\text{match alternatives}}$$

This view shows that SFT training balances two forces: imitation of the label and matching of a residual distribution. Standard SFT simply sets $\gamma_t = 1$, collapsing the second term entirely.
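
The decomposition follows from cross-entropy being linear in its target argument. A minimal numeric check, with made-up probabilities and hypothetical helper names (`cross_entropy`, `mixture_target`), might look like this:

```python
import math

def cross_entropy(target, pred):
    """CE(target, pred) = -sum_k target_k * log(pred_k); zero-mass terms are skipped."""
    return -sum(t * math.log(q) for t, q in zip(target, pred) if t > 0)

def mixture_target(gamma, y, residual):
    """Q = gamma * one_hot(y) + (1 - gamma) * residual."""
    return [gamma * (1.0 if k == y else 0.0) + (1 - gamma) * r
            for k, r in enumerate(residual)]

pi_theta = [0.30, 0.40, 0.20, 0.10]   # current model distribution (assumed values)
residual = [0.25, 0.50, 0.15, 0.10]   # some residual distribution (assumed values)
y, gamma = 0, 0.7                     # observed token index and label trust

q = mixture_target(gamma, y, residual)
one_hot = mixture_target(1.0, y, residual)   # gamma = 1 recovers the SFT target

# Loss against the mixture target ...
lhs = cross_entropy(q, pi_theta)
# ... equals the gamma-weighted sum of the imitation and residual terms.
rhs = (gamma * cross_entropy(one_hot, pi_theta)
       + (1 - gamma) * cross_entropy(residual, pi_theta))
```

Setting `gamma = 1.0` reduces `lhs` to $-\log \pi_\theta(y_t \mid x_t)$, the standard SFT loss.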

Unifying Lens

SFT variants are implicit Q-target designs

Seemingly different losses vary only in the choice of $\gamma_t$ and $\tilde\pi_t$.


Method	Category	$\gamma_t$	$\tilde\pi_t$
Standard SFT	One-hot	$1$	n/a
DFT	Label Trust	$p_t$	$\pi_\theta(\cdot\mid x_t)$
Beyond-log	Label Trust	$p_t^\alpha$	$\pi_\theta(\cdot\mid x_t)$
ProFiT	Label Trust	$m_t = \mathbf{1}\{p_t > \tau\}$	$\pi_\theta(\cdot\mid x_t)$
EAFT	Label Trust	$\tilde{H}_t = H(\pi_{\theta,t}^{(k)})/\log k$	$\pi_\theta(\cdot\mid x_t)$
iw-SFT	Label Trust	$w(\tau) = q(\tau)/\pi_\text{ref}(\tau)$	$\pi_\theta(\cdot\mid x_t)$
CFT	Label Trust	$c_t = \mathbf{1}\{y_t \text{ critical}\}$	$\pi_\theta(\cdot\mid x_t)$
Label Smoothing	Residual Dist.	$1 - \lambda$	$\text{Unif}(\mathcal{V})$
SFT + KL	Residual Dist.	$\frac{1}{1+\lambda}$	$\pi_\text{ref}(\cdot\mid x_t)$
ASFT	Residual Dist.	$\frac{p_t}{p_t+\lambda}$	$\pi_\text{base}(\cdot\mid x_t)$
Proximal SFT	Residual Dist.	clipping-dependent	$\pi_\text{old}(\cdot\mid x_t)$
GEM	Residual Dist.	$\gamma_t^y = 1,\;\gamma_t^- = 1$	$\tilde\pi_t^+ = \pi_\theta(\cdot\mid x_t),\;\tilde\pi_t^- = \pi_\theta^{(\beta)}(\cdot\mid x_t)$
Knowledge Distillation	Residual Dist.	$0$	$\pi_T(\cdot\mid x_t)$
Distillation (Hybrid)	Residual Dist.	$1 - \lambda$	$\pi_T(\cdot\mid x_t)$
Target-SFT	Both	$p_t$	$\tilde\pi_t^\text{guided} \propto \pi_\theta(\cdot\mid x_t)^{1-\eta}\,\pi_T(\cdot\mid x_t)^\eta$

Method details. Throughout, $p_t = \pi_\theta(y_t \mid x_t)$ and $\text{sg}[\cdot]$ denotes stop-gradient.

Standard SFT. Objective: $$\ell_t^\text{SFT} = -\log p_t$$ Motivation: Maximize likelihood of every observed token.

DFT. Objective: $$\ell_t^\text{DFT} = -\text{sg}[p_t]\log p_t$$ Motivation: Use probability weighting to connect SFT with an RL-style objective.

Beyond-log. Objective: $$\ell_t^f = f(p_t), \quad \ell_t^\alpha = \frac{1-p_t^\alpha}{\alpha}$$ Motivation: Use probability-dependent objectives to balance learning across model capacities.

ProFiT. Objective: $$m_t = \mathbf{1}[\text{sg}(p_t) > \tau], \quad \ell_t^\text{ProFiT} = -m_t\log p_t$$ Motivation: Use probability to identify and train on core tokens.

EAFT. Objective: $$\tilde{H}_t = \text{sg}\!\left[\frac{H(\pi_{\theta,t}^{(k)})}{\log k}\right], \quad \ell_t^\text{EAFT} = -\tilde{H}_t\log p_t$$ Motivation: Use entropy to weight uncertain or knowledge-conflicting tokens.

iw-SFT. Objective: $$w(\tau) = \frac{q(\tau)}{\pi_\text{ref}(\tau)}\;(w\text{ trajectory-level}), \quad \ell_t^\text{iw} = -w(\tau)\log p_t$$ Motivation: Use an auxiliary distribution to assign trajectory-level weights.

CFT. Objective: $$c_t = \mathbf{1}\!\left[\forall\tilde{y}_t \in \mathcal{A}_t,\;\text{Correct}(y_{<t},\tilde{y}_t,y_{>t})=0\right]$$ $$\ell_t^\text{CFT} = -c_t\log p_t$$ Motivation: Update only causally critical / irreplaceable tokens.

Label Smoothing. Objective: $$\ell_t^\text{LS} = -\!\left[(1-\lambda)\log p_t + \frac{\lambda}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\log p_{\theta,t}(v)\right]$$ Motivation: Regularize overconfident predictions for better calibration.

SFT + KL. Objective: $$\ell_t^\text{KL} = -\log p_t + \lambda\,\text{KL}(\pi_\text{ref}(\cdot\mid x_t)\,\|\,\pi_\theta(\cdot\mid x_t))$$ Motivation: Constrain updates with a reference model to limit drift.

ASFT. Objective: $$\ell_t^\text{ASFT} = \ell_t^\text{DFT} + \lambda\,\text{KL}(\pi_\text{base}(\cdot\mid x_t)\,\|\,\pi_\theta(\cdot\mid x_t))$$ Motivation: Constrain updates in DFT to prevent distributional drift.

Proximal SFT. Objective: $$r_t = \frac{p_t}{\pi_\text{old}(y_t\mid x_t)}, \quad \ell_t^\text{PSFT} = -\min\!\left(r_t,\,\text{clip}(r_t,1-\epsilon,1+\epsilon)\right)$$ Motivation: Clip the ratio to enforce updates within a trust region.

GEM. Objective: $$q_t(v) = \frac{\text{sg}[\pi_{\theta,t}(v)]^{1/\beta}}{\sum_{u\in\mathcal{V}}\text{sg}[\pi_{\theta,t}(u)]^{1/\beta}}$$ $$\ell_t^\text{GEM} = \text{CE}(\delta_{y_t},\pi_{\theta,t}) - \text{CE}(q_t,\pi_{\theta,t})$$ Motivation: Control probability transfer from alternatives to the observed token to preserve diversity.

Knowledge Distillation. Objective: $$\ell_t^\text{KD} = -\sum_{v\in\mathcal{V}}\pi_T(v\mid x_t)\log\pi_S(v\mid x_t)$$ Motivation: Use the teacher logit distribution as a soft target.

Distillation (Hybrid). Objective: $$\ell_t^\text{KD-H} = (1-\lambda)[-\log\pi_S(y_t\mid x_t)] + \lambda\,D_\text{KL}(\pi_T(\cdot\mid x_t)\,\|\,\pi_S(\cdot\mid x_t))$$ Motivation: Combine hard-label imitation with the teacher logit distribution for enriched soft supervision.

Target-SFT. Objective: $$Q_t^\text{TARGET} = p_t\,\delta_{y_t} + (1-p_t)\,\tilde{\pi}_t^\text{guided}$$ $$\ell_t^\text{TARGET} = \text{CE}(Q_t^\text{TARGET},\, \pi_\theta(\cdot\mid x_t))$$ Motivation: Adaptively balance label imitation with teacher-guided fallback. Teacher influence scales with model uncertainty $1-p_t$.
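
To see how one table entry can be recovered from its objective, consider SFT + KL. The logit gradient of $-\log p_t$ is $p - \delta_{y_t}$ and that of $\lambda\,\text{KL}(\pi_\text{ref}\,\|\,\pi_\theta)$ is $\lambda(p - \pi_\text{ref})$, so the total is $(1+\lambda)\,(p - Q)$ with $Q = \frac{1}{1+\lambda}\delta_{y_t} + \frac{\lambda}{1+\lambda}\pi_\text{ref}$. The sketch below checks this identity numerically. Reading the tabulated $\gamma_t$ as the result of dividing out the overall scale $1+\lambda$ is an assumption made here, not something the page states; the function names and numbers are likewise illustrative.

```python
def sft_kl_logit_gradient(p, y, pi_ref, lam):
    """Analytic logit gradient of -log p_y + lam * KL(pi_ref || p)."""
    return [(pk - (1.0 if k == y else 0.0)) + lam * (pk - rk)
            for k, (pk, rk) in enumerate(zip(p, pi_ref))]

def q_target(gamma, y, residual):
    """Q = gamma * one_hot(y) + (1 - gamma) * residual."""
    return [gamma * (1.0 if k == y else 0.0) + (1 - gamma) * r
            for k, r in enumerate(residual)]

p = [0.30, 0.40, 0.20, 0.10]        # current model distribution (assumed values)
pi_ref = [0.20, 0.50, 0.20, 0.10]   # reference model distribution (assumed values)
y, lam = 0, 0.5

gamma = 1.0 / (1.0 + lam)           # the table's gamma_t for SFT + KL
q = q_target(gamma, y, pi_ref)      # the table's residual is pi_ref

g = sft_kl_logit_gradient(p, y, pi_ref, lam)
# Gradient of cross-entropy towards q, rescaled by the total weight (1 + lam).
rescaled = [(1.0 + lam) * (pk - qk) for pk, qk in zip(p, q)]
```

The same bookkeeping applied to ASFT, whose gradient is $p_t(p - \delta_{y_t}) + \lambda(p - \pi_\text{base})$, gives the overall scale $p_t + \lambda$ and hence $\gamma_t = p_t/(p_t+\lambda)$.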

Target-SFT

Design both branches of Q-target

Define a proxy for $\gamma_t$ and adaptively use teacher guidance in the alternatives $\tilde\pi_t$.

1 Label Trust $\gamma_t$

Model probability $\pi_\theta(y_t \mid x_t)$ naturally encodes the support for $y_t$ among all plausible continuations, based on statistical evidence from pretraining. We use this as a proxy for label reliability:

$$\gamma_t \;=\; \pi_\theta(y_t \mid x_t) \;=\; p_y$$

2 Residual Distribution $\tilde{\pi}_t$

To preserve the model prior while allowing external supervision, a teacher distribution $\pi_T$ provides reward-style signals that reshape $\pi_\theta(\cdot\mid x_t)$. This yields a teacher-guided distribution with a closed form:

$$\tilde{\pi}_t^\text{guided}(a) \;\propto\; \pi_\theta(a)^{1-\eta}\,\pi_T(a)^{\eta}$$
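
The guided distribution is a normalized geometric mixture of the model and teacher distributions. A minimal sketch, computed in log space for stability and assuming both distributions are strictly positive (the function name and the example values are illustrative, not from the source):

```python
import math

def guided_distribution(pi_theta, pi_teacher, eta):
    """Normalized pi_theta^(1 - eta) * pi_teacher^eta over the vocabulary."""
    log_w = [(1 - eta) * math.log(a) + eta * math.log(b)
             for a, b in zip(pi_theta, pi_teacher)]
    m = max(log_w)                          # subtract the max before exponentiating
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

pi_theta = [0.10, 0.50, 0.30, 0.10]     # student / current model (assumed values)
pi_teacher = [0.40, 0.40, 0.10, 0.10]   # teacher (assumed values)

student_only = guided_distribution(pi_theta, pi_teacher, eta=0.0)   # equals pi_theta
teacher_only = guided_distribution(pi_theta, pi_teacher, eta=1.0)   # equals pi_teacher
blended = guided_distribution(pi_theta, pi_teacher, eta=0.5)
```

With `eta=0.5` each unnormalized weight is the geometric mean $\sqrt{\pi_\theta(a)\,\pi_T(a)}$, so a token keeps substantial mass only if both the model and the teacher support it.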

Target-SFT: putting it together

$$Q_t^\text{TARGET} \;=\; p_y\,\delta_{y_t} \;+\; (1-p_y)\,\tilde{\pi}_t^\text{guided}$$

A trusted token (high $p_y$) receives strong supervision: its target approaches standard SFT's one-hot $\delta_{y_t}$. An uncertain token (low $p_y$) shifts weight to the teacher-guided alternatives, so its target approaches $\tilde{\pi}_t^\text{guided}$. We train with cross-entropy loss to match $Q_t^\text{TARGET}$.
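
A minimal per-token sketch of this target and its loss is given below. It is not the authors' implementation: the guided distribution is passed in as a fixed, made-up vector, and the target is assumed to be treated as a constant during the update (a stop-gradient), mirroring the $\text{sg}$ used in p-loss; the source does not spell this out.

```python
import math

def target_sft_target(pi_theta, y, guided):
    """Q = p_y * one_hot(y) + (1 - p_y) * guided, with p_y = pi_theta[y]."""
    p_y = pi_theta[y]
    return [p_y * (1.0 if k == y else 0.0) + (1 - p_y) * g
            for k, g in enumerate(guided)]

def target_sft_loss(pi_theta, y, guided):
    """Cross-entropy CE(Q, pi_theta); Q is assumed constant w.r.t. the parameters."""
    q = target_sft_target(pi_theta, y, guided)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, pi_theta) if qk > 0)

guided = [0.20, 0.60, 0.10, 0.10]   # hypothetical teacher-guided distribution
y = 0                               # index of the observed token

# Confident model: the target is close to the one-hot label.
confident = [0.90, 0.05, 0.03, 0.02]
q_conf = target_sft_target(confident, y, guided)   # mass on y: 0.9 + 0.1 * 0.2

# Uncertain model: most target mass follows the guided alternatives.
uncertain = [0.10, 0.50, 0.30, 0.10]
q_unc = target_sft_target(uncertain, y, guided)    # mass on y: 0.1 + 0.9 * 0.2

loss_unc = target_sft_loss(uncertain, y, guided)
```

In the uncertain case the observed token receives target mass $0.28$ while the guided distribution's preferred alternative receives $0.54$, which is the adaptive fallback described above.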

Results

Target-SFT improves reasoning performance across 10 dataset-model settings on math and medical benchmarks. While the baselines fluctuate, Target-SFT consistently achieves the best Average@16 results in every setting.

Performance summary: Average@16 accuracy across all 10 dataset-model settings

BibTeX

@article{xie2026targetsft,
  title     = {A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design},
  author    = {Tong Xie and Yuanhao Ban and Yunqi Hong and Sohyun An and Yihang Chen and Cho-Jui Hsieh},
  journal   = {arXiv},
  year      = {2026}
}