A Unifying Lens on SFT Through Target Distribution Design

Tong Xie¹  ·  Yuanhao Ban¹,²  ·  Yunqi Hong¹  ·  Sohyun An¹  ·  Yihang Chen¹  ·  Cho-Jui Hsieh¹,²

¹University of California, Los Angeles (UCLA)  ·  ²Arena

Supervised fine-tuning (SFT) is usually studied through its loss function, but each loss implicitly defines a target distribution that the model is pushed toward. We make this view explicit with the Q-target framework, which unifies existing SFT variants under two key design choices, and propose Target-SFT to adaptively shape both choices.

Loss is a surrogate. What target does it lead to?

Invert the cross-entropy gradient to find out: $Q_k = p_k - g_k$.

Illustration: dataset token → loss gradients → induced target distribution

At each prefix $x_t$, the model outputs logits $z$ and a distribution $p = \text{softmax}(z)$. Now suppose there is a target distribution $Q$ over the vocabulary that we want the model to match. If we train with cross-entropy toward $Q$, the gradient with respect to logit $z_k$ is

$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z_k} = p_k - Q_k$$

Given any differentiable loss $\mathcal{L}$ with logit gradient $g_k = \partial \mathcal{L}/\partial z_k$, we can invert this relationship to find the target distribution that the loss induces under the same update rule:

$$Q_k := p_k - g_k$$

This reveals what each loss is actually teaching the model, via the probability updates it defines.
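
The inversion can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it estimates the logit gradient by central finite differences (an assumption made to keep the example dependency-free) and applies $Q_k = p_k - g_k$. The helper names `logit_grad` and `induced_target` and the example logits are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logit_grad(loss_fn, z, eps=1e-5):
    """Central finite-difference estimate of dL/dz_k for every logit."""
    g = []
    for k in range(len(z)):
        hi = z[:k] + [z[k] + eps] + z[k + 1:]
        lo = z[:k] + [z[k] - eps] + z[k + 1:]
        g.append((loss_fn(hi) - loss_fn(lo)) / (2 * eps))
    return g

def induced_target(loss_fn, z):
    """Apply Q_k = p_k - g_k to a loss defined on the logits z."""
    p = softmax(z)
    g = logit_grad(loss_fn, z)
    return [pk - gk for pk, gk in zip(p, g)]

# Illustrative logits and observed token index (not from the paper).
z = [0.5, -1.0, 0.2, 1.5]
y = 2

# Negative log-likelihood of the observed token.
nll = lambda logits: -math.log(softmax(logits)[y])

Q = induced_target(nll, z)
print(Q)  # numerically close to a one-hot vector on index y
```

For the negative log-likelihood, the recovered $Q$ is one-hot on the observed token up to finite-difference error, which matches the cross-entropy gradient $p_k - Q_k$ with $Q = \delta_{y_t}$.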

Beyond the choice of loss, SFT is a choice of target distribution.
Loss is the mechanism; the target is what the model actually learns.

From Loss to Target

Standard SFT learns a one-hot target $\delta_{y_t}$; p-loss interpolates from $p$ to $\delta_{y_t}$.

We illustrate this with two losses: standard SFT with negative log-likelihood, and p-loss, a token-level variant scaled by the model's current probability $p_y$. Applying $Q_k = p_k - g_k$ to each derives the target.

Step-by-step derivation: Loss → Logit gradient → Induced target
Standard SFT
Loss
$\mathcal{L}_\text{SFT} = -\log \pi_\theta(y_t \mid x_t)$
Logit gradient
$g_k = \begin{cases} p_{y_t} - 1 & k = y_t \\ p_k & k \neq y_t \end{cases}$
Apply $Q_k = p_k - g_k$
Induced target
$Q_\text{SFT}(k) = \delta_{y_t} = \begin{cases} 1 & k = y_t \\ 0 & k \neq y_t \end{cases}$

Places all probability mass on the observed token $y_t$.

p-loss
Loss
$\mathcal{L}_\text{p-loss} = -\mathrm{sg}(p_y)\,\log\pi_\theta(y_t \mid x_t),\quad p_y=\pi_\theta(y_t \mid x_t)$, where $\mathrm{sg}(\cdot)$ denotes stop-gradient
Logit gradient
$g_k = \begin{cases} -p_y(1 - p_y) & k = y_t \\ p_y\,p_k & k \neq y_t \end{cases}$
Apply $Q_k = p_k - g_k$
Induced target
$Q_\text{p-loss}(k) = \begin{cases} 2p_y - p_y^2 & k = y_t \\ (1-p_y)\,p_k & k \neq y_t \end{cases}$

Target depends on $p_y$: stays close to current $p$ when uncertain, approaches $\delta_{y_t}$ only as $p_y \to 1$.
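
As a quick consistency check (a worked step added here, using only the formulas above), the induced p-loss target is a valid probability distribution. Since $\sum_{k \neq y_t} p_k = 1 - p_y$:

$$\sum_k Q_\text{p-loss}(k) = (2p_y - p_y^2) + (1-p_y)\sum_{k \neq y_t} p_k = (2p_y - p_y^2) + (1-p_y)^2 = 1$$

For example, at $p_y = 0.3$ the observed token receives $Q_\text{p-loss}(y_t) = 2(0.3) - 0.3^2 = 0.51$, and every other token keeps $0.7$ of its current probability.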

Interactive figure: update direction and induced Q-target under SFT and p-loss, for the observed token $y_t$ (solid curves) and a non-observed token $k \neq y_t$ (dashed curves), as the model probability $p_y$ on the observed token moves from low to high confidence.

| Quantity | SFT, observed $y_t$ | SFT, other $k$ | p-loss, observed $y_t$ | p-loss, other $k$ |
| --- | --- | --- | --- | --- |
| Update direction $-\partial\mathcal{L}/\partial z_k$ | $1-p_y$ | $-p_k$ | $p_y(1-p_y)$ | $-p_y\,p_k$ |
| Induced Q-target | $1$ | $0$ | $2p_y - p_y^2$ | $(1-p_y)\,p_k$ |

x-axis = current model probability on the token of interest: $p_y$ for the observed token (solid), $p_k$ for other tokens (dashed).

SFT

Constant updates

Gradient pulls toward $y_t$ and suppresses every other token $k$ with the loss weight fixed at 1, no matter how certain the model already is.

Q-target: always $\delta_{y_t}$ (full probability mass on the observed token).

p-loss

Confidence-scaled updates

Gradient scales with $p_y$: near zero when the model assigns the observed token low probability ($p_y \approx 0$), approaching standard SFT strength as $p_y \to 1$.

Q-target: $Q \to p$ (no change) as $p_y \to 0$; $Q \to \delta_{y_t}$ (standard SFT) as $p_y \to 1$.

Standard SFT's implicit assumption: every observed token is ideal and uniquely correct, regardless of noise, ambiguity, or alignment with the model's existing knowledge.

The Q-Target View

Relax the one-hot target $\delta_{y_t}$ to account for label uncertainty.

However, an observed token may be noisy, non-unique, or misaligned with the model. Instead of forcing a rigid one-hot target, we model this uncertainty explicitly: we replace $\delta_{y_t}$ with a mixture distribution $Q_t$ controlled by two design choices:

$$Q_t = \gamma_t \,\delta_{y_t} + (1 - \gamma_t)\,\tilde{\pi}_t$$

| Component | Design question | Effect |
| --- | --- | --- |
| $\gamma_t \in [0,1]$ | How much should we trust the observed token $y_t$? | Controls imitation strength |
| $\tilde{\pi}_t \in \Delta^{\lvert\mathcal{V}\rvert}$ | Where should the remaining $(1-\gamma_t)$ mass go? | Shapes alternative supervision |

The training objective under $Q_t$ decomposes cleanly:

$$\mathcal{L}_Q(\theta) = {\color{#d97706}\gamma_t} \underbrace{\text{CE}(\delta_{y_t},\, \pi_\theta)}_{\text{imitate } y_t} + {\color{#d97706}(1-\gamma_t)} \underbrace{\text{CE}({\color{#2563eb}\tilde\pi_t},\, \pi_\theta)}_{\text{match alternatives}}$$

This view shows that SFT training balances two forces: imitation of the label and matching of a residual distribution. Standard SFT simply sets $\gamma_t = 1$, collapsing the second term entirely.
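
The decomposition follows from cross-entropy being linear in its target argument. A minimal sketch that confirms this numerically is shown below; the probabilities, $\gamma_t$, and the helper names `cross_entropy` and `q_target` are illustrative assumptions, not values or code from the paper.

```python
import math

def cross_entropy(q, p):
    """CE(q, p) = -sum_k q_k * log p_k, skipping zero-mass entries of q."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

def q_target(gamma, y, pi_tilde):
    """Q = gamma * one_hot(y) + (1 - gamma) * pi_tilde."""
    return [gamma * (1.0 if k == y else 0.0) + (1 - gamma) * pt
            for k, pt in enumerate(pi_tilde)]

# Illustrative values (hypothetical).
p = [0.1, 0.6, 0.3]          # model distribution pi_theta
pi_tilde = [0.2, 0.3, 0.5]   # residual distribution
gamma, y = 0.7, 1            # label trust and observed token index

one_hot = [1.0 if k == y else 0.0 for k in range(len(p))]

# Cross-entropy against the full mixture target ...
direct = cross_entropy(q_target(gamma, y, pi_tilde), p)
# ... equals the weighted sum of the imitation and alternative-matching terms.
split = gamma * cross_entropy(one_hot, p) + (1 - gamma) * cross_entropy(pi_tilde, p)

print(direct, split)
```

Setting `gamma = 1.0` removes the second term and leaves the standard SFT loss $-\log \pi_\theta(y_t \mid x_t)$.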

Existing SFT variants are implicit Q-target designs

Seemingly different losses vary only in the choice of $\gamma_t$ and $\tilde\pi_t$.

Summary table columns: Method, Category, $\gamma_t$, $\tilde\pi_t$. The objective and motivation of each method are listed below.

SFT
Objective
$$\ell_t^\text{SFT} = -\log p_t$$
Motivation
Maximize likelihood of every observed token.

DFT
Objective
$$\ell_t^\text{DFT} = -\text{sg}[p_t]\log p_t$$
Motivation
Use probability weighting to connect SFT with an RL-style objective.

$\ell^f$ / $\ell^\alpha$
Objective
$$\ell_t^f = f(p_t), \quad \ell_t^\alpha = \frac{1-p_t^\alpha}{\alpha}$$
Motivation
Use probability-dependent objectives to balance learning across model capacities.

ProFiT
Objective
$$m_t = \mathbf{1}[\text{sg}(p_t) > \tau], \quad \ell_t^\text{ProFiT} = -m_t\log p_t$$
Motivation
Use probability to identify and train on core tokens.

EAFT
Objective
$$\tilde{H}_t = \text{sg}\!\left[\frac{H(\pi_{\theta,t}^{(k)})}{\log k}\right], \quad \ell_t^\text{EAFT} = -\tilde{H}_t\log p_t$$
Motivation
Use entropy to weight uncertain or knowledge-conflicting tokens.

iw
Objective
$$w(\tau) = \frac{q(\tau)}{\pi_\text{ref}(\tau)}\;(w\text{ trajectory-level}), \quad \ell_t^\text{iw} = -w(\tau)\log p_t$$
Motivation
Use an auxiliary distribution to assign trajectory-level weights.

CFT
Objective
$$c_t = \mathbf{1}\!\left[\forall\tilde{y}_t \in \mathcal{A}_t,\;\text{Correct}(y_{<t},\tilde{y}_t,y_{>t})=0\right]$$
$$\ell_t^\text{CFT} = -c_t\log p_t$$
Motivation
Update only causally critical / irreplaceable tokens.

LS
Objective
$$\ell_t^\text{LS} = -\!\left[(1-\lambda)\log p_t + \frac{\lambda}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\log p_{\theta,t}(v)\right]$$
Motivation
Regularize overconfident predictions for better calibration.

KL
Objective
$$\ell_t^\text{KL} = -\log p_t + \lambda\,\text{KL}(\pi_\text{ref}(\cdot\mid x_t)\|\pi_\theta(\cdot\mid x_t))$$
Motivation
Constrain updates with a reference model to limit drift.

ASFT
Objective
$$\ell_t^\text{ASFT} = \ell_t^\text{DFT} + \lambda\,\text{KL}(\pi_\text{base}(\cdot\mid x_t)\|\pi_\theta(\cdot\mid x_t))$$
Motivation
Constrain updates in DFT to prevent distributional drift.

PSFT
Objective
$$r_t = \frac{p_t}{\pi_\text{old}(y_t\mid x_t)}, \quad \ell_t^\text{PSFT} = -\min\!\left(r_t,\,\text{clip}(r_t,1-\epsilon,1+\epsilon)\right)$$
Motivation
Clip ratio to enforce updates within a trust region.

GEM
Objective
$$q_t(v) = \frac{\text{sg}[\pi_{\theta,t}(v)]^{1/\beta}}{\sum_{u\in\mathcal{V}}\text{sg}[\pi_{\theta,t}(u)]^{1/\beta}}$$
$$\ell_t^\text{GEM} = \text{CE}(\delta_{y_t},\pi_{\theta,t}) - \text{CE}(q_t,\pi_{\theta,t})$$
Motivation
Control probability transfer from alternatives to the observed token to preserve diversity.

KD
Objective
$$\ell_t^\text{KD} = -\sum_{v\in\mathcal{V}}\pi_T(v\mid x_t)\log\pi_S(v\mid x_t)$$
Motivation
Use the teacher logit distribution as a soft target.

KD-H
Objective
$$\ell_t^\text{KD-H} = (1-\lambda)[-\log\pi_S(y_t\mid x_t)] + \lambda\,D_\text{KL}(\pi_T(\cdot\mid x_t)\|\pi_S(\cdot\mid x_t))$$
Motivation
Combine hard-label imitation with teacher logit distribution for enriched soft supervision.
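
Several of the objectives above (SFT, DFT, ProFiT, EAFT, CFT) share the form $-w_t \log p_t$ with a weight $w_t$ that carries no gradient. For such a loss the logit gradient is $w_t\,(p - \delta_{y_t})$, so applying $Q_k = p_k - g_k$ gives $Q_t = w_t\,\delta_{y_t} + (1 - w_t)\,p$: a Q-target with $\gamma_t = w_t$ and $\tilde\pi_t$ equal to the current model distribution, valid whenever $w_t \in [0,1]$. The sketch below illustrates this reading; it is derived here from the formulas above rather than quoted from the authors, and the function name, example probabilities, and threshold value are hypothetical.

```python
def weighted_nll_target(p, y, w):
    """Induced target for the loss -w * log p[y], with w held constant.

    The logit gradient is g = w * (p - one_hot(y)), so
    Q = p - g = w * one_hot(y) + (1 - w) * p.
    """
    g = [w * (pk - (1.0 if k == y else 0.0)) for k, pk in enumerate(p)]
    return [pk - gk for pk, gk in zip(p, g)]

# Illustrative model distribution and observed token index (hypothetical).
p = [0.1, 0.3, 0.6]
y = 1

# Standard SFT: w = 1, so the target is one-hot.
sft = weighted_nll_target(p, y, w=1.0)

# DFT: w = p_t, so gamma_t = p_t.
dft = weighted_nll_target(p, y, w=p[y])

# ProFiT: w is a 0/1 mask from a probability threshold (tau chosen arbitrarily here).
tau = 0.5
profit = weighted_nll_target(p, y, w=1.0 if p[y] > tau else 0.0)

print(sft, dft, profit)
```

In this example the ProFiT mask is zero because $p_t = 0.3$ falls below the chosen threshold, so the induced target equals the current distribution and the token produces no update.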

Design both branches of Q-target

Define a proxy for $\gamma_t$, and adaptively use teacher guidance to shape the alternatives $\tilde\pi_t$.

1. Label Trust $\gamma_t$

The model probability $\pi_\theta(y_t \mid x_t)$ naturally encodes the support for $y_t$ among all plausible continuations, based on statistical evidence from pretraining. We use it as a proxy for label reliability:

$$\gamma_t = \pi_\theta(y_t \mid x_t) = p_y$$

2. Residual Distribution $\tilde{\pi}_t$

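To preserve the model's prior while admitting external supervision, we let a teacher distribution $\pi_T$ provide reward-style signals that reshape $\pi_\theta(\cdot\mid x_t)$. This yields a teacher-guided distribution with a closed form, in which $\eta$ sets how strongly the teacher reshapes the model distribution ($\eta = 0$ recovers $\pi_\theta$):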

$$\tilde{\pi}_t^\text{guided}(a) \;\propto\; \pi_\theta(a)^{1-\eta}\,\pi_T(a)^{\eta}$$
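
The guided distribution is a normalized geometric mixture of the model and teacher distributions. A minimal sketch follows, assuming both distributions are strictly positive over a shared vocabulary; the function name and example values are hypothetical, and the computation is done in log space purely for numerical stability.

```python
import math

def guided_distribution(pi_theta, pi_teacher, eta):
    """Normalize pi_theta(a)^(1 - eta) * pi_teacher(a)^eta over the vocabulary."""
    # Assumes every probability is strictly positive so the logs are finite.
    logs = [(1 - eta) * math.log(a) + eta * math.log(b)
            for a, b in zip(pi_theta, pi_teacher)]
    m = max(logs)
    unnorm = [math.exp(v - m) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Illustrative distributions (hypothetical).
pi_theta = [0.7, 0.2, 0.1]     # current model
pi_teacher = [0.2, 0.5, 0.3]   # teacher

for eta in (0.0, 0.5, 1.0):
    print(eta, guided_distribution(pi_theta, pi_teacher, eta))
```

At `eta = 0.0` the result is the model distribution itself, and at `eta = 1.0` it is the teacher distribution; intermediate values interpolate between the two in log space.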

Target-SFT: Putting it together

$$Q_t^\text{TARGET} \;=\; p_y\,\delta_{y_t} \;+\; (1-p_y)\,\tilde{\pi}_t^\text{guided}$$

A trusted token (high $p_y$) receives strong supervision, and its target approaches standard SFT's one-hot $\delta_{y_t}$; an uncertain token (low $p_y$) places more weight on the teacher-guided alternatives, and its target approaches $\tilde{\pi}_t^\text{guided}$. The model is then trained with cross-entropy loss to match $Q_t^\text{TARGET}$.
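
A per-token sketch of this objective is given below. It is an illustration under stated assumptions rather than the authors' implementation: the target is treated as a fixed quantity (no gradient flows through $p_y$ or the guided distribution), the guided distribution is passed in precomputed, and the function name and example values are hypothetical.

```python
import math

def target_sft_loss(pi_theta, pi_guided, y):
    """Build Q = p_y * one_hot(y) + (1 - p_y) * pi_guided and return (Q, CE(Q, pi_theta)).

    Q is treated as a constant target here, which is an assumption of this sketch.
    """
    p_y = pi_theta[y]
    q = [p_y * (1.0 if k == y else 0.0) + (1 - p_y) * g
         for k, g in enumerate(pi_guided)]
    loss = -sum(qk * math.log(pk) for qk, pk in zip(q, pi_theta) if qk > 0)
    return q, loss

# Illustrative distributions (hypothetical).
pi_theta = [0.7, 0.2, 0.1]    # current model
pi_guided = [0.4, 0.4, 0.2]   # teacher-guided residual distribution

# Observed token the model already supports (p_y = 0.7).
q_conf, loss_conf = target_sft_loss(pi_theta, pi_guided, y=0)
# Observed token the model finds unlikely (p_y = 0.1).
q_unc, loss_unc = target_sft_loss(pi_theta, pi_guided, y=2)

print(q_conf, loss_conf)
print(q_unc, loss_unc)
```

With these numbers the confident case places most of the target mass on the observed token, while the low-confidence case keeps most of the mass on the guided alternatives.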

Results

Target-SFT improves reasoning performance across 10 dataset-model settings on math and medical benchmarks. While the baselines fluctuate from setting to setting, Target-SFT achieves the best Average@16 result in every one.

Performance summary: Average@16 accuracy across all 10 dataset-model settings

BibTeX

@article{xie2026targetsft,
  title     = {A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design},
  author    = {Tong Xie and Yuanhao Ban and Yunqi Hong and Sohyun An and Yihang Chen and Cho-Jui Hsieh},
  journal   = {arXiv},
  year      = {2026}
}