I’ve recently been looking into knowledge distillation (KD) [1], and I think it’s a topic that is underutilized in the LLM space. As a quick recap, soft-target (as opposed to hard-target) KD involves training a student model on the full next-token probability distributions (the logits) predicted by a teacher model, rather than just the single ‘hard target’ (sampled) next token typically used when training on synthetic data. Compared to KD, standard synthetic-data training pipelines therefore throw away information (the token probabilities) that could align the student model more closely with the teacher. These subtle distribution-level features, often encoded in the relative probabilities of the long tail, are known as “dark knowledge” and are believed to do much of the work in getting the student LLM to mimic the teacher LLM’s behavior.
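To make the soft-target idea concrete, here is a minimal PyTorch sketch of the classic temperature-scaled KD loss (the function name and shapes are my own; a real pipeline would also mask out prompt/padding positions and usually mix this with an ordinary cross-entropy term):

```python
import torch
import torch.nn.functional as F

def soft_target_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        T: float = 1.0) -> torch.Tensor:
    # Flatten to [num_positions, vocab] so the loss is averaged per token position.
    vocab = student_logits.size(-1)
    s = student_logits.reshape(-1, vocab)
    t = teacher_logits.reshape(-1, vocab)
    # The teacher's full next-token distribution is the "soft target".
    teacher_probs = F.softmax(t / T, dim=-1)
    student_log_probs = F.log_softmax(s / T, dim=-1)
    # KL(teacher || student); the T**2 factor keeps gradient scale comparable
    # across temperatures, as in Hinton et al.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (T ** 2)
```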
In fact, one could go further and align not only the output probability distributions but also the intermediate layers of the student and teacher models, potentially capturing internal processes; but this requires some sort of layer mapping between different architectures and is decidedly less straightforward than matching probability distributions. Logit-based KD has also seen more success than intermediate-layer KD in practice. In a comparison by Arcee-AI [2], running SFT on the OpenHermes dataset, logit KD generally produced better results than hidden-state (intermediate-layer) KD, and both performed significantly better than “vanilla” SFT. I will focus on logit-based KD for the rest of this post.
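For completeness, the rough shape of hidden-state distillation for a single mapped layer pair looks like the sketch below (it comes up again briefly in point (2) later). This is a generic illustration of the idea, not DistillKit’s implementation; the learned projection handles a hidden-size mismatch, while the layer mapping itself (which student layer should imitate which teacher layer) is the architecture-specific part that makes this approach less straightforward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateKDLoss(nn.Module):
    """Align one student layer's hidden states with one teacher layer's."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Learned projection to bridge different hidden sizes.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor,
                teacher_hidden: torch.Tensor) -> torch.Tensor:
        # student_hidden: [batch, seq, student_dim]
        # teacher_hidden: [batch, seq, teacher_dim]
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)
```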
The most successful models trained with (logit) KD are from Google: all Gemma [3] and Gemini Flash models are pretrained using various KD techniques and are among the very best at their size. KD has thus been proven highly effective both in large-scale pretraining runs and in smaller-scale SFT, and compared to standard SFT it only involves not throwing data away (so in a sense, we get it “for free” if the logits can be stored). So why is it not much more widely adopted? There are a few very important reasons I can think of, but I still believe there is a growing number of cases where it should be used and can win.
(1) KD is infeasible on API models. For example, OpenAI’s API only exposes the top 5 logprobs, which is not enough to capture the subtle distribution-level features KD relies on. One could train on this small number of top-k logprobs and accept that we can’t get everything, but that has its own issues, as I will explain in point (3b). In comparison, synthetic data for SFT has historically been generated largely with these API models because of their frontier capabilities; however, there have recently been open-weights releases (DeepSeek) that can compete with closed API models, making them attractive teachers for KD.
(2) Logits are tokenizer-specific. To match probability distributions between a student and a teacher, their vocabularies need to match. One therefore cannot straightforwardly perform KD on models with different tokenizers, so we are basically limited to distilling between models in the same family. This is not necessarily crippling, as families such as Llama and Qwen include multiple models of different sizes, specializations, and capabilities (plus the zoo of merges, finetunes, and crazy experiments on Huggingface), but it does greatly constrain the options for student-teacher pairs. There are ways around this, however: the previously mentioned hidden-state distillation allows distilling between different model families, since it aligns intermediate layers rather than outputs; and there are also principled methods of mapping tokens between different tokenizers to allow logit distillation when the straightforward approach doesn’t work. The latter seems to be a topic of recent interest. For example, these two papers describe different ways people have reportedly bridged the tokenizer gap (through approximate likelihood matching and optimal transport, respectively). I admit I haven’t yet read them in detail and can’t comment on how good and practical they are… one can hope that byte-level LLMs will be a thing one day and we won’t have to worry about tokenizers at all…
(3) The computational barriers are significant. There are generally two ways KD can be done: online and offline.
(a) Online KD runs teacher inference concurrently with student training, with the student trained directly on the teacher’s outputs as they are produced. It has the benefit of not needing to save large amounts of logits to storage, and it allows more flexible teacher pipelines: for example, on-policy distillation [4], where the student learns from the teacher’s feedback on its own self-generated sequences while it is being trained (this was leveraged in training Gemma). However, online KD requires loading and running the larger teacher model at the same time as the student is being trained, which makes the entire process slower and more resource-intensive, and it makes training over multiple epochs impractical unless the logits are also saved (at which point it turns into offline KD).
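In code, an online KD step looks roughly like this (a sketch only: `student` and `teacher` are assumed to be Hugging-Face-style causal LMs returning `.logits`, and `soft_target_kd_loss` is the function sketched earlier):

```python
import torch

def online_kd_step(student, teacher, input_ids, optimizer, T=1.0):
    # The teacher is frozen and only used for inference, but it still has to
    # sit in memory alongside the student for the whole training run.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    loss = soft_target_kd_loss(student_logits, teacher_logits, T)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```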
(b) Offline KD involves running teacher inference separately, saving the token probabilities, and then training the student on those saved probabilities. This makes the training itself less intensive than online KD, but it comes with a major storage problem. Modern LLMs have vocabulary sizes of over 100k, which adds up very quickly if the full distribution is saved for every output token. Let’s do a back-of-the-envelope calculation: take the Llama 3 family’s vocab size of 128,256 and assume we are performing SFT over the Hermes-3 dataset of 120 million input tokens and 270 million output tokens (390 million total tokens).
Saving logits for KD:
270 million output tokens x 128,256 vocab entries x 2 bytes/entry (assume fp16) = 63 TiB. This is a lot! Storing and loading this is quite the task.
In comparison, normal SFT:
390 million tokens x 4 bytes/token (assume int32) = 1.45 GiB. This is perfectly reasonable, and more than 44,000x less!
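For reference, the same numbers in a few lines of Python:

```python
vocab = 128_256            # Llama 3 vocabulary size
output_tokens = 270e6      # Hermes-3 output tokens
total_tokens = 390e6       # input + output tokens

kd_bytes = output_tokens * vocab * 2   # one fp16 value per vocab entry per output token
sft_bytes = total_tokens * 4           # one int32 token id per token

print(kd_bytes / 2**40)      # ~63 TiB
print(sft_bytes / 2**30)     # ~1.45 GiB
print(kd_bytes / sft_bytes)  # ~44,000x
```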
Luckily, there are also ways around these huge storage requirements. The first instinct one might have is that saving the logprobs for all 100k+ indices is absolute overkill, and that one can capture almost everything by keeping only a tiny number (tens?) of top-k (or top-p) tokens per position and training the student on just those indices. Unfortunately, top-k and top-p strategies do not seem to work very well compared to full KD, because they introduce a bias by abruptly cutting off all information in the long tail of the probability distribution [5]. A solution has been introduced in the form of random sampling knowledge distillation (RS-KD), where the full probability distribution is Monte Carlo sampled to create a noisy, approximate, but theoretically unbiased sparse distribution. Depending on the number of samples, this sparse distribution can take up very little storage (12 indices per token in the paper) and still perform on par with full KD, because when averaged over many training steps it is expected to behave exactly as if the exact distribution were provided. Thus, offline RS-KD seems to be the best of both worlds: it carries a much smaller additional cost than full KD, often approaching the cost of standard SFT on synthetic data (Monte Carlo sampling costs almost nothing and the extra storage is modest), while retaining the benefits of soft-target KD.
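Here is a rough sketch of the idea as I understand it (my own simplified reading, not the paper’s exact recipe): the offline pass samples a handful of vocabulary indices from the teacher distribution at every position, and the training pass simply averages the student’s negative log-probability over those sampled indices. Because the samples are drawn from the teacher’s own distribution, this is an unbiased estimate of the full cross-entropy against the teacher.

```python
import torch
import torch.nn.functional as F

def sample_teacher_sparse(teacher_logits: torch.Tensor, k: int = 12) -> torch.Tensor:
    """Offline: Monte Carlo sample k vocab indices per position from the teacher."""
    probs = F.softmax(teacher_logits, dim=-1)              # [num_positions, vocab]
    # Duplicates are expected and fine: sampling frequency encodes probability,
    # so only the indices need to be stored (k ints per token, not 128k floats).
    return torch.multinomial(probs, k, replacement=True)   # [num_positions, k]

def rs_kd_loss(student_logits: torch.Tensor, sampled_ids: torch.Tensor) -> torch.Tensor:
    """Training: average student NLL over the stored sampled indices."""
    log_q = F.log_softmax(student_logits, dim=-1)           # [num_positions, vocab]
    picked = torch.gather(log_q, dim=-1, index=sampled_ids) # [num_positions, k]
    return -picked.mean()
```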
Taking all of this into account, I am currently testing the performance of offline RS-KD in the specific context of chain-of-thought reasoning models. I am choosing this problem because reasoning can significantly increase the capabilities of a base model, and it can be learned either with RL or by training on chain-of-thought traces (hard-target SFT). DeepSeek, in training their R1 distilled models [6] (“distilled” using hard targets, not KD), claimed that the latter approach was more effective: smaller models learned reasoning capabilities from larger models more efficiently than they could discover them through RL on their own. The question then becomes whether KD, which captures subtler features than standard SFT (including the probability space over reasoning steps), can impart reasoning capabilities to smaller models even more effectively than DeepSeek’s approach. My current experiment uses the Qwen family (sticking to one family because cross-tokenizer methods add too much complexity), with the reasoning LLM QwQ-32B as the teacher, a smaller Qwen model (7B or 14B) as the student, and competition math as the first target (with NuminaMath-CoT as the training dataset).
I have chosen QwQ mainly due to its size (32B): compared to giant models like DeepSeek, it is obviously more economical to run while remaining highly capable, and its smaller size may actually be an advantage in distillation. It is often more effective to distill from a smaller teacher than a larger one: in training Gemma 3, a smaller teacher was found to give better results for up to tens of billions of training tokens (after which the larger teacher prevailed, but tens of billions of tokens is pretraining territory, not SFT). Another observation is that QwQ is often much more verbose than larger reasoning models, checking and double-checking (“wait,” “alternatively”), which might be a necessary feature of a reasoning model at this smaller size (hence even more reason to distill a smaller student with QwQ as teacher). This is potentially further evidenced by the various test-time compute scaling papers where tiny CoT models claim to perform excellently on certain tasks after constantly questioning their own results.
In my own experiments, the math reasoning traces have, at many positions, extremely sharply peaked distributions with almost unity probability on a single token. This is absolutely expected, as math can be precise compared to normal language: if the context contains “x+1=2” and the text so far is “x=”, the model very much should always predict “1”. I am curious how this will affect KD, as the token probability space might now conceptually resemble “discrete branches” of deductions instead of “continuous branches” where every token can diverge, with some of the branching corresponding to the “wait”s and “alternatively”s and the model’s judgment and capability to double-check and verify. Maybe this capability can be distilled into the student?
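One simple diagnostic I’m using to look at this peakedness is the per-position entropy of the teacher distribution over a trace: near-zero entropy marks the forced deductive steps, while higher-entropy positions are the candidate “branch points.” A quick sketch (my own helper, nothing from any particular codebase):

```python
import torch
import torch.nn.functional as F

def token_entropies(teacher_logits: torch.Tensor) -> torch.Tensor:
    # teacher_logits: [seq_len, vocab] for a single reasoning trace
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)  # entropy (nats) at each position
```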
[1] Distilling the Knowledge in a Neural Network: Hinton’s original paper introducing KD (not in an LLM context).
[2] DistillKit by Arcee-AI: a KD codebase that has been used to distill small models, with comparisons between different techniques.
[3] [2503.19786] Gemma 3 Technical Report: technical report on the excellent Gemma 3 models, with a review of the different KD approaches they used.
[4] [2306.13649] On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes: a paper on an alternative KD pipeline where the teacher model evaluates the student’s self-generated sequences.
[5] [2503.16870] Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs: introduces RS-KD and provides benchmarks on how different approaches and hyperparameters perform in KD.
[6] [2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning: the DeepSeek-R1 paper introducing lots of RL and reasoning stuff like GRPO, but also including some results and commentary on the distilled models.