Amazon has introduced a new framework for improving small document-understanding models through knowledge distillation from large language models (LLMs). This is part of a broader wave of work on distillation. Google has developed a new speculative decoding-based method that addresses limitations of on-policy knowledge distillation by using the teacher and student models together to generate high-quality training data on the fly, aligned with the student's inference-time distribution. A recent line of research traces a full-circle evolution in LLM distillation, from On-Policy KD through DistillSpec to Speculative KD. Separately, diffusion models have been improved by distilling into multiple student models, which raises quality by letting each student specialize in a subset of the data and cuts latency by enabling one-step generation with smaller architectures. Finally, smart teacher intervention during knowledge distillation has been proposed to keep student models from drifting toward undesired outputs, much like a backup teacher who steps in exactly when needed.
Smart teacher intervention during knowledge distillation prevents student models from going off track, like having a backup teacher who steps in exactly when needed. 🤖 Original Problem: Knowledge Distillation (KD) for LLMs faces challenges with student-generated outputs (SGOs).… https://t.co/cMIq5GoIZK
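The intervention idea can be sketched at the token level: the student proposes each token, and the teacher steps in only when the proposal falls outside the teacher's top-k candidates. This is a toy illustration of the general mechanism, not the paper's implementation; the function name, threshold rule, and distributions below are all illustrative assumptions.

```python
import random

def skd_generate(student_probs, teacher_probs, top_k=2, seed=0):
    """Toy sketch of token-level teacher intervention (speculative-KD style).

    student_probs / teacher_probs: one {token: probability} dict per position.
    The student samples a proposal at each position; if the proposal is not
    among the teacher's top_k tokens, the teacher's top token is used instead.
    """
    rng = random.Random(seed)
    output = []
    for s_dist, t_dist in zip(student_probs, teacher_probs):
        # Student proposes a token by sampling from its own distribution,
        # so the training data matches the student's inference-time behavior.
        tokens, weights = zip(*s_dist.items())
        proposal = rng.choices(tokens, weights=weights, k=1)[0]
        # Teacher checks the proposal; it intervenes only when the student
        # has drifted outside the teacher's top-k candidates.
        top = sorted(t_dist, key=t_dist.get, reverse=True)[:top_k]
        output.append(proposal if proposal in top else top[0])
    return output
```

With a permissive check (larger `top_k`) the student's proposals mostly survive; with a strict check the sequence collapses toward pure teacher output, which is the knob the tweet's "backup teacher" analogy describes.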
This work improves diffusion models by distilling into multiple students: (a) quality improves because each student specializes in a subset of the data, and (b) latency improves because distilling into smaller models enables 1-step generation with smaller, lower-latency architectures. Paper:… https://t.co/yV79l8KTT2
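The partition-and-specialize structure can be sketched as follows. This is a deliberately minimal toy: memorizing the teacher's output stands in for actually training a small diffusion student, and the hash-based router is a placeholder for whatever data partitioning the real system uses.

```python
def distill_multi_student(data, teacher, n_students, router):
    """Toy sketch of multi-student distillation: the dataset is split by a
    router, and one small student is fit per subset. Here 'fitting' is just
    memorizing the teacher's output on that subset (a stand-in for training)."""
    students = [dict() for _ in range(n_students)]
    for x in data:
        students[router(x)][x] = teacher(x)
    return students

def infer(x, students, router):
    # At inference only the one specialized (smaller) student is queried,
    # which is where the single-step / lower-latency benefit comes from.
    return students[router(x)][x]

# Illustrative setup: a toy "teacher" and a parity-based router.
teacher = lambda x: x * 2
router = lambda x: x % 2
students = distill_multi_student(range(6), teacher, n_students=2, router=router)
```

The design point the tweet makes is that each student only has to cover its own slice of the data distribution, so it can be both smaller and more accurate on that slice than a single monolithic student.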
Recent line of work on LLM Distillation, coming full circle:
- On-Policy KD: https://t.co/9blyBPJZdc
- DistillSpec: https://t.co/4bk7wBzCaJ
- Speculative KD (SKD): https://t.co/ro4DpMAzyd

https://t.co/EZe4NZTD81
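A common thread in this line of work is the per-token objective: the student samples its own sequences, and the loss compares student and teacher distributions at each position. One frequently used choice is the reverse KL, KL(student ‖ teacher), which penalizes the student for placing mass where the teacher places little. The helper below is a generic sketch of that quantity over a toy vocabulary, not any specific paper's code.

```python
import math

def reverse_kl(student, teacher):
    """Reverse KL divergence KL(student || teacher) over a shared vocabulary.

    student / teacher: {token: probability} dicts. On-policy-style KD
    evaluates a divergence like this per token, on sequences the student
    itself generated, so training matches inference-time behavior.
    Assumes teacher[t] > 0 wherever student[t] > 0.
    """
    return sum(p * math.log(p / teacher[t])
               for t, p in student.items() if p > 0)
```

Reverse KL is zero when the two distributions match and grows as the student concentrates mass on tokens the teacher considers unlikely, which is the "mode-seeking" behavior often cited as a reason to prefer it over forward KL for distillation.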