Distillation trains a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's output probabilities, not just the final answers.
You can distill a fine-tuned large model into a much smaller student that runs several times faster. Quality drops, but often by less than you'd expect.
Distillation works well when you've fine-tuned a large model and need faster inference. Train the student on the same data, using teacher outputs as soft targets.
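The soft-target loss described above can be sketched as follows. This is a minimal NumPy illustration, not a production training loop: the function names, the temperature `T=2.0`, and the mixing weight `alpha` are illustrative assumptions, following the common recipe of blending a temperature-scaled KL term against the teacher with a standard cross-entropy term against the hard label.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T produces softer probabilities."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Blend a soft-target loss (match the teacher) with a hard-label loss.

    Assumed hyperparameters: temperature T and mixing weight alpha are
    illustrative; in practice they are tuned per task.
    """
    # Soft targets: the teacher's full probability distribution at temperature T
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student), scaled by T^2 so gradients stay comparable
    # across temperatures (the standard correction from the distillation recipe)
    soft_loss = float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))) * T * T
    # Hard loss: ordinary cross-entropy against the ground-truth label
    hard_loss = -float(np.log(softmax(student_logits)[hard_label]))
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

At training time this loss replaces plain cross-entropy: for each example you run both models, feed the teacher's logits in as soft targets, and backpropagate only through the student.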