The original version of this story appeared in Quanta Magazine.
The Chinese AI company DeepSeek released a chatbot earlier this year called R1, which drew a huge amount of attention. Most of it focused on the fact that a relatively small and unknown company said it had built a chatbot that rivaled the performance of those from the world's most famous AI companies, but using a fraction of the computing power and cost. As a result, the stocks of many Western tech companies plummeted; Nvidia, which sells the chips that run leading AI models, lost more stock value in a single day than any company in history.
Some of that attention involved an allegation that DeepSeek had, without authorization, obtained knowledge from OpenAI's proprietary o1 model by using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had discovered a new, more efficient way to build AI.
But distillation, also called knowledge distillation, is a widely used tool in AI, a subject of computer science research going back a decade and one that big tech companies use on their own models. "Distillation is one of the most important tools that companies have today to make models more efficient," said Enric Boix-Adserà, a researcher who studies distillation at the University of Pennsylvania's Wharton School.
Dark knowledge
The idea of distillation began with a 2015 paper by three researchers at Google, including Geoffrey Hinton, the so-called godfather of AI and a 2024 Nobel laureate. At the time, researchers often ran ensembles of models ("many models glued together," said Oriol Vinyals, a principal scientist at Google DeepMind and one of the paper's authors) to improve their performance. "But it was incredibly cumbersome and expensive to run all the models in parallel," Vinyals said. "We were intrigued by the idea of distilling that onto a single model."
The researchers thought they could make progress by addressing a notable weak point in machine learning algorithms: wrong answers were all considered equally bad, regardless of how wrong they were. In an image classification model, for instance, "confusing a dog with a fox was penalized the same way as confusing a dog with a pizza," Vinyals said. The researchers suspected that ensemble models contained information about which wrong answers were less bad than others. Perhaps a smaller "student" model could use the information from the large "teacher" model to more quickly grasp the categories into which it was supposed to sort images. Hinton called this "dark knowledge," invoking an analogy with cosmological dark matter.
After discussing this possibility with Hinton, Vinyals developed a way to get the large teacher model to pass more information about the image categories to a smaller student model. The key was homing in on "soft targets" in the teacher model, where it assigns probabilities to each possibility rather than giving firm this-or-that answers. One model, for example, calculated that there was a 30% chance an image showed a dog, 20% that it showed a cat, 5% that it showed a cow and 0.5% that it showed a car. By using these probabilities, the teacher model effectively revealed to the student that dogs are quite similar to cats, not so different from cows, and quite distinct from cars. The researchers found that this information helped the student learn more efficiently to identify images of dogs, cats, cows and cars. A big, complicated model could be reduced to a leaner one with hardly any loss of accuracy.
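To make that mechanism concrete, here is a minimal sketch of a soft-target distillation loss of the kind described above, assuming PyTorch; the class order (dog, cat, cow, car), the example numbers and the function name are illustrative assumptions, not taken from the original paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-target loss.

    The soft-target term nudges the student's temperature-softened
    probabilities toward the teacher's, so the student also learns
    that "dog" is closer to "cat" than to "car".
    """
    # Ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # KL divergence between the softened teacher and student distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescaling keeps gradient magnitudes comparable across temperatures

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Illustrative numbers only: a teacher that spreads probability over related classes.
teacher_logits = torch.tensor([[3.4, 3.0, 1.6, -1.0]])  # dog, cat, cow, car
student_logits = torch.randn(1, 4, requires_grad=True)
labels = torch.tensor([0])                               # the true class is "dog"
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In a real training run the same loss would be applied over many batches of images, with the teacher's outputs precomputed or generated on the fly.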
Explosive growth
The idea was not an immediate hit. The paper was rejected from a conference, and Vinyals, discouraged, turned to other topics. But distillation arrived at an important moment. Around that time, engineers were discovering that the more training data they fed into neural networks, the more effective those networks became. The size of models soon exploded, as did their capabilities, but the costs of running them climbed in step with their size.
Many researchers turned to distillation as a way to make models smaller. In 2018, for instance, Google researchers unveiled a powerful language model called BERT, which the company soon began using to help parse billions of web searches. But BERT was big and costly to run, so the next year other developers distilled a smaller version sensibly named DistilBERT, which became widely used in business and research. Distillation gradually became ubiquitous, and it is now offered as a service by companies such as Google, OpenAI and Amazon. The original distillation paper, still published only on the arxiv.org preprint server, has now been cited more than 25,000 times.
Considering that distillation requires access to the innards of the teacher model, it is not possible for a third party to sneakily distill data from a closed-source model like OpenAI's o1, as DeepSeek was alleged to have done. That said, a student model could still learn quite a bit from a teacher model just by prompting the teacher with certain questions and using the answers to train its own models, an almost Socratic approach to distillation.
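As a rough sketch of what that Socratic, prompt-based approach could look like in practice (this is not a description of DeepSeek's actual pipeline; the teacher model name, the prompts and the output file below are illustrative assumptions), one might collect question-answer pairs from a teacher's public API and use them later as fine-tuning data for a student:

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A handful of stand-in prompts; a real dataset would use many thousands.
questions = [
    "Explain why the sky is blue in two sentences.",
    "What is 17 * 24? Show your reasoning.",
]

pairs = []
for q in questions:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in teacher model, chosen for illustration
        messages=[{"role": "user", "content": q}],
    )
    pairs.append({"prompt": q, "response": resp.choices[0].message.content})

# The collected pairs become supervised fine-tuning data for the student model.
with open("teacher_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```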
Meanwhile, other researchers continue to find new applications. In January, the NovaSky lab at UC Berkeley showed that distillation works well for training chain-of-thought reasoning models, which use multistep "thinking" to better answer complicated questions. The lab says its fully open-source Sky-T1 model cost less than $450 to train, and it achieved similar results to a much larger open-source model. "We were genuinely surprised by how well distillation worked in this setting," said Dacheng Li, a Berkeley doctoral student and co-student lead of the NovaSky team. "Distillation is a fundamental technique in AI."
Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.