Recent open-source work has delivered significant gains in language model training speed. A new paper on NanoGPT claims a 4 to 20 times training speedup over GPT, but initial reproduction attempts have run into problems with the baseline, and the community continues to debate how well the various proposed training accelerations hold up. The clearest benchmark is the modded-nanogpt repository, which has cut the GPT-2 (124M) training run from 45 minutes down to about 5 minutes. The latest milestone is a new NanoGPT training speed record by @KoszarskyB: a FineWeb validation loss of 3.28 in 5.03 minutes, beating the previous record of 7.2 minutes, enabled by FlexAttention with a large sequence length.
New NanoGPT training speed record: 3.28 FineWeb val loss in 5.03 minutes
Previous record: 7.2 minutes
Changelog: FlexAttention with large sequence length
This record is by @KoszarskyB https://t.co/gbNqYGwIg2
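For reference, FlexAttention is PyTorch's API (torch >= 2.5) for expressing custom attention masks that still compile down to a fused, block-sparse kernel, which is what makes attention over a large sequence length tractable. The sketch below only illustrates the API with a plain causal mask and placeholder shapes; the tweet does not specify the exact mask or sequence length behind the record.

```python
# Illustrative use of PyTorch FlexAttention (torch >= 2.5) with a block mask.
# Shapes, mask, and sequence length are placeholders, not the record's settings.
# Assumes a CUDA GPU is available.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 12, 8192, 64   # batch, heads, sequence length, head dim (placeholders)

def causal_mask(b, h, q_idx, kv_idx):
    # Standard causal masking: a query may only attend to positions at or before it.
    return q_idx >= kv_idx

# The block mask lets the kernel skip tiles that are entirely masked out,
# so cost scales with the number of visible blocks rather than S^2.
block_mask = create_block_mask(causal_mask, B=B, H=H, Q_LEN=S, KV_LEN=S, device="cuda")

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# In practice flex_attention is usually wrapped in torch.compile for speed.
out = flex_attention(q, k, v, block_mask=block_mask)
```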
NanoGPT "speedrunning" is a fascinating project, showcasing modern architecture tweaks and the Muon optimizer https://t.co/mVMoaqQguz
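Muon, mentioned above, is at its core momentum SGD where the update for each 2D weight matrix is approximately orthogonalized with a few Newton-Schulz iterations before being applied. A minimal single-parameter sketch of that idea follows, based on the public Muon description; the coefficients are the published quintic-iteration values, and details of the record-setting implementation (bfloat16 iteration, update scaling, Nesterov momentum, distributed handling) are omitted.

```python
# Minimal sketch of the Muon idea: momentum SGD whose per-matrix update is
# approximately orthogonalized via a quintic Newton-Schulz iteration.
# This is an illustration of the technique, not the modded-nanogpt implementation.
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G = U S V^T to U V^T (an orthogonalized update)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the public Muon write-up
    X = G.float()                       # the original runs this in bfloat16 for speed
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)           # bring the spectral norm to at most ~1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon-style update for a single 2D parameter (sketch, no Nesterov or scaling)."""
    buf.mul_(momentum).add_(grad)       # heavy-ball momentum accumulation
    param.add_(newton_schulz(buf), alpha=-lr)
```

The intuition is that orthogonalizing the momentum roughly equalizes the singular values of the update, so no single direction of the weight matrix dominates the step.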
Remember the llm.c repro of the GPT-2 (124M) training run? It took 45 min on 8xH100. Since then, @kellerjordan0 (and by now many others) have iterated on that extensively in the new modded-nanogpt repo that achieves the same result, now in only 5 min! Love this repo 👏 600 LOC https://t.co/VTtpXbA5g8