
Recent research explores scaling inference compute for large language models (LLMs) by increasing the number of generated samples per input. This method, known as repeated sampling, lets a model make many attempts at a problem rather than relying on a single try. The approach has shown promising results: DeepSeek-Coder-V2-Instruct reaches 56% on SWE-bench Lite with 250 attempts, significantly outperforming single-attempt Claude 3.5 Sonnet while being 4.25 times cheaper, and a Llama 8B model can surpass 70B models when the comparison is controlled for FLOPs. The studies indicate that this new scaling dimension can improve the performance, cost-efficiency, and coverage of LLMs.
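
The core recipe is just a sampling loop with a checker on top: draw many independent completions for the same problem and keep the first one that an automatic verifier accepts (for SWE-bench-style tasks, the repository's unit tests). A minimal Python sketch, where `sample_fn` and `verify_fn` are hypothetical stand-ins for the model call and the verifier:

```python
from typing import Callable, Optional

def solve_by_repeated_sampling(
    problem: str,
    sample_fn: Callable[[str], str],        # hypothetical: returns one model completion per call
    verify_fn: Callable[[str, str], bool],  # hypothetical: e.g. run unit tests on a candidate patch
    num_samples: int = 250,
) -> Optional[str]:
    """Draw independent samples and return the first one that passes verification."""
    for _ in range(num_samples):
        candidate = sample_fn(problem)
        if verify_fn(problem, candidate):
            return candidate  # at least one attempt solved the problem
    return None  # no attempt passed within the sample budget
```

Note that this sketch assumes a cheap automatic verifier exists; without one, selecting the right sample from the pool becomes the harder part of the problem.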
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. https://t.co/w7bLwVLZfa
Is inference compute a new dimension for scaling LLMs? In our latest paper, we explore scaling inference compute by increasing the number of samples per input. Across several models and tasks, we observe that coverage – the fraction of problems solved by at least one attempt –… https://t.co/EgpiKm5lmW
Do you like LLMs? Do you also like for loops? Then you’ll love our new paper! We scale inference compute through repeated sampling: we let models make hundreds or thousands of attempts when solving a problem, rather than just one. By simply sampling more, we can boost LLM… https://t.co/HbpzlbUR2S
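
Coverage, as defined in the tweets above, is the repeated-sampling analogue of pass@k. A hedged sketch of how it is typically estimated, using the standard unbiased pass@k estimator from Chen et al. (2021) and assuming n samples were drawn per problem of which c passed:

```python
from math import comb

def coverage_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    is correct, given c correct samples observed among n total draws."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 of 100 generated samples solve the problem; with a budget of
# k = 10 attempts the estimated coverage is about 0.42.
print(coverage_at_k(100, 5, 10))
```

Averaged over all problems in a benchmark, this gives a coverage estimate that grows as the sample budget k increases.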
