
Recent research has surfaced a counterintuitive finding in artificial intelligence: increasing the number of calls to large language models (LLMs), such as ChatGPT, does not necessarily improve the performance of compound AI systems. This challenges the prevailing assumption that more LLM calls under a sample+filter approach (sampling many answers and aggregating them) lead to better outcomes. The researchers study the scaling properties of such systems, both theoretically and empirically, in order to estimate the optimal number of LLM calls. They find a non-monotonic relationship between the number of calls and system performance: additional calls tend to help on simpler queries but can hurt on more complex ones. Separately, the 'tinyBenchmarks' work proposes cheap and reliable LLM benchmarking that cuts the required evaluation compute by up to 140x on benchmarks such as MMLU. Together, these results have significant implications for how AI systems are built and optimized, prompting a reevaluation of strategies for integrating LLMs into compound AI architectures.
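To make the setup concrete, here is a minimal sketch of the sample+filter pattern the summary refers to: query the model several times and keep the majority answer. The `call_llm` function below is a hypothetical stand-in (simulated here), not an API from either paper.

```python
import random
from collections import Counter

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call (e.g., a ChatGPT API request).
    Simulated as a noisy multiple-choice answerer purely for illustration."""
    return random.choices(["A", "B", "C"], weights=[0.6, 0.25, 0.15])[0]

def sample_and_vote(prompt: str, k: int) -> str:
    """Sample+filter strategy: call the LLM k times and return the most common answer."""
    answers = [call_llm(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # More calls sharpen the majority estimate for this (easy) simulated query,
    # but the papers above show this does not translate to monotonic gains in general.
    for k in (1, 5, 25):
        print(k, sample_and_vote("Which option is correct: A, B, or C?", k))
```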
Wow! This is cool, for anyone using Self-consistency-like methods. This paper discovers a non-monotonic relationship between the number of LLM calls and performance. Contrary to what one might intuitively expect, making more calls to an LLM does not always = better performance https://t.co/dYcPjdl5wu
tinyBenchmarks: Quick and cheap LLM evaluation! We developed ways of making LLM benchmarking cheap and reliable, reducing the compute needed by up to 140x (e.g., on MMLU). paper: https://t.co/CkdShZpgDg GitHub repo: https://t.co/DUHNtwjILT Thread below🧵1/5 https://t.co/NtMO0kDef8
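The actual method lives in the linked paper and GitHub repo; as rough intuition for where a ~140x saving can come from, here is a toy sketch that estimates full-benchmark accuracy from a small subset of examples. The uniform subsample below is only an illustrative baseline, not the tinyBenchmarks procedure itself, which uses a smarter example-selection and estimation scheme.

```python
import random

def estimate_accuracy(per_example_correct: list[bool], subset_size: int = 100) -> float:
    """Toy estimator: score the model on a small random subset instead of the
    full benchmark. tinyBenchmarks itself is more sophisticated; this uniform
    subsample only illustrates the basic intuition."""
    subset = random.sample(per_example_correct, subset_size)
    return sum(subset) / subset_size

if __name__ == "__main__":
    # Hypothetical per-example correctness for an MMLU-sized benchmark (~14k questions),
    # so scoring 100 examples is roughly a 140x reduction in evaluation cost.
    full = [random.random() < 0.62 for _ in range(14_000)]
    print(f"full-benchmark accuracy: {sum(full) / len(full):.3f}")
    print(f"100-example estimate:    {estimate_accuracy(full):.3f}")
```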
This surprised us. If you call #chatgpt multiple times and take the consensus, the quality of the answer can get worse as the # of #LLM calls increases. We explain why this happens in our new paper and also show how to estimate the optimal # of LLM calls https://t.co/bqy2IPGb2e https://t.co/s5ZP0avdUl https://t.co/rTDV3RayiG
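The summary above attributes the effect to a mix of query difficulties: extra calls help on easy queries but hurt on hard ones. The toy calculation below, a hypothetical illustration with assumed per-call accuracies (0.8 for easy queries, 0.4 for hard ones) rather than the paper's actual model, shows how that mix makes majority-vote accuracy rise and then fall as the number of calls grows.

```python
from math import comb

def p_majority_correct(p: float, k: int) -> float:
    """Exact probability that a strict majority of k independent calls is correct,
    when each call is correct with probability p (binary answers, k odd so no ties)."""
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(k // 2 + 1, k + 1))

if __name__ == "__main__":
    # Assumed task mix: half the queries are "easy" (per-call accuracy 0.8),
    # half are "hard" (per-call accuracy 0.4). Voting pushes easy queries toward
    # 1.0 and hard queries toward 0.0, so the aggregate peaks at a finite k.
    for k in (1, 3, 5, 11, 21, 51):
        acc = 0.5 * p_majority_correct(0.8, k) + 0.5 * p_majority_correct(0.4, k)
        print(f"k={k:2d}  expected accuracy = {acc:.3f}")
```

Under these assumed numbers, expected accuracy peaks at a small number of calls and then drifts back toward 0.5, the kind of rise-then-fall behavior described above.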


