
Magic AI Labs has introduced HashHop, a new evaluation method designed to address the limitations of existing long-context evaluations. Alongside HashHop, the team reports serving a 100-million-token context window while using only a fraction of the memory of a single H100 GPU. The announcement has drawn attention for its potential to improve how large-context models are evaluated: Magic's accompanying blog post details the weaknesses of popular long-context evaluation methods and presents HashHop as a viable alternative. Many hope the new method will replace outdated benchmarks, such as the 'Needle In A Haystack' test, with a more rigorous standard for assessing long context windows.
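The core idea behind HashHop is simple enough to sketch: the prompt is filled with shuffled hash-to-hash assignments, and the model must chain several "hops" from a starting hash, with no semantic cues to latch onto. Below is a minimal illustration of that style of eval; the hash format, chain length, and function names are illustrative assumptions, not Magic's actual implementation.

```python
import random
import string

def rand_hash(n: int = 8) -> str:
    """A short random token with no semantic content (assumed format)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))

def make_hashhop_prompt(num_chains: int = 100, hops: int = 3):
    """Build a HashHop-style eval item: shuffled hash assignments plus a
    multi-hop query. A sketch only, not Magic's reference implementation."""
    chains = [[rand_hash() for _ in range(hops + 1)] for _ in range(num_chains)]
    # Flatten every adjacent pair into an assignment line, then shuffle so
    # the model cannot rely on positional locality to find the next hop.
    pairs = [f"{c[i]} = {c[i + 1]}" for c in chains for i in range(hops)]
    random.shuffle(pairs)
    target = random.choice(chains)
    prompt = "\n".join(pairs) + f"\n\nStarting from {target[0]}, hop {hops} times."
    answer = target[-1]  # the hash reached after following every hop
    return prompt, answer

prompt, answer = make_hashhop_prompt()
```

Because the hashes are random and incompressible, the model cannot exploit surface patterns the way it can with a natural-language "needle", which is the weakness of the older test that the summary above alludes to.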


You know something is good when it aces existing tests and has to invent its own benchmark to flex. HashHop is a step forward, and I hope it becomes the norm, replacing the stupid "Needle In A Haystack" test for benchmarking long context windows. https://t.co/42JpYoPAa8
The HashHop method smells nice. Really need that kind of eval for large-context models. The 100M context window in a fraction of a single H100's memory is insane. My bet is on some infra wizardry. Inference-time compute is free intelligence, and that part is not surprising. https://t.co/v6XXDFURxY https://t.co/Jksta5SWIC
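To see why that memory claim raises eyebrows, a back-of-envelope KV-cache calculation helps. The configuration below is an assumed Llama-70B-like GQA layout, not Magic's (unpublished) architecture; under naive attention caching, a 100M-token context would need tens of terabytes, far beyond one H100's 80 GB.

```python
# Back-of-envelope KV-cache size for naive attention at 100M tokens.
# All config numbers are assumptions (Llama-3-70B-like GQA layout),
# not Magic's actual model.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2          # fp16/bf16
seq_len = 100_000_000       # 100M-token context

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_tb = bytes_per_token * seq_len / 1e12

print(f"{bytes_per_token / 1024:.0f} KiB per token, {total_tb:.1f} TB total")
# -> 320 KiB per token, ~32.8 TB total, vs. 80 GB on a single H100
```

Fitting that into a fraction of one GPU implies something far from a standard transformer KV cache, which is presumably the "infra wizardry" the tweet is betting on.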
HashHop is a pretty interesting synthetic-data/eval method, similar to what I've thought of, and it doesn't smell like BS to me. I have no idea how they can achieve what they claim here, but I'll withhold judgement until some credible people report on downstream experience. https://t.co/wKyiFcUnk4 https://t.co/FpHEAGCpNd