
In 2024, Google DeepMind, in collaboration with Stanford University, announced a significant advance in evaluating the factuality of content generated by large language models (LLMs). The research introduces the Search-Augmented Factuality Evaluator (SAFE), a method that uses an LLM together with Google Search to break long-form responses into individual claims and verify each one. The authors report that this approach can reach superhuman rating performance, outperforming crowdsourced human annotators in reliability while being roughly 20x cheaper. The study also introduces LongFact, a prompt set for benchmarking the long-form factuality of LLMs across a wide range of domains, along with an evaluation metric that considers both precision and recall. Evaluating thirteen popular LLMs, including models from the Gemini, GPT, and Claude families, the researchers find that larger models tend to be more factual and that LLMs can serve as efficient, cost-effective annotators. The work offers a more scalable and accurate way to verify information, with significant implications for AI fact-checking.
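The precision/recall metric mentioned above aggregates per-claim ratings into a single score (reported in the paper as F1@K): precision is the fraction of claimed facts rated as supported, and recall measures supported facts against a target count K of facts a user would want in a response. The sketch below is a rough illustration of that aggregation, not the paper's exact formulation; the counts and K = 64 in the example are purely illustrative.

```python
# Illustrative F1@K-style aggregation; the supported/unsupported counts
# would come from a SAFE-style per-claim rating step.

def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Combine per-claim ratings into a single factuality score.

    Precision: share of claimed facts that are supported.
    Recall: supported facts relative to a target count K, capped at 1.
    """
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Example: 40 supported and 10 unsupported facts with K = 64 gives ~0.70.
print(f1_at_k(40, 10, 64))
```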
People and companies lie about AI. https://t.co/CTFindvjC4
Our new effort tries to address an elephant in the room for LLMs: given that factuality/hallucination is so critical to the success of LLMs, is there a quantitative evaluation to benchmark all existing LLMs in general? Hope our benchmark will be adopted and benchmarked as part of… https://t.co/NfkqTGRAoh
New work on evaluating long form factuality 🎉. Our method SAFE combines Google Search and LLM queries to extract and verify individual claims in responses. Most excitingly, we show SAFE is cheaper💰 and more reliable ✅ than human annotators. https://t.co/ulSad7fs0b
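To make the pipeline described in the tweet above concrete, here is a minimal, simplified sketch of a SAFE-style evaluator. It is not the released implementation: `call_llm` and `google_search` are hypothetical stubs standing in for a real LLM API and a real search backend, and the actual method involves additional steps such as iterative query generation and relevance checks.

```python
# Simplified SAFE-style sketch. call_llm() and google_search() are
# hypothetical stubs; wire them to a real LLM API and search backend.

from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client)."""
    raise NotImplementedError("Connect this to an LLM API.")


def google_search(query: str) -> str:
    """Placeholder returning a text snippet of search results for a query."""
    raise NotImplementedError("Connect this to a search backend.")


@dataclass
class FactRating:
    fact: str
    label: str  # "supported" or "not_supported"


def split_into_facts(response: str) -> list[str]:
    """Ask the LLM to decompose a long-form response into self-contained claims."""
    prompt = (
        "Split the following response into individual, self-contained "
        "factual claims, one per line:\n\n" + response
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]


def rate_fact(fact: str) -> FactRating:
    """Verify a single claim against search results and rate its support."""
    query = call_llm(f"Write a Google Search query that would verify: {fact}")
    evidence = google_search(query)
    verdict = call_llm(
        f"Claim: {fact}\nSearch results: {evidence}\n"
        "Answer 'supported' if the results back the claim, otherwise 'not_supported'."
    )
    return FactRating(fact=fact, label=verdict.strip().lower())


def evaluate_response(response: str) -> list[FactRating]:
    """Run the full split-then-verify loop over one long-form response."""
    return [rate_fact(fact) for fact in split_into_facts(response)]
```

The counts of supported and unsupported ratings produced by `evaluate_response` can then feed a precision/recall aggregation like the F1@K sketch shown earlier.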


