
Researchers from Google DeepMind and Stanford University have introduced the Search-Augmented Factuality Evaluator (SAFE), a method for evaluating the long-form factuality of large language models (LLMs). SAFE uses an LLM agent to split a long-form response into individual facts and to check each fact against Google Search results. The paper reports that SAFE achieves superhuman rating performance: it agrees with crowdsourced human annotators on 72% of individual facts, wins 76% of a random sample of disagreement cases, and is more than 20 times cheaper than human annotation. Alongside SAFE, the study introduces LongFact, a new prompt dataset, and F1@K, an aggregation metric, and uses them to benchmark the factuality of LLMs such as Gemini, GPT, and Claude.
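The F1@K metric balances factual precision (the fraction of a response's checked facts that are supported) against a recall term capped at K supported facts, so a response is not rewarded for simply being longer once it reaches the target number of supported facts. Below is a minimal sketch of that aggregation; the class and function names are illustrative placeholders, not the authors' released code, and the exact formula should be checked against the paper.

```python
from dataclasses import dataclass


@dataclass
class FactualityRating:
    """Counts produced by a SAFE-style evaluator for one long-form response."""
    supported: int      # facts verified against search results
    not_supported: int  # facts contradicted or not verifiable
    irrelevant: int     # facts not relevant to the prompt (excluded from F1@K)


def f1_at_k(rating: FactualityRating, k: int) -> float:
    """F1@K: harmonic mean of factual precision and recall capped at K,
    where K encodes the preferred number of supported facts in a response."""
    s, n = rating.supported, rating.not_supported
    if s == 0:
        return 0.0
    precision = s / (s + n)    # share of checked facts that are supported
    recall = min(s / k, 1.0)   # supported facts relative to the target K
    return 2 * precision * recall / (precision + recall)


# Example: 40 supported and 10 unsupported facts, scored against K = 64.
print(round(f1_at_k(FactualityRating(40, 10, 3), k=64), 3))  # -> 0.702
```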



"Long-Form Factuality in Large Language Models" introduces a new approach to evaluating and benchmarking the factuality of long-form responses generated by large language models (LLMs). Key contributions: https://t.co/61SPVtboDN
Researchers from Google DeepMind and Stanford Introduce Search-Augmented Factuality Evaluator (SAFE): Enhancing Factuality Evaluation in Large Language Models. Quick read: https://t.co/anXisulDKY Researchers from Google DeepMind and Stanford University have introduced a novel…
People and companies lie about AI. https://t.co/CTFindvjC4