Researchers have developed a new toolbox for surfacing health equity harms and biases in large language models (LLMs) such as Med-PaLM 2; the framework evaluates bias along six dimensions, including inaccuracies and stereotypical language, and was tested on more than 17,000 ratings. Meanwhile, with new LLMs like OpenAI o1 and QWEN 2.5 being released almost weekly, robust benchmarks are needed to evaluate them, and the most commonly used LLM judges, such as Alpaca-Eval, MT-Bench, and Arena-Hard-Auto, carry hidden biases. On the tooling side, Google Cloud's AI Evaluation Service is now being used to evaluate models like Meta's Llama 3.1 8B with Gemini 1.5 Pro as the judge, offering an alternative to human evaluation; this week's Cloud AI Tuesday walks through that workflow.
When it comes to evaluating #LLMs, how does a human differ from LLM-as-a-Judge? 🧐 In our blog, we explore the ways in which LLM-as-a-Judge can offer an alternative to human evaluations. Read more ➡️ https://t.co/XAZmCRZGg9 #AI
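For readers new to the idea, here is a minimal sketch of what pointwise LLM-as-a-Judge looks like in practice: the judge model receives a rubric prompt plus the candidate response and returns a structured score. The `call_judge_model` helper and the 1-5 rubric below are illustrative assumptions, not the blog post's actual implementation.

```python
# Minimal sketch of pointwise LLM-as-a-Judge scoring.
import re

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION for helpfulness and factual accuracy
on a scale of 1 (poor) to 5 (excellent).
Reply with a single line "Score: <1-5>" followed by a brief justification.

QUESTION:
{question}

RESPONSE:
{response}
"""


def call_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a real judge-model call (swap in your API client).

    Returns a canned reply here so the sketch runs end to end.
    """
    return "Score: 4 - Mostly accurate and directly answers the question."


def judge_response(question: str, response: str) -> int:
    """Ask the judge model to grade a single response and parse out the 1-5 score."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"Score:\s*([1-5])", raw)
    if match is None:
        raise ValueError(f"Judge output did not contain a score: {raw!r}")
    return int(match.group(1))


if __name__ == "__main__":
    score = judge_response(
        "What does HTTP status 404 mean?",
        "It means the requested resource was not found on the server.",
    )
    print(f"Judge score: {score}")
```

A human rater fills the same role as `judge_response`, but the judge model can score thousands of responses cheaply and reproducibly, which is the trade-off the blog post explores.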
How can you easily evaluate open LLMs using LLM as a Judge? 🤔 This week, Cloud AI Tuesday shows you how to evaluate @AIatMeta Llama 3.1 8B using the new @googlecloud AI Evaluation Service with Gemini 1.5 Pro as the judge. 🔥 TL;DR: 🚀 Deploy Llama 3.1 8B on Vertex AI using… https://t.co/1blUC4u6Ih
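For a rough picture of what that TL;DR looks like in code, below is a minimal sketch using the Vertex AI Python SDK's Gen AI evaluation interface (`vertexai.evaluation.EvalTask`). The project ID, metric choices, experiment name, and the pre-collected Llama 3.1 8B responses are assumptions for illustration, and exact module paths and metric names may vary by SDK version.

```python
# Sketch: score responses from a Llama 3.1 8B endpoint with the
# Vertex AI Gen AI evaluation service ("LLM as a Judge").
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Assumption: replace with your own project and region.
vertexai.init(project="your-gcp-project", location="us-central1")

# Responses generated beforehand from a Llama 3.1 8B endpoint deployed on Vertex AI.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Explain the difference between supervised and unsupervised learning.",
            "Summarize the main idea of gradient descent in two sentences.",
        ],
        "response": [
            "<Llama 3.1 8B answer 1>",
            "<Llama 3.1 8B answer 2>",
        ],
    }
)

# Pointwise model-based metrics: the service's judge model scores each
# response against a rubric prompt template.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.FLUENCY,
        MetricPromptTemplateExamples.Pointwise.COHERENCE,
    ],
    experiment="llama31-judge-demo",  # assumption: illustrative experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)  # aggregate scores per metric
print(result.metrics_table)    # per-row judge scores and explanations
```

The design choice here is bring-your-own-response: the Llama 3.1 8B endpoint generates answers first, and the evaluation service's Gemini judge only scores them, so the candidate model and the judge stay cleanly separated.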
📢 New paper: With new LLMs like OpenAI o1 and QWEN 2.5 releasing almost every week, robust benchmarks we can run locally are incredibly key! 🧑‍⚖️ LLM-judges 🧑‍⚖️ like Alpaca-Eval, MT-Bench and Arena-Hard-Auto are used most often. Unfortunately, they have ⚖️ *hidden biases* ⚖️ ... 1/n