Researchers have developed a new toolbox for surfacing health equity harms and biases in large language models (LLMs) such as Med-PaLM 2; the framework evaluates bias along six dimensions, including inaccuracies and stereotypical language, and was tested on more than 17,000 ratings. Meanwhile, with new LLMs like OpenAI o1 and QWEN 2.5 being released almost weekly, robust benchmarks are needed to evaluate them, and the most commonly used LLM judges, such as Alpaca-Eval, MT-Bench, and Arena-Hard-Auto, carry hidden biases. On the tooling side, Google Cloud's AI Evaluation Service is now being used to evaluate models like Meta's Llama 3.1 8B with Gemini 1.5 Pro as the judge, offering an alternative to human evaluation; this week's Cloud AI Tuesday walks through that workflow.
When it comes to evaluating #LLMs, how does a human differ from LLM-as-a-Judge? 🧐 In our blog, we explore the ways in which LLM-as-a-Judge can offer an alternative to human evaluations. Read more ➡️ https://t.co/XAZmCRZGg9 #AI
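For readers new to the idea, here is a minimal sketch of what pointwise LLM-as-a-Judge looks like in practice: the judge model receives a rubric prompt plus the candidate response and returns a structured score. The `call_judge_model` helper and the 1-5 rubric below are illustrative assumptions, not the blog post's actual implementation.

```python
# Minimal sketch of pointwise LLM-as-a-Judge scoring.
import re

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION for helpfulness and factual accuracy
on a scale of 1 (poor) to 5 (excellent).
Reply with a single line "Score: <1-5>" followed by a brief justification.

QUESTION:
{question}

RESPONSE:
{response}
"""


def call_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a real judge-model call (swap in your API client).

    Returns a canned reply here so the sketch runs end to end.
    """
    return "Score: 4 - Mostly accurate and directly answers the question."


def judge_response(question: str, response: str) -> int:
    """Ask the judge model to grade a single response and parse out the 1-5 score."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"Score:\s*([1-5])", raw)
    if match is None:
        raise ValueError(f"Judge output did not contain a score: {raw!r}")
    return int(match.group(1))


if __name__ == "__main__":
    score = judge_response(
        "What does HTTP status 404 mean?",
        "It means the requested resource was not found on the server.",
    )
    print(f"Judge score: {score}")
```

A human rater fills the same role as `judge_response`, but the judge model can score thousands of responses cheaply and reproducibly, which is the trade-off the blog post explores.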
How can you easily evaluate open LLMs using LLM as a Judge? 🤔 This week, Cloud AI Tuesday shows you how to evaluate @AIatMeta Llama 3.1 8B using the new @googlecloud AI Evaluation Service with Gemini 1.5 Pro as the judge. 🔥 TL;DR: 🚀 Deploy Llama 3.1 8B on Vertex AI using… https://t.co/1blUC4u6Ih
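For a rough picture of what that TL;DR looks like in code, below is a minimal sketch using the Vertex AI Python SDK's Gen AI evaluation interface (`vertexai.evaluation.EvalTask`). The project ID, metric choices, experiment name, and the pre-collected Llama 3.1 8B responses are assumptions for illustration, and exact module paths and metric names may vary by SDK version.

```python
# Sketch: score responses from a Llama 3.1 8B endpoint with the
# Vertex AI Gen AI evaluation service ("LLM as a Judge").
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Assumption: replace with your own project and region.
vertexai.init(project="your-gcp-project", location="us-central1")

# Responses generated beforehand from a Llama 3.1 8B endpoint deployed on Vertex AI.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Explain the difference between supervised and unsupervised learning.",
            "Summarize the main idea of gradient descent in two sentences.",
        ],
        "response": [
            "<Llama 3.1 8B answer 1>",
            "<Llama 3.1 8B answer 2>",
        ],
    }
)

# Pointwise model-based metrics: the service's judge model scores each
# response against a rubric prompt template.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.FLUENCY,
        MetricPromptTemplateExamples.Pointwise.COHERENCE,
    ],
    experiment="llama31-judge-demo",  # assumption: illustrative experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)  # aggregate scores per metric
print(result.metrics_table)    # per-row judge scores and explanations
```

The design choice here is bring-your-own-response: the Llama 3.1 8B endpoint generates answers first, and the evaluation service's Gemini judge only scores them, so the candidate model and the judge stay cleanly separated.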
📢 New paper: With new LLMs like OpenAI o1 and QWEN 2.5 releasing almost every week, robust benchmarks we can run locally are incredibly key! 🧑‍⚖️ LLM-judges 🧑‍⚖️ like Alpaca-Eval, MT-Bench and Arena-Hard-Auto are used most often. Unfortunately, they have ⚖️ *hidden biases* ⚖️ ... 1/n