
Scale AI, in partnership with the Center for AI Safety (@ai_risks), has released the WMDP benchmark, designed to measure potentially hazardous knowledge in Large Language Models (LLMs). The benchmark consists of 4,157 multiple-choice questions for assessing the risk that an LLM could aid malicious actors. The goal is a standard benchmark that open-source developers can test their models against.
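Since WMDP is a multiple-choice benchmark, evaluation reduces to checking a model's predicted choice against the keyed answer for each item. A minimal sketch of such a scorer — the question format and the always-picks-A stand-in model here are illustrative, not WMDP's actual schema or harness:

```python
def score_mcq(model_answer_fn, questions):
    """Score a model on multiple-choice questions.

    `questions` is a list of dicts with 'question', 'choices', and
    'answer' (the index of the correct choice). `model_answer_fn`
    maps a question dict to a predicted choice index.
    Returns accuracy in [0, 1].
    """
    correct = sum(1 for q in questions if model_answer_fn(q) == q["answer"])
    return correct / len(questions)

# Toy stand-in model that always picks choice 0.
questions = [
    {"question": "Q1", "choices": ["a", "b", "c", "d"], "answer": 0},
    {"question": "Q2", "choices": ["a", "b", "c", "d"], "answer": 2},
]
print(score_mcq(lambda q: 0, questions))  # 0.5
```

In a real harness the predicted index typically comes from comparing the model's log-likelihoods over the four answer options rather than from free-form generation.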
today in AI: 1/ @scale_AI is releasing a new benchmark: WMDP (Weapons of Mass Destruction Proxy). This benchmark tests for hazardous knowledge these models might have picked up in areas like biology, chemistry, and computer hacking. It's built in collaboration with the…
"'Our hope is this becomes adopted as one of the primary benchmarks all open source developers benchmark their models against,' [@alexandr_wang] says." @henshall_will from @TIME covers the impact of Scale and @ai_risks's WMDP benchmark released today https://t.co/TS9UfJ81JW
tinyBenchmarks: Quick and cheap LLM evaluation! We developed ways of making LLM benchmarking cheap and reliable, reducing the compute needed by up to 140x (e.g., on MMLU). paper: https://t.co/CkdShZpgDg GitHub repo: https://t.co/DUHNtwjILT Thread below🧵1/5 https://t.co/NtMO0kDef8
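tinyBenchmarks itself uses more sophisticated item-response-theory estimators, but the core idea behind cheap evaluation — scoring a model on a small subset of items and estimating full-benchmark accuracy from it — can be illustrated with the simplest baseline, a random subsample with a binomial standard error. The data here is synthetic; `estimate_accuracy` is a hypothetical helper, not the tinyBenchmarks API:

```python
import math
import random

def estimate_accuracy(correctness, k, seed=0):
    """Estimate full-benchmark accuracy from a random subsample.

    `correctness` is a list of 0/1 per-question scores for the full
    benchmark; only k of them are "evaluated". Returns the estimate
    and its binomial standard error sqrt(p*(1-p)/k).
    """
    rng = random.Random(seed)
    sample = rng.sample(correctness, k)
    p = sum(sample) / k
    se = math.sqrt(p * (1 - p) / k)
    return p, se

# Synthetic results: a model that answers ~70% of 10,000 questions correctly.
rng = random.Random(42)
full = [1 if rng.random() < 0.7 else 0 for _ in range(10_000)]
est, se = estimate_accuracy(full, k=100)
print(f"estimate={est:.2f} +/- {2*se:.2f} (true={sum(full)/len(full):.2f})")
```

With only 100 of 10,000 items scored (a 100x reduction in evaluation cost), the estimate typically lands within a few points of the true accuracy; IRT-based curation of which items to keep is what lets tinyBenchmarks tighten this further.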