
A recent study, JUDGE-BENCH, evaluated 11 large language models (LLMs) against human annotations across 20 NLP datasets and concluded that LLMs are not yet ready to replace human judges in natural language processing tasks. Critiques of the study point to its lack of coverage of Chain of Thought (CoT), few-shot prompting, multi-step evaluation methods, and ensembling, and the adequacy of a Cohen's kappa of 0.3 was also questioned. Other evaluations highlighted that larger LLMs performed exceptionally well, while smaller models struggled across a range of tasks. GPT-4 generally led in performance, though no single model dominated all scenarios. LangSmith's self-improving evaluators were also noted, as was Google's Gemini 1.5 Pro, which scored in the 37-44% range on long-context and information-extraction tasks.
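To make the agreement discussion concrete, here is a minimal sketch (not from JUDGE-BENCH; the labels and data are made up) of computing Cohen's kappa between an LLM judge's labels and human annotations, plus a simple majority-vote ensemble of several judges, one of the techniques the critiques mention:

```python
# Sketch: Cohen's kappa between hypothetical LLM-judge labels and human
# annotations, and a majority-vote ensemble of judges. All data is invented.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b)
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items where the two raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each rater's label marginals.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

def majority_vote(judge_labels):
    """Combine several judges' labels per item by simple majority vote."""
    return [Counter(item).most_common(1)[0][0] for item in zip(*judge_labels)]

# Hypothetical binary judgments: 1 = "good", 0 = "bad".
human   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge_a = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
judge_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
judge_c = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

print("kappa, single judge:   ", round(cohens_kappa(human, judge_a), 3))
print("kappa, 3-judge ensemble:",
      round(cohens_kappa(human, majority_vote([judge_a, judge_b, judge_c])), 3))
```

On this toy data the single judge lands around kappa 0.58 and the three-judge ensemble around 0.8, which illustrates why a fixed 0.3 threshold, and the absence of ensembling, are both worth questioning.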
I reviewed the dark field of LLM-as-a-judge so you don't have to. Here are the key findings. Model performance variability: LLMs show inconsistent performance across datasets and tasks. No single model dominates all scenarios. GPT-4 generally leads, with open-source models like… https://t.co/qaRNIgGxsn
Amazing synthesis of the new potential of LLM-as-a-judge/synthetic data to enhance information retrieval systems. And the best methods to get there, especially alignment/calibration on your real-life queries. https://t.co/ysiAQ4qFtd
We recently benchmarked a whole bunch of small LLMs on a variety of small tasks. And, honestly, they sucked. Is there a trick to getting these things to work? The larger LLMs did exceptionally well. And the prompts were not very long.
