
Recent advancements in large language models (LLMs) have been highlighted by several AI organizations. Gradient AI released Llama-3-8B-Instruct-Gradient-1048k, a Llama 3 variant whose context window is extended to roughly 1 million tokens, achieved through changes to positional encoding and more GPU-efficient training. Scale AI introduced a new held-out evaluation set designed to mirror GSM8k, revealing that some models, such as Mistral and Phi, show significant overfitting to the public benchmark, while others, such as GPT-4 and Claude, hold up well, with GPT-4-turbo scoring 84.9% and phi-3-mini 76.3%. Meta's Llama 3 has been noted for its cost-effectiveness and strong performance, comparable to leading models like GPT-4. New metrics and benchmarks have also been established to assess the speed and efficiency of these models.

We benchmarked @Meta Llama3 @databricks DBRX @MistralAI Mixtral-8x22b and @OpenAI GPT4 on our Product Catalog Q&A dataset! 1⃣ Llama-3-70b matches GPT4's pace! 2⃣ Llama-3-8b and Mixtral-8x22b have almost identical performance. 3⃣ DBRX is definitely the chattiest of the bunch 😉 https://t.co/wkM3jEE5bL
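The comparison in the post above boils down to two per-model measurements: response latency ("pace") and output length ("chattiness"). A minimal sketch of such a harness, where `generate` is a hypothetical stand-in for whatever client call each provider exposes and the stub answer is illustrative only:

```python
import time

def profile(generate, prompt: str) -> dict:
    """Time one model call and count output tokens (crude whitespace split)."""
    start = time.perf_counter()
    answer = generate(prompt)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        "output_tokens": len(answer.split()),
    }

# Stub model for demonstration; a real run would wrap each provider's API
# (Llama-3, DBRX, Mixtral, GPT-4) behind the same generate() signature.
stub = lambda prompt: "The warranty covers parts and labor for two years."
stats = profile(stub, "What does the warranty cover?")
print(stats["output_tokens"])
```

Running the same prompt set through each wrapped model and averaging the two numbers gives exactly the kind of pace/verbosity ranking the post reports.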
Interesting to see people try to quantify the overfitting we see on certain LLM problems! Some models appear badly contaminated/overfit, but the flagship models - Google's Gemini, OpenAI's GPT, Anthropic's Claude - seem pretty safe, at least on elementary school level math. https://t.co/gedcLW6uk2 https://t.co/JgFXuD2dvW
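The contamination check described above can be made concrete: score each model on the public benchmark and on a freshly written held-out set of matched difficulty, and treat the gap as an overfitting signal. A minimal sketch with purely illustrative scores (no real measurements):

```python
def overfit_gap(public_acc: float, heldout_acc: float) -> float:
    """Positive gap suggests the public set leaked into training data."""
    return public_acc - heldout_acc

# Hypothetical (public_benchmark_acc, fresh_heldout_acc) pairs.
scores = {
    "model_a": (0.85, 0.84),  # small gap: likely uncontaminated
    "model_b": (0.82, 0.68),  # large gap: possible overfitting
}

gaps = {name: round(overfit_gap(p, h), 2) for name, (p, h) in scores.items()}
print(gaps)
```

Flagship models with small gaps on such a probe are the "pretty safe" cases the post refers to.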
Nvidia released a new model, Llama3-ChatQA-1.5, which excels at conversational question answering (QA) and retrieval-augmented generation (RAG). The 8B version outperforms the base Llama3-70B model on the ConvRAG benchmark. https://t.co/Eyu7BtuMio