We tried out the brand-new GPT-4.1 on one of our internal agent benchmarks and the results were good! A substantial ~10% improvement over GPT-4o on its own, and a ~2% improvement on top of our already-strong agentic approach. If you'd like to learn more about what we're working on, https://t.co/R0VY3LdDOY
h/t @noahmacca for the excellent GPT-4.1 prompting guide https://t.co/AGiVslbzpb
How good is the new GPT-4.1? This is a pretty awesome new "benchmark" from OpenAI: how well does the model do at one-shotting a web app for creating flashcards? gpt-4o on the left, gpt-4.1 on the right https://t.co/7PEq6f7Mxm
OpenAI's GPT-4.1 has reportedly doubled accuracy on a challenging SQL generation evaluation set used to assess real-world analytical performance, an improvement highlighted by Hex. Other benchmarks echo the trend: roughly a 10% improvement over GPT-4o on its own, and a 2% gain on top of an already strong agentic approach. Taken together, these results point to a notable step up in performance, particularly for generating accurate SQL queries and one-shotting web applications such as a flashcard app.