HyperWrite, an AI company, A/B tested language models using Stripe conversion rates as the evaluation metric, prioritizing real-world performance over traditional offline benchmarks. Their tests showed that GPT-4.1 matched the conversion effectiveness of their incumbent model, Claude 3.5 Sonnet, while offering substantial cost savings. The approach reflects a broader shift in model evaluation toward business-relevant outcomes such as customer purchases. The methodology was developed in partnership with OpenAI and detailed in a guide co-authored by HyperWrite's team, which stresses choosing evaluation metrics aligned with actual business goals.
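The underlying pattern is straightforward to sketch: bucket each user deterministically into a model arm, record whether their Stripe checkout converts, and compare the two conversion rates with a standard two-proportion z-test. The code below is a minimal illustration of that pattern, not HyperWrite's actual implementation; the arm labels, experiment name, and conversion counts are all hypothetical.

```python
import hashlib
import math

# Hypothetical arm labels; the real test compared Claude 3.5 Sonnet (control)
# against GPT-4.1 (treatment).
ARMS = {"control": "claude-3-5-sonnet", "treatment": "gpt-4.1"}

def assign_arm(user_id: str, experiment: str = "model-eval") -> str:
    """Deterministically bucket a user so they stay in one arm across sessions."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Z statistic for the difference between two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative (made-up) numbers: conversion counts tallied from Stripe
# checkout events in each arm.
z = two_proportion_z(conv_a=312, n_a=10_000, conv_b=318, n_b=10_000)
print(f"z = {z:.2f}")  # |z| < 1.96: no significant difference at the 95% level
```

Deterministic hashing keeps a returning user in the same arm across sessions, which matters when the conversion event (a purchase) can land days after first exposure to a given model.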
i collaborated with @josh_bickett and team on a guide - "how @hyperwriteAI A/B tested models and chose gpt‑4.1 -- the one that drove higher conversion rates" pick the metric you actually care about -- in this case, stripe conversion. this is how online evals should be. https://t.co/6xiwvxAg6G
We’ve spent years iterating our approach to model testing at HyperWrite. In partnership with OpenAI, @josh_bickett dives deep into how we approach this. Check out the post if you’re interested! https://t.co/yD8FOYBVDr
The Stripe eval: How @HyperwriteAI A/B tested models and chose GPT-4.1—the one that drove the most customer purchases for them: https://t.co/UEPMhf8xhx. https://t.co/Ae5sJLcrUx