Some thoughts on style de-bias & ChatBotArena. This is a great addition. 1. How strong of a correction is this? Is style now ignored or anti-correlated? What is the right balance? 2. How do we create a better default ChatBotArena leaderboard that is a composition of Harder,… https://t.co/7LyEHpwihL
Human Eval Benchmark Was Being Gamed By Style Hacking... Some of us have long maintained that LLMs are being gamed for human preference through lengthy and well-formatted responses. We have also observed that OAI and Google do most of the gaming. Today, Lmsys (the human eval… https://t.co/2vOoqST6S6
Style over substance in Chatbot Arena? Check out our latest work with @LiTianleli @ml_angelopoulos to decouple them with cool statistical techniques for controlling style variables! https://t.co/lQpTA0YLpY

Lmsys has announced a significant update to its Chatbot Arena: style control in its rating regression. The goal is to separate the impact of style from substance in chatbot responses. The update adds style attributes, such as response length and formatting, as features in the logistic regression used to rank models, addressing long-standing concerns that large language models (LLMs) win human preference votes through lengthy and well-formatted responses rather than better content. Collaborators on the project include researchers Tianle Li and Anastasios Angelopoulos, who applied statistical techniques to decouple style from substance. The next phase will focus on causal inference, with an open invitation for causal-inference experts to contribute. Observers have suggested that major players like OpenAI and Google have benefited most from this 'style hacking' phenomenon, raising questions about the integrity of human-evaluation benchmarks in AI.
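To make the mechanism concrete, here is a minimal sketch of how a style covariate can be added to a Bradley-Terry-style logistic regression over pairwise battles. The data, the single style feature (response-length difference), and all coefficients below are synthetic and purely illustrative; this is an assumption-laden toy, not Lmsys's actual pipeline or feature set.

```python
# Toy sketch: style-controlled Bradley-Terry via logistic regression.
# Synthetic battles only; illustrative, not the Lmsys implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

models = ["model_a", "model_b", "model_c"]
n_models = len(models)
n_battles = 5000

# Simulate pairwise battles: pick a "left" and "right" model for each vote.
left = rng.integers(0, n_models, n_battles)
right = rng.integers(0, n_models, n_battles)
mask = left != right
left, right = left[mask], right[mask]
n = left.shape[0]

# Hypothetical true skills, plus a length bias that rewards longer answers.
true_skill = np.array([1.0, 0.3, -0.5])
len_diff = rng.normal(0, 1, n)  # normalized length(left) - length(right)
logit = true_skill[left] - true_skill[right] + 0.8 * len_diff
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # 1 = left model wins

# Design matrix: standard +1/-1 Bradley-Terry model indicators, plus the
# style covariate. The style coefficient absorbs the verbosity effect, so
# the per-model coefficients reflect "substance" with style held fixed.
X_models = np.zeros((n, n_models))
X_models[np.arange(n), left] = 1.0
X_models[np.arange(n), right] = -1.0
X = np.column_stack([X_models, len_diff])

# Large C ~= unpenalized fit; works across sklearn versions.
clf = LogisticRegression(fit_intercept=False, C=1e6)
clf.fit(X, y)

model_scores = dict(zip(models, clf.coef_[0][:n_models].round(2)))
print("style-controlled model scores:", model_scores)
print("length-bias coefficient:", round(clf.coef_[0][n_models], 2))
```

In this setup the length-bias coefficient soaks up the win-rate lift that comes from longer responses, so the recovered model scores track the underlying skills rather than verbosity. This is the basic idea behind adding style as a regression feature; the open questions raised above (how strong the correction should be, and whether style ends up ignored or anti-correlated) come down to how those style coefficients are estimated and interpreted.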
