How can it be 99.2% on GSM8k? Does it also solve the noisy questions? 😀 Someone should do a robustness study by paraphrasing the questions, or just changing numbers!!! https://t.co/cJZoEfGBc8
couldn't try it properly because of traffic but 70B tune isn't beating 405B overall. Nope. this is more like an advertisement to me. 99.2% accuracy on gsm8k is actually a bad signal. https://t.co/uLRRCgtE6g
I read some GSM8K test examples myself and I find it hard to believe only 1% of them are wrong/questionable. https://t.co/wqQVOF9S9X
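Recent discussions on social media have questioned a reported 99.2% accuracy on the GSM8K benchmark. Some commenters read a score this high as a possible sign of data leakage or overfitting rather than genuine ability. Lukasz Kaiser said that, after reading some GSM8K test examples himself, he finds it hard to believe that only 1% of them are wrong or questionable. The implied reasoning is that if more than 0.8% of the reference answers were actually wrong, a 99.2% score would require matching some incorrect labels. Another user, who could not try the model properly because of traffic, argued that the 70B tune does not beat the 405B model overall, described the announcement as closer to an advertisement, and called 99.2% accuracy on GSM8K a bad signal. Swaroop Rm7 asked whether the model also solves noisy questions and called for a robustness study that paraphrases the questions or changes their numbers.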
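The "just changing numbers" check proposed above could be sketched roughly as follows. This is a minimal illustration, not an actual GSM8K harness: the `TemplatedProblem` class, the muffin question, the sampling ranges, and the `memorizing_model` stand-in are all hypothetical, and a real study would need templates derived from actual test items plus a real model call.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TemplatedProblem:
    """A word problem whose numbers can be resampled (hypothetical structure)."""
    template: str                            # question text with {placeholders}
    solve: Callable[[Dict[str, int]], int]   # ground-truth answer from the values
    ranges: Dict[str, Tuple[int, int]]       # inclusive sampling range per placeholder


def make_variants(problem: TemplatedProblem, n: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Return n (question, answer) pairs with freshly sampled numbers."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        values = {k: rng.randint(lo, hi) for k, (lo, hi) in problem.ranges.items()}
        variants.append((problem.template.format(**values), problem.solve(values)))
    return variants


def accuracy(model: Callable[[str], int], items: List[Tuple[str, int]]) -> float:
    """Fraction of items where the model's answer equals the reference answer."""
    return sum(1 for question, answer in items if model(question) == answer) / len(items)


# Illustrative, made-up problem in the style of a grade-school word problem.
# The ranges keep the answer non-negative: a * b >= 4 and c <= 4.
PROBLEM = TemplatedProblem(
    template=("A baker fills {a} trays with {b} muffins each and sells {c} muffins. "
              "How many muffins are left?"),
    solve=lambda v: v["a"] * v["b"] - v["c"],
    ranges={"a": (2, 9), "b": (2, 12), "c": (1, 4)},
)

# The "original" test item, fixed by hand.
ORIGINAL_VALUES = {"a": 3, "b": 12, "c": 4}
ORIGINAL = (PROBLEM.template.format(**ORIGINAL_VALUES), PROBLEM.solve(ORIGINAL_VALUES))


def memorizing_model(question: str) -> int:
    """Stand-in for a model that has only memorized the original item."""
    return ORIGINAL[1] if question == ORIGINAL[0] else -1


if __name__ == "__main__":
    variants = make_variants(PROBLEM, n=20, seed=1)
    print("original accuracy: ", accuracy(memorizing_model, [ORIGINAL]))
    print("perturbed accuracy:", accuracy(memorizing_model, variants))
```

A large gap between accuracy on the original items and on the number-perturbed variants would be consistent with memorization, while similar scores on both would support the reported result. Paraphrasing the question text would need a separate rewriting step, which this sketch does not attempt.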