I don't have a perfect solution, except for the obvious answer that the best we can do is a combination of multiple benchmarks. It's harder now than ever because you also want to test long contexts, and older benchmarks are over-optimized. I think it would be best to overweight newer benchmarks and underweight benchmarks on which objectively dumb models like Claude 3 Haiku score well. And I have a big problem with HumanEval, which is Python-only and has only 164 problems, yet gets used as the catch-all for coding ability.
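
To make the weighting idea concrete, here's a rough sketch of one way it could work: score each benchmark higher if it's recent and lower if a weak baseline model already does well on it (i.e., it's saturated). The benchmark names, years, and baseline numbers below are made up purely for illustration; this isn't how any real leaderboard computes its aggregate.

```python
def combined_score(model_scores, benchmarks, current_year=2024):
    """Weighted average of per-benchmark scores (all in [0, 1]):
    newer benchmarks count more, benchmarks that a weak baseline
    model already aces count less."""
    total, weight_sum = 0.0, 0.0
    for name, score in model_scores.items():
        meta = benchmarks[name]
        recency = 1.0 / (1 + current_year - meta["year"])    # newer -> larger weight
        headroom = 1.0 - meta["weak_baseline_score"]         # saturated -> smaller weight
        weight = recency * headroom
        total += weight * score
        weight_sum += weight
    return total / weight_sum

# Hypothetical data for illustration only.
benchmarks = {
    "OldBench":   {"year": 2021, "weak_baseline_score": 0.85},  # nearly saturated
    "FreshBench": {"year": 2024, "weak_baseline_score": 0.30},
}
model_scores = {"OldBench": 0.92, "FreshBench": 0.61}
print(round(combined_score(model_scores, benchmarks), 3))
```

The exact weighting function matters much less than the principle: an aggregate dominated by saturated, years-old benchmarks tells you very little about how models differ today.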