I don't have a perfect solution, except for the obvious answer that the best we can do is a combination of multiple benchmarks. It's harder now than ever because you also want to test long contexts, and older benchmarks are over-optimized. I think it would be best to overweight newer benchmarks and underweight benchmarks on which objectively dumb models like Claude 3 Haiku score well. And I have a big problem with HumanEval, which is Python-only and has only 164 problems, yet gets used as the catch-all for coding ability.
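
To make the weighting idea concrete, here's a rough sketch of one way it could work: score each benchmark higher if it's recent and lower if a weak baseline model already does well on it (i.e., it's saturated). The benchmark names, years, and baseline numbers below are made up purely for illustration; this isn't how any real leaderboard computes its aggregate.

```python
def combined_score(model_scores, benchmarks, current_year=2024):
    """Weighted average of per-benchmark scores (all in [0, 1]):
    newer benchmarks count more, benchmarks that a weak baseline
    model already aces count less."""
    total, weight_sum = 0.0, 0.0
    for name, score in model_scores.items():
        meta = benchmarks[name]
        recency = 1.0 / (1 + current_year - meta["year"])    # newer -> larger weight
        headroom = 1.0 - meta["weak_baseline_score"]         # saturated -> smaller weight
        weight = recency * headroom
        total += weight * score
        weight_sum += weight
    return total / weight_sum

# Hypothetical data for illustration only.
benchmarks = {
    "OldBench":   {"year": 2021, "weak_baseline_score": 0.85},  # nearly saturated
    "FreshBench": {"year": 2024, "weak_baseline_score": 0.30},
}
model_scores = {"OldBench": 0.92, "FreshBench": 0.61}
print(round(combined_score(model_scores, benchmarks), 3))
```

The exact weighting function matters much less than the principle: an aggregate dominated by saturated, years-old benchmarks tells you very little about how models differ today.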