Makes sense. It just set off some statistical alarm bells in my head to see a model marked as passing with 1 trial, and some models marked as failing with 5. What if the probability of success is 5% for both models? How confident are we that our grading of the models is correct? It's an interesting problem.
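To make that concrete, here's a quick back-of-the-envelope (the 5% is my hypothetical; the 1 and 5 are the trial counts from the site):

    # If both models truly succeed 5% of the time, how often would they end up
    # labeled the way they did?
    p = 0.05                      # assumed true success rate for both models
    p_pass_in_1 = p               # chance the 1-trial model happens to pass
    p_fail_all_5 = (1 - p) ** 5   # chance the 5-trial model fails every attempt
    print(p_pass_in_1)            # 0.05
    print(p_fail_all_5)           # ~0.77

So with counts that small, "passing" and "failing" are pretty noisy labels.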
The current metric is actually quite strong -- it mirrors the real-world use case of people trying a few times and being satisfied if any one of the attempts is what they're looking for. It rewards diversity of responses.
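A tiny sketch of that intuition (numbers made up, and "diverse" treated as roughly independent, which is generous):

    # Any-of-k scoring: k genuinely different attempts vs. k near-copies.
    p, k = 0.3, 5
    p_any_diverse = 1 - (1 - p) ** k   # ~0.83 if the attempts really differ
    p_any_identical = p                # still 0.3 if they're all the same answer
    print(p_any_diverse, p_any_identical)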
Actually, search engines do this too: Google something with many possible meanings -- like "egg" -- and you'll get a set of intentionally diversified results. I get Wikipedia; then a restaurant; then YouTube cooking videos; Big Green Egg's homepage; news stories about egg shortages. Each individual link is very unlike the others, to maximize the chance that one of them is the one you want.
It's made a little better by the fact that there are something like a dozen different prompts, so across all of them each model had a fair number of opportunities to show off.
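Rough sense of how much the extra prompts help, assuming you can pool trials across prompts (a simplification -- prompts differ in difficulty):

    import math
    p = 0.3                                      # hypothetical per-trial success rate
    se = lambda n: math.sqrt(p * (1 - p) / n)    # standard error of the observed rate
    print(se(5))        # ~0.20 from one prompt's 5 trials
    print(se(12 * 5))   # ~0.06 pooled across ~a dozen prompts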
Cool site btw! Thanks for sharing.