
Makes sense. It just set off some statistical alarm bells in my head to see a model marked as passing with 1 trial, and some models marked as failing with 5. What if the probability of success is 5% for both models? How confident are we that our grading of the models is correct? It's an interesting problem.
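To make the alarm bells concrete: here's a quick sketch (the 5% figure is just my hypothetical) of how likely a weak model is to pass at least once, depending on how many trials it gets.

```python
# Probability that a model with true per-trial success rate p
# passes at least once in n independent trials: 1 - (1 - p)^n.
def pass_at_least_once(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Hypothetical model with a 5% per-trial success rate:
print(pass_at_least_once(0.05, 1))  # one trial:  0.05
print(pass_at_least_once(0.05, 5))  # five trials: ~0.226
```

So the same underlying model would look like a clear fail with 1 trial but pass almost a quarter of the time with 5 -- which is exactly why comparing pass/fail marks across different trial counts is dicey.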

Cool site btw! Thanks for sharing.



The current metric is actually quite strong -- it mirrors the real-world use case of people trying a few times and being satisfied if any attempt is what they're looking for. It rewards diversity of responses.

Actually, search engines do this too: Google something with many possible meanings -- like "egg" -- and you'll get a set of intentionally diversified results. I get Wikipedia; then a restaurant; then YouTube cooking videos; Big Green Egg's homepage; news stories about egg shortages. Each individual link is very unlike the others to maximize the chance that one of them is the one you want.


It's made a little bit better by the fact that there's something like a dozen different prompts. Across all of the prompts, each model had a fair number of opportunities to show off.
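Right -- with that many prompts the trials add up. A rough sketch (the prompt/trial counts and the 5% rate are illustrative, not from the site) of how unlikely it is for a weak model to luck its way through nothing at all:

```python
# Illustrative numbers: ~12 prompts, a handful of trials each,
# and a hypothetical weak model with p = 0.05 per trial.
p, prompts, trials_per_prompt = 0.05, 12, 5
total_trials = prompts * trials_per_prompt

# Chance the model never passes a single trial across the whole suite.
p_zero_passes = (1 - p) ** total_trials
print(p_zero_passes)  # ~0.046 -- under 5%
```

So across the full suite, even a 5% model will almost certainly show *some* passes; the noisy part is comparing models on any single prompt.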




