Hacker News

Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/



It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.


If you were a pelican, wouldn't you want to go cycling on a sunny day?

Do electric pelicans dream of touching electric grass?


Do electric pelicans dream of touching electric grass?

That would be shocking news to me.


Please leave the Internet :)


It is odd, yeah.

I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.


They will start to max out this benchmark as well at some point.


It's not a benchmark though, right? Because there's no control group or reference.

It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.

It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.

I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.
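For a sense of what this kind of prompt actually asks for, here is a minimal hand-assembled SVG sketch of a pelican on a bicycle; every shape and coordinate is my own arbitrary illustrative choice, not output from any model:

```python
def pelican_on_bicycle_svg() -> str:
    """Build a crude 'pelican riding a bicycle' SVG.

    Illustrative only: the shapes and coordinates are invented,
    not taken from any model's response.
    """
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">',
        # bicycle: two wheels plus a rough frame
        '<circle cx="55" cy="95" r="20" fill="none" stroke="black"/>',
        '<circle cx="145" cy="95" r="20" fill="none" stroke="black"/>',
        '<path d="M55 95 L95 60 L145 95 M95 60 L85 95" stroke="black" fill="none"/>',
        # pelican: body, head, and the oversized beak
        '<ellipse cx="95" cy="45" rx="18" ry="12" fill="white" stroke="black"/>',
        '<circle cx="112" cy="30" r="7" fill="white" stroke="black"/>',
        '<path d="M118 30 L140 36 L118 38 Z" fill="orange" stroke="black"/>',
        "</svg>",
    ]
    return "".join(parts)

if __name__ == "__main__":
    print(pelican_on_bicycle_svg())
```

Even a sketch this small forces dozens of judgment calls (wheel size, beak placement, whether to add the ubiquitous sun and grass), which is exactly the ambiguity the parent comment is pointing at.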


For 2026 SOTA models I think that is fair.

For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/


RLHF (reinforcement learning from human feedback) is to a large extent about resolving that ambiguity by simply polling people for their subjective judgement.

I've worked on an RLHF project for one of the larger model providers. The instructions provided to the reviewers were very clear: even when there was no objectively correct answer, they were still required to choose the best one. And while there were of course disagreements in the margins, groups of people do tend to converge on the big lines.
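The process described above, forcing each reviewer to pick a winner and then converging across the group, can be sketched as a simple majority vote. The votes and labels here are invented for illustration:

```python
from collections import Counter

def preferred_answer(votes: list[str]) -> str:
    """Pick the answer most reviewers chose (ties broken arbitrarily).

    Each element of `votes` is one reviewer's forced choice between
    candidate answers ("A" or "B" here) -- no abstentions allowed,
    mirroring the reviewer instructions described above.
    """
    counts = Counter(votes)
    return counts.most_common(1)[0][0]

# Hypothetical reviewer panel: disagreement in the margins,
# but a clear majority on the big lines.
votes = ["A", "A", "B", "A", "A", "B", "A"]
print(preferred_answer(votes))  # -> "A"
```

Real RLHF pipelines fit a reward model to pairwise preferences rather than tallying raw votes, but the underlying idea is the same: subjective judgments, aggregated, resolve the ambiguity.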


So if it can generate exactly what you had in mind, presumably based on the subtlest of cues (your personal quirks, gleaned from a few sentences), that could be _terrifying_, right?


Simon has written a page specifically for you: https://simonwillison.net/2025/nov/13/training-for-pelicans-...


This is actually a good benchmark; I used to roll my eyes at it. Then I decided to apply the same idea and asked the models to generate an SVG image of "something" (I'm not going to put it out there). There was a strong correlation between how good the models are and the images they generated. These were also models without vision, so I don't know if you are serious, but this is a decent benchmark.



