I have bad news for you: LLMs are not reading *llms.txt* nor *AGENTS.md* files f...

michaelcampbell · 2026-02-18T12:42:25 1771418545

I also wonder; it's a normal scraper mechanism doing the scraping, right? Not necessarily an LLM in the first place so the wholesale data-sucking isn't going "read" the file even if it IS accessed?

Or is this file meant to be "read" by an LLM long after the entire site has been scraped?

hamdingers · 2026-02-18T15:26:29 1771428389

Yes. It's a basic scraper that fetches the document, parses it for URLs using regex, then fetches all those, repeat forever.

I've done honeypot tests with links in html comments, links in javascript comments, routes that only appear in robots.txt, etc. All of them get hit.

efreak · 2026-02-18T19:45:56 1771443956

What about scripted transformations? Or just add a simple timestamp to the query and only allow it to be used up to a week later? (Whether it works without the parameter could be tested too)

dumbfounder · 2026-02-18T18:33:36 1771439616

We need to update robots.txt for the LLM world, help them find things more efficiently (or not at all I guess). Provide specs for actions that can be taken. Etc.

gamesieve · 2026-02-18T19:38:13 1771443493

If current behaviour is anything to go by, they will ignore all such assistance, and instead insist on crawling infinite variations of the same content accessed with slightly different URL-patterns, plus hallucinate endless variations of non-existent but plausible looking URLs to hit as well until the server burns down - all on the off-chance that they might see a new unique string of text which they can turn into a paperclip.

hamdingers · 2026-02-18T22:24:17 1771453457

There's no LLM in the loop at all, so any attempt to solve it by reasoning with an LLM is missing the point. They're not even "ignoring" assistance as sibling supposes. There simply is no reasoning here.

This is what you should imagine when your site is being scraped:

   def crawl(url):
    r = requests.get(url).text
    store(text)
    for link in re.findall(r'https?://[^\s<>"\']+', r):
        crawl(link)

flaburgan · 2026-02-18T23:09:28 1771456168

Sure, but at some point the idea is to train an LLM on these downloaded files no? I mean what is the point of getting them if you don't use them. So sure, this won't be interpreted during the crawling but it will become part of the knowledge of the LLM

hamdingers · 2026-02-19T16:24:08 1771518248

Training is not inference, there is no reasoning happening then either.

Even if it did have some effect down the line it wouldn't help sites like AA with their scraping problem, which is the issue at hand.

boothby · 2026-02-18T22:58:07 1771455487

You mean to add bad Monte-Carlo generated slop pages which are only advertised as no-go in the robots.txt file, right?

reconnecting · 2026-02-18T12:56:05 1771419365

Absolutely.

I assume that there are data brokers, or AI companies themselves, that are constantly scraping the entire internet through non-AI crawlers and then processing data in some way to use it in the learning process. But even through this process, there are no significant requests for LLMs.txt to consider that someone actually uses it.

olivia-banks · 2026-02-18T21:16:24 1771449384

I assume this might be changing. Anecdotally, from what I've read here, I think we're starting to see headless browsers driven by LLMs for the purposes of scraping (to get around some of the content blocks we're seeing). Perhaps this is a solution to a problem that won't work now, but in the future, maybe.

giancarlostoro · 2026-02-18T14:38:33 1771425513

I think it depends. LLMs now can look up things on the fly to bypass the whole "this model was last updated in December 2025" issue of having dated information. I've literally told Claude before to look up something after it accused me of making up fake news.

cardanome · 2026-02-18T09:11:59 1771405919

Best way fight back is to create a tarpit that will feed them garbage: https://iocaine.madhouse-project.org/

bee_rider · 2026-02-18T20:07:06 1771445226

This is a file for a LLM, not a scraper, so anti-scraping mitigations seem sort of beside the point.

jacquesm · 2026-02-18T14:46:48 1771426008

And to try to get them execute bb(5) ;)

joquarky · 2026-02-18T19:49:32 1771444172

claude --plan "let's develop a plan to detect and mitigate tarpits"

Ten minutes later, the ball is back in your court.

epidemian · 2026-02-18T23:29:51 1771457391

Do you think an LLM would be able to generate a solution to a novel problem just like that?

That doesn't match my (albeit limited) experience with these things. They are pretty good at other things, but generally squarely in the real of "already done" things.

blargey · 2026-02-19T01:22:47 1771464167

Anti-crawler tarpits and related concepts have existed for decades already; LLM training data is only the latest and most popular of web-scraping goals.

Claude is happy and able to provide a laundry list of ways to mitigate the impact of tarpits on your crawler, and politeness / respecting robots.txt is only one of them.

hiccuphippo · 2026-02-18T17:05:02 1771434302

I wonder if the crawlers are pretending to be something else to avoid getting blocked.

I see Bun (which was bought by Anthropic) has all its documentation in llms.txt[0]. They should know if Claude uses it or wouldn't waste the effort in building this.

[0] https://bun.sh/llms.txt

CognitiveLens · 2026-02-18T19:14:17 1771442057

As a project that started with a lot of idealism about how software _should_ be built, I would totally expect Bun to have an llms.txt file even if Claude wasn't using it. It's a project that is motivated in part by leading by example.

reconnecting · 2026-02-18T17:09:06 1771434546

I also noticed this LLMs.txt at bun.sh, so for me it looks like some sort of advertising.

post-it · 2026-02-18T22:59:59 1771455599

Optimistic to assume the Bun team and the Claude team talk to each other

nozzlegear · 2026-02-18T19:17:05 1771442225

Did they do that before they were bought by Anthropic? Perhaps it's just part of a CI process that nobody's going to take an axe to without good reason.

jph00 · 2026-02-18T17:09:11 1771434551

llms.txt files have nothing to do with crawlers or big LLM companies. They are for individual client agents to use. I have my clients set up to always use them when they’re available, and since I did that they’ve been way faster and more token efficient when using sites that have llms.txt files.

So I can absolutely assure you that LLM clients are reading them, because I use that myself every day.

reconnecting · 2026-02-18T17:24:34 1771435474

Thanks for the clarification.

>for use in LLMs such as Claude (1)

From your website, it seems to me that LLMs.txt is addressed to all LLMs such as Claude, not just 'individual client agents' . Claude never touched LLMs.txt on my servers, hence the confusion.

1. https://llmstxt.org

GaggiX · 2026-02-18T09:25:01 1771406701

This is meant for openclaw agents, you are not gonna see a ChatGPT or Claude User-Agent. That's why they show it in a normal blog page and not just as /llms.txt

reconnecting · 2026-02-18T09:32:33 1771407153

In tirreno (our product), we catch every resource request on the server side, including LLMs.txt and agents.md, to get the IP that requested it and the UA.

What I've seen from ASNs is that visits are coming from GOOGLE-CLOUD-PLATFORM (not from Google itself), and OVH. Based on UA, users are: WebPageTest, BuiltWith, and zero LLMs based on both ASN and UA.

1. https://github.com/tirrenotechnologies/tirreno

GaggiX · 2026-02-18T09:38:26 1771407506

Openclaw agents use the same browser and ASN that me and you use, also the llms.txt (as shown) is displayed as a normal blog page so it can be discover by the agents without having to fetch /llms.txt at random.

reconnecting · 2026-02-18T09:43:24 1771407804

When I look at LLMs.txt, I see every request and there are no ASNs from residential networks or browsers UA.

GaggiX · 2026-02-18T10:04:17 1771409057

For the third time I'm telling you on Anna’s Archive they have displayed the llms.txt as a standard blog page, not hidden in /llms.txt, so that agents can notice it without having to fetch /llms.txt at random. That's why it's meant for openclaw agents and not openai/anthropic crawlers.

supermatt · 2026-02-18T10:37:07 1771411027

I don’t understand your reasoning.

Are you suggesting that openclaw will magically infer a blog post url instead? Or that openclaw will traverse the blog of every site regardless of intent?

Anyway, AA do provide it as a text file at /llms.txt, no idea why you think it is a blog post, or how that makes it better for openclaw.

GaggiX · 2026-02-18T11:30:39 1771414239

>AA do provide it as a text file at /llms.txt, no idea why you think it is a blog post

It's a blog post, it's shown as the first item in Anna’s Blog right now, and as I said in my first comment it's also available as /llms.txt

>Are you suggesting that openclaw will magically infer a blog post url instead? Or that openclaw will traverse the blog of every site regardless of intent?

If an openclaw decide to navigate AA it would see the post (as it is shown in the homepage) and decide to read it as it called "If you’re an LLM, please read this'.

reconnecting · 2026-02-18T10:13:15 1771409595

My point is about LLM crawlers specifically.

PathfinderBot · 2026-02-18T10:34:21 1771410861

LLM crawlers aren't really a thing, at least not in the "they have agency over what they're crawling and read what they crawl" way.

whazor · 2026-02-18T09:30:26 1771407026

what if you add a  to every .html

reconnecting · 2026-02-18T09:41:05 1771407665

Actually, I noticed an interesting behaviour in LLMs.

We had made a docs website generator (1) that works with HTML (2) FRAMESET and tried to parse it with Claude.

Result: Claude doesn't see the content that comes from FRAMESET pages, as it doesn't parse FRAMEs. So I assume what they're using is more or less a parser based on whole-page rendering and not on source reading (including comments).

Perhaps, this is an option to avoid LLM crawlers: use FRAMEs!

1. https://github.com/tirrenotechnologies/hellodocs

2. https://www.tirreno.com/hellodocs/

rep_lodsb · 2026-02-18T15:13:14 1771427594

With the WWW, from here on out and especially in multimedia WWW applications, frames are your friend. Use them always. Get good at framing. That is wisdom from Gary.

The problem most website designer have is that they do not recognize that the WWW, at its core, is framed. Pages are frames. As we want to better link pages, then we must frame these pages. Since you are not framing pages, then my pages, or anybody else's pages will interfere with your code (even when the people tell you that it can be locked - that is a lie). Sections in a single html page cannot be locked. Pages read in frames can be.

Therefore, the solution to this specific technical problem, and every technical problem that you will have in the future with multimedia, is framing.

Frames securely mediate, by design. Secure multi-mediation is the future of all webbing.

giancarlostoro · 2026-02-18T14:30:22 1771425022

If they run across a blog post pointing to it, they might. Did you test that?

Edit: Someone else pointed out, these are probably scrapers for the most part, not necessarily the LLM directly.

joquarky · 2026-02-18T19:53:51 1771444431

It would be foolish to use the LLM directly without a wrapper that detects prompt injection attempts.

bee_rider · 2026-02-18T20:12:15 1771445535

I think this is trying to appeal to the sort of agentic/molt-y type systems that recently became popular. Their whole thing is that they can modify their “prompts” in some way.

cactusplant7374 · 2026-02-18T18:37:18 1771439838

It sounds really expensive to run inference as a crawler.

mancerayder · 2026-02-19T02:17:52 1771467472

Now we get into a future legal problem for someone to argue back and forth:

The LLM agents behave like people. People read web pages, never reading agents.nd or of course llms.txt. Are they legally scrapers or something more like Selenium agents that simulate people and that's okay? I know which one I think is true.

chrisjj · 2026-02-18T18:43:28 1771440208

Doesn't sound like bad news to me.

Anything that reduces the load impact of the plagaristic parrots is a good thing, surely.

cratermoon · 2026-02-19T01:35:29 1771464929

Make them request it. Put a link to it on every page served from your site, in the footer or sidebar. Make the text or icon for the link invisible to humans by making the text color the same as the background and use the smallest point size you can reasonably support.

Spivak · 2026-02-18T20:31:55 1771446715

And they probably shouldn't. I think it's a premature optimization to assume LLMs need their own special internet over markdown when they're perfectly capable of reading the HTML just fine.

Why maintain two sets of documentation?

Sharlin · 2026-02-18T13:50:05 1771422605

You could insert the message on every single webpage you serve, hidden visually and from screenreaders.

gooob · 2026-02-18T16:18:56 1771431536

wait why not robots.txt?

reconnecting · 2026-02-18T16:29:38 1771432178

Good question, at least OAI-SearchBot is hitting robots.txt.

I assume the real issue is that what overloads the servers like security bots, SEO crawlers, and data companies — are the ones that don't respect robots.txt in full, but they wouldn't respect LLMs.txt either.

alterom · 2026-02-18T19:08:15 1771441695

>I have bad news for you: LLMs are not reading llms.txt

...Which is why this is posted as blog post.

They'll scrape and read that.