
Robots.txt vs signed agent allowlisting: where WordPress ends and the edge begins

There is a sentence more site owners need to hear:

not every machine-access problem is a robots.txt problem.

Some problems are still clean crawl-policy decisions. Some problems are now identity and runtime-permission decisions.

That boundary matters more in 2026 because major operators now document:

  • user-triggered fetchers that generally ignore robots.txt;
  • signed agents that can be allowlisted at CDN or firewall level;
  • verified-bot categories and agent directories at the edge.

If you do not separate those layers, you either overpromise what a plugin can do or leave uncontrolled the traffic that your infrastructure should actually govern.

The short version

Use the following rule of thumb.

Layer | Main role | What it is good for
robots.txt | Public crawl policy | Search crawl access, many training opt-outs, category-level path rules
Usage signals and llms.txt | Published downstream-use and reading guidance | Search vs training vs answer posture, machine guidance, interpretation reduction
Signed-agent allowlisting | Verified agent identity | Trusting a specific signed agent at the edge
CDN / WAF / runtime controls | Enforcement and permissions | Session boundaries, login flows, purchases, abuse limits, bot verification

This is why "The machine governance file stack" exists as an architectural article and not just a settings guide.

What robots.txt can do well

robots.txt is still essential.

It is the right layer when the question is:

  • should this crawler fetch these paths;
  • should Search stay open;
  • should a training crawler be refused at the crawl-policy layer;
  • should low-value URL spaces be cut off.

That includes many of the practical choices Better Robots.txt was built to help publish (a short robots.txt sketch follows this list):

  • search visibility;
  • training crawler posture;
  • archive boundaries;
  • SEO-tool posture;
  • crawl hygiene by path and category.
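As a rough illustration of those choices, here is a minimal robots.txt sketch; the crawler tokens and paths are examples, not a recommended policy:

# Keep search crawl access open, but apply crawl hygiene by path
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /?s=

# Refuse a training crawler at the crawl-policy layer
User-agent: GPTBot
Disallow: /

# Default for all other crawlers: cut off low-value URL spaces
User-agent: *
Disallow: /wp-admin/
Disallow: /?s=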

This is a publishing and policy layer.

It is not a signed-identity enforcement layer.

What changed in the agentic web

Two kinds of documentation make the boundary much clearer now.

Google’s user-triggered fetchers

Google says user-triggered fetchers generally ignore robots.txt because the fetch is initiated by a user.

It also documents Google-Agent as being used by agents hosted on Google infrastructure to navigate the web and perform actions on user request.

That means part of the Google agent problem is no longer "What does my public crawl file say?"

It is "How do I handle agent traffic that belongs to a different request class?"

OpenAI’s signed agent model

OpenAI documents ChatGPT agent as a signed and allowlistable agent in major edge ecosystems such as Akamai, Cloudflare, and HUMAN.

That is a different control world.

In that world, the relevant questions are:

  • is the agent cryptographically or operationally verified;
  • should it be trusted to pass the edge;
  • what permissions should it have once trusted;
  • what should happen at login, checkout, or other session-sensitive routes?

Those are not questions that robots.txt was ever designed to answer.

What signed-agent allowlisting actually does

Signed-agent allowlisting is about trusted identity at the edge.

That means:

  • the request is not treated as just another claimed user-agent string;
  • the CDN, WAF, or trusted bot directory is involved;
  • the system may expose explicit permissions or allow/deny behavior for that agent class.

In practice, this can include:

  • allowing a specific verified agent through the edge;
  • narrowing its route access;
  • granting or denying action permissions;
  • keeping logs and decisions tied to an authenticated or trusted bot identity.

That is a very different problem from saying:

User-agent: SomeBot
Disallow: /private/
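By contrast, here is a small TypeScript sketch of what an allowlist keyed to verified agent identity can express. Every identifier, route, and permission field here is a placeholder, and the actual verification (signature checks, a CDN's verified-bot directory) is assumed to happen upstream at the edge.

// Sketch: permissions tied to a verified agent identity rather than a
// claimed user-agent string. All names and values are hypothetical.
interface AgentPermissions {
  allowedRoutes: string[];  // route narrowing
  canSubmitForms: boolean;  // action permission
}

const allowlist: Record<string, AgentPermissions> = {
  "example-signed-agent": { allowedRoutes: ["/docs/", "/products/"], canSubmitForms: false },
};

function authorize(verifiedAgentId: string | null, path: string): "allow" | "deny" {
  if (!verifiedAgentId) return "deny";       // identity was never verified at the edge
  const perms = allowlist[verifiedAgentId];
  if (!perms) return "deny";                 // verified, but not allowlisted on this site
  const routeOk = perms.allowedRoutes.some((prefix) => path.startsWith(prefix));
  return routeOk ? "allow" : "deny";
}

The important difference is the input: authorize() starts from an identity the edge has already verified, which is something a public crawl file never provides.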

Where WordPress ends

The WordPress or publishing layer is where you should:

  • publish robots.txt;
  • publish AI usage posture;
  • publish llms.txt and related machine-readable guidance;
  • clarify category boundaries;
  • reduce contradictions across public policy surfaces.

That is the role Better Robots.txt can genuinely help with.

It is a governance and publication layer.
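For the llms.txt part of that role, a minimal sketch in the style proposed at llmstxt.org could look like the following. The site name, section titles, and URLs are placeholders, not a recommended structure.

# Example Site
> A short, plain-language summary of what this site covers and how machine readers should approach it.

## Documentation
- [Product docs](https://example.com/docs/): canonical reference material

## Policies
- [AI usage policy](https://example.com/ai-usage/): downstream-use posture for search, training, and answers

Like robots.txt, this is published guidance: it reduces interpretation problems, but it verifies and enforces nothing.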

Where the edge begins

The edge begins when the real problem becomes:

  • verified bot identity;
  • allowlisting;
  • signed or trusted agent directories;
  • runtime permissions;
  • WAF rules;
  • session boundaries;
  • rate limiting and abuse control.

At that point, the question is no longer "What file do I publish?"

It becomes "What traffic do I trust, and what may it actually do?"

This is exactly the right moment to stop oversimplifying the problem into a robots.txt checkbox.
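To show what that looks like in practice, here is a minimal TypeScript sketch of runtime enforcement that no published policy file can provide: a session boundary on sensitive routes plus a per-identity rate limit. Route names, thresholds, and the in-memory store are illustrative; a real deployment would lean on the CDN's or WAF's own rate-limiting and bot-management features.

// Sketch: enforcement decisions made at request time. All values illustrative.
const WINDOW_MS = 60_000;  // sliding-window length
const MAX_REQUESTS = 30;   // per identity, per window

const recentHits = new Map<string, number[]>();

function allowRequest(identity: string, path: string, hasValidSession: boolean): boolean {
  // Session boundary: checkout and account routes need a real session,
  // regardless of what robots.txt or llms.txt say.
  if ((path.startsWith("/checkout") || path.startsWith("/my-account")) && !hasValidSession) {
    return false;
  }

  // Simple sliding-window rate limit keyed to a verified (or at least stable) identity.
  const now = Date.now();
  const hits = (recentHits.get(identity) ?? []).filter((t) => now - t < WINDOW_MS);
  hits.push(now);
  recentHits.set(identity, hits);
  return hits.length <= MAX_REQUESTS;
}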

Three common mistakes

1. Asking robots.txt to solve signed-agent access

It cannot do that on its own.

It can still express policy and intent, but it does not perform edge verification.

2. Treating signed agents as plain user-agent strings

That throws away the very thing that makes the signed-agent model valuable: trusted identity.

3. Overpromising plugin scope

A WordPress plugin can publish a better policy. It does not magically become your CDN, WAF, identity layer, or runtime permission system.

A practical decision path

Use this sequence.

If the real issue is crawl access

Use robots.txt.

If the real issue is downstream-use posture or machine guidance

Use usage signals, policy surfaces, and llms.txt.

If the real issue is agent identity and trust

Use signed-agent allowlisting or verified-bot controls at the edge.

If the real issue is what the agent may do after access is granted

Use runtime permissions, route controls, and infrastructure enforcement.

Where Better Robots.txt fits

Better Robots.txt fits before the edge, not instead of the edge.

It helps you publish a cleaner public machine policy and a clearer governance position.

That is valuable precisely because it prevents teams from mixing:

  • crawl policy;
  • usage posture;
  • interpretation guidance;
  • infrastructure enforcement.

The more clearly those are separated, the more mature the final governance model becomes.

The correct mental model

The safest mental model is this:

  • robots.txt = public crawl intent
  • usage signals and llms.txt = public use and guidance intent
  • signed-agent allowlisting = verified identity at the edge
  • WAF / CDN / runtime controls = actual enforcement and permissions

That boundary is not a weakness of the plugin model. It is the reason a serious governance stack needs more than one layer.