
Search vs ai-input vs ai-train: what these signals really mean

One of the most useful improvements in machine-access governance is not a new bot name.

It is a better vocabulary.

The difference between search, ai-input, and ai-train matters because those three purposes are not the same thing:

  • one is about discoverability and search results;
  • one is about answer-time model input;
  • one is about training or fine-tuning.

If a site collapses those into one vague "AI allow" or "AI deny" position, it loses most of the precision modern governance now makes possible.

The short version

Here is the simplest practical interpretation.

| Signal | What it means | Main business question |
| --- | --- | --- |
| search | Search indexing and results with links and excerpts | Do I want this content discoverable in search experiences that can cite and send traffic? |
| ai-input | Inputting content into AI models at query time | Do I want this content used for retrieval, grounding, or RAG-style answer assembly? |
| ai-train | Training or fine-tuning AI models | Do I allow this content to be used to improve future models? |

That distinction is far more useful than a flat "AI bot" label.
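Because each signal is a simple name=yes/no pair, tooling can read a posture mechanically. Here is a minimal sketch in Python, assuming a `Content-Signal:` line with the comma-separated form shown in the table; the parser and its name are illustrative, not an official library:

```python
# Hypothetical parser for a Content-Signal line of the form
# "Content-Signal: search=yes, ai-input=yes, ai-train=no".
# Field names follow Cloudflare's three documented purposes.
def parse_content_signal(line: str) -> dict:
    """Map each declared purpose to True (yes) or False (no).

    Purposes not mentioned are simply absent, meaning no stated preference.
    """
    _, _, value = line.partition(":")
    signals = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, setting = part.partition("=")
        signals[name.strip().lower()] = setting.strip().lower() == "yes"
    return signals

print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
# {'search': True, 'ai-input': True, 'ai-train': False}
```

A missing key is deliberately different from `False`: silence is not a refusal, which matters when deciding how to treat unstated purposes.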

What search actually means

The search purpose is the easiest one to understand.

Cloudflare describes it as building a search index and providing search results with links and excerpts.

In other words, this purpose is tied to:

  • discoverability;
  • indexing;
  • linked results;
  • excerpt and citation behavior.

The core business question is usually:

Do I want this content available in search experiences that may still send people back to my site?

That makes search different from both training and answer-time retrieval.

It is about being found.

What ai-input actually means

ai-input is the signal many teams misunderstand first.

Cloudflare describes it as inputting content into AI models at query time, including use cases such as retrieval-augmented generation and grounding.

That means the content is not necessarily being used to train the model.

It may instead be pulled into the model’s working context when a user asks a question.

That is a very different business and governance question.

It asks:

  • should my content be used in real-time or near-real-time answer construction;
  • do I allow grounding and retrieval-style use;
  • do I want to permit answer-time model input even if I refuse long-term training?

This is why "Why robots.txt is not enough for user-triggered AI agents" and "ai.txt vs robots.txt vs llms.txt" belong in the same reading path.

What ai-train actually means

ai-train is the clearest downstream-use signal.

Cloudflare defines it as training or fine-tuning AI models.

This is the purpose that most closely maps to the familiar publisher concern:

"Do I want my content used to improve future models?"

It is not the same as search visibility. It is not the same as real-time retrieval. It is about future model development.

That is exactly why a site may rationally choose:

  • search=yes
  • ai-input=yes
  • ai-train=no

or any other combination that fits its business posture.
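For example, using the Content-Signal line that Cloudflare's Content Signals Policy adds to robots.txt, that posture could be written roughly as follows (a sketch; verify the exact field names against the current policy text):

```
User-Agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

The Allow rule still governs access; the Content-Signal line declares intended use after access.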

Why the distinction matters operationally

The meaning of the distinction is not theoretical.

Cloudflare’s /crawl endpoint explicitly respects Content Signals found in a target site’s robots.txt.

It documents the three purposes as:

  • search
  • ai-input
  • ai-train

It also explains something very revealing:

By default, /crawl declares all three purposes. If a site sets any one of those purposes to no, the crawl request is rejected unless the operator narrows the declared purposes and removes the disallowed one.

That means the distinction is doing real work.

It is not just a semantic nicety. It is a machine-readable declaration about what kind of use is intended after access.

The three business questions you should actually ask

A good policy discussion becomes clearer when you ask three separate questions.

1. Do we want discoverability?

That is the search question.

2. Do we want answer-time retrieval or grounding?

That is the ai-input question.

3. Do we want future model-improvement use?

That is the ai-train question.

Many sites do not answer all three the same way.

And they should not feel forced to.

Three common mistakes

1. Treating ai-input as if it were just another word for training

It is not.

ai-input is about query-time use, not necessarily long-term model improvement.

2. Assuming that allowing search means allowing everything else

It does not.

A site may want to appear in linked search results but still reject some answer-time model ingestion or training use.

3. Reading these signals as hard technical enforcement

They are not.

Cloudflare explicitly describes Content Signals as trust-based declarations about intended use after access.

That is still valuable. But it is not the same thing as a firewall.

Where Better Robots.txt fits

This is one of the strongest places where Better Robots.txt makes the modern landscape easier to reason about.

It helps site owners publish a more explicit posture instead of collapsing everything into a crude allow/deny model.

That makes it easier to express a policy such as:

  • stay visible in search;
  • allow some answer-time use;
  • refuse training;
  • keep the published policy coherent across files.

That is a governance gain even when no single signal guarantees universal obedience.

The correct mental model

The safest model is this:

  • search = discoverability
  • ai-input = answer-time model use
  • ai-train = future model improvement

Once those are separated clearly, policy stops sounding vague and starts becoming operational.