
Search vs ai-input vs ai-train: what these signals really mean

One of the most useful improvements in machine-access governance is not a new bot name.

It is a better vocabulary.

The difference between search, ai-input, and ai-train matters because those three purposes are not the same thing:

  • one is about discoverability and search results;
  • one is about answer-time model input;
  • one is about training or fine-tuning.

If a site collapses those into one vague "AI allow" or "AI deny" position, it loses most of the precision modern governance now makes possible.

The short version

Here is the simplest practical interpretation.

| Signal | What it means | Main business question |
| --- | --- | --- |
| search | Search indexing and results with links and excerpts | Do I want this content discoverable in search experiences that can cite and send traffic? |
| ai-input | Inputting content into AI models at query time | Do I want this content used for retrieval, grounding, or RAG-style answer assembly? |
| ai-train | Training or fine-tuning AI models | Do I allow this content to be used to improve future models? |

That distinction is far more useful than a flat "AI bot" label.
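Because each signal is a simple name=yes/no pair, tooling can read a posture mechanically. Here is a minimal sketch in Python, assuming a `Content-Signal:` line with the comma-separated form shown in the table; the parser and its name are illustrative, not an official library:

```python
# Hypothetical parser for a Content-Signal line of the form
# "Content-Signal: search=yes, ai-input=yes, ai-train=no".
# Field names follow Cloudflare's three documented purposes.
def parse_content_signal(line: str) -> dict:
    """Map each declared purpose to True (yes) or False (no).

    Purposes not mentioned are simply absent, meaning no stated preference.
    """
    _, _, value = line.partition(":")
    signals = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, setting = part.partition("=")
        signals[name.strip().lower()] = setting.strip().lower() == "yes"
    return signals

print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
# {'search': True, 'ai-input': True, 'ai-train': False}
```

A missing key is deliberately different from `False`: silence is not a refusal, which matters when deciding how to treat unstated purposes.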

What search actually means

The search purpose is the easiest one to understand.

Cloudflare describes it as building a search index and providing search results with links and excerpts.

In other words, this purpose is tied to:

  • discoverability;
  • indexing;
  • linked results;
  • excerpt and citation behavior.

The core business question is usually:

Do I want this content available in search experiences that may still send people back to my site?

That makes search different from both training and answer-time retrieval.

It is about being found.

What ai-input actually means

ai-input is the signal many teams misunderstand first.

Cloudflare describes it as inputting content into AI models at query time, including use cases such as retrieval-augmented generation and grounding.

That means the content is not necessarily being used to train the model.

It may instead be pulled into the model’s working context when a user asks a question.

That is a very different business and governance question.

It asks:

  • should my content be used in real-time or near-real-time answer construction;
  • do I allow grounding and retrieval-style use;
  • do I want to permit answer-time model input even if I refuse long-term training?

This is why "Why robots.txt is not enough for user-triggered AI agents" and "ai.txt vs robots.txt vs llms.txt" belong in the same reading path.

What ai-train actually means

ai-train is the clearest downstream-use signal.

Cloudflare defines it as training or fine-tuning AI models.

This is the purpose that most closely maps to the familiar publisher concern:

"Do I want my content used to improve future models?"

It is not the same as search visibility. It is not the same as real-time retrieval. It is about future model development.

That is exactly why a site may rationally choose:

  • search=yes
  • ai-input=yes
  • ai-train=no

or any other combination that fits its business posture.
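For example, using the Content-Signal line that Cloudflare's Content Signals Policy adds to robots.txt, that posture could be written roughly as follows (a sketch; verify the exact field names against the current policy text):

```
User-Agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

The Allow rule still governs access; the Content-Signal line declares intended use after access.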

Why the distinction matters operationally

The meaning of the distinction is not theoretical.

Cloudflare’s /crawl endpoint explicitly respects Content Signals found in a target site’s robots.txt.

It documents the three purposes as:

  • search
  • ai-input
  • ai-train

It also explains something very revealing:

By default, /crawl declares all three purposes. If a site sets any one of those purposes to no, the crawl request is rejected unless the operator narrows the declared purposes and removes the disallowed one.

That means the distinction is doing real work.

It is not just a semantic nicety. It is a machine-readable declaration about what kind of use is intended after access.

The three business questions you should actually ask

A good policy discussion becomes clearer when you ask three separate questions.

1. Do we want discoverability?

That is the search question.

2. Do we want answer-time retrieval or grounding?

That is the ai-input question.

3. Do we want future model-improvement use?

That is the ai-train question.

Many sites do not answer all three the same way.

And they should not feel forced to.

Three common mistakes

1. Treating ai-input as if it were just another word for training

It is not.

ai-input is about query-time use, not necessarily long-term model improvement.

2. Assuming that allowing search means allowing everything else

It does not.

A site may want to appear in linked search results but still reject some answer-time model ingestion or training use.

3. Reading these signals as hard technical enforcement

They are not.

Cloudflare explicitly describes Content Signals as trust-based declarations about intended use after access.

That is still valuable. But it is not the same thing as a firewall.

Where Better Robots.txt fits

This is one of the strongest places where Better Robots.txt makes the modern landscape easier to reason about.

It helps site owners publish a more explicit posture instead of collapsing everything into a crude allow/deny model.

That makes it easier to express a policy such as:

  • stay visible in search;
  • allow some answer-time use;
  • refuse training;
  • keep the published policy coherent across files.

That is a governance gain even when no single signal guarantees universal obedience.

The correct mental model

The safest model is this:

  • search = discoverability
  • ai-input = answer-time model use
  • ai-train = future model improvement

Once those are separated clearly, policy stops sounding vague and starts becoming operational.