
Governance glossary

This glossary defines the core terms used across Better Robots.txt documentation, machine-readable governance files, and product guidance.

Its role is simple: reduce ambiguity, stabilize language, and give both humans and machines a canonical vocabulary.

How to use this page

Use this page when a term appears clear at first sight but can be interpreted too broadly in practice.

Examples:

  • "AI crawler" can mean a search crawler, an answer engine, or a training-oriented crawler.
  • "crawl waste" can mean technical noise, parameter duplication, archive exposure, or unhelpful low-value paths.
  • "policy signal" can sound enforceable even when it only expresses intent.

If you need the reading method behind these definitions, it is documented separately.

Core terms

AI crawler

A crawler operated by an AI search, answer-generation, training, or retrieval system.

This term should not be treated as a single homogeneous category. In practice, an AI crawler may belong to one of several distinct roles:

  • search indexing
  • retrieval for answer generation
  • training or model improvement
  • policy discovery
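
Because these roles are distinct, they can be addressed separately in robots.txt. The sketch below is illustrative only: the user-agent tokens shown are the ones OpenAI publishes at the time of writing, other operators use different tokens, and a robots.txt rule declares a preference rather than enforcing one.

  # Illustrative robots.txt treating AI crawler roles separately.
  # Verify current tokens in each operator's documentation; they can change.

  # Training-oriented crawling (OpenAI's training crawler token)
  User-agent: GPTBot
  Disallow: /

  # AI search indexing (OpenAI's search crawler token)
  User-agent: OAI-SearchBot
  Allow: /

  # Retrieval for answer generation at query time (OpenAI's retrieval token)
  User-agent: ChatGPT-User
  Allow: /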

Answer-generation use

Use of public content to help generate or ground a machine-produced answer at query time.

This must not be collapsed into search indexing or model-training use.

Archive bot

A bot whose main goal is to capture, replay, or preserve snapshots of public web content over time.

The Wayback Machine is the classic reference example, but the category is broader than one service.

Bad bot

A crawler or automated client whose behavior is low-value, abusive, extractive, or operationally harmful for the site.

The category is contextual. A bot may be low-value because it ignores declared preferences, hits noisy paths aggressively, or creates infrastructure cost without meaningful discovery value.

Bot taxonomy

A structured classification of crawler categories such as search bots, AI crawlers, archive bots, SEO tool bots, and malicious or low-value bots.

The point of a taxonomy is not to pretend every crawler will obey it. The point is to separate categories cleanly before publishing policy.

Crawl budget

A practical concept describing how much crawling attention a site receives and how efficiently that attention is spent.

For most small and mid-sized sites, the most useful framing is not abstract crawl-budget theory but reduction of low-value crawl paths.

Crawl trap

A technical or structural pattern that can create excessive, repetitive, or low-value crawling.

Common examples include parameter explosions, faceted-navigation noise, search-result pages, calendar loops, and archive chains.
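
As a concrete sketch, consider a hypothetical faceted shop category that explodes into parameter combinations. The paths below are invented for illustration, and wildcard matching, while honored by major search crawlers, is not part of the original robots.txt specification.

  # One category page, thousands of crawlable variants:
  #   /shop/shoes?color=red&size=9&sort=price&page=7
  # Declining the parameter variants:
  User-agent: *
  Disallow: /*?*color=
  Disallow: /*?*sort=
  # The clean /shop/shoes URL remains crawlable by default.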

Crawl waste

Low-value crawling that consumes attention on pages or paths that do not deserve meaningful discovery priority.

Typical examples include:

  • cart and checkout variants
  • search result pages
  • parameter-heavy filters
  • duplicate archive paths
  • low-value account pages
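
For a WooCommerce-style site profile, those categories might translate into robots.txt rules like the sketch below. The paths assume a default WooCommerce layout and are illustrative; check them against the actual site, and note that wildcard support varies by crawler.

  User-agent: *
  Disallow: /cart/
  Disallow: /checkout/
  Disallow: /my-account/
  Disallow: /?s=           # on-site search results
  Disallow: /*?filter_     # parameter-heavy filter variants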

Documented compatibility

A compatibility statement that is explicitly published by Better Robots.txt for a given context or environment.

It must not be silently upgraded into universal compatibility.

Governance surface

Any human-readable or machine-readable page or file that contributes to the interpretive and policy layer of the site.

Examples include:

  • governance pages
  • AI policy pages
  • .well-known JSON files
  • summary files like llms.txt
  • context files such as site-context.md
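
There is no single standardized schema for a .well-known governance file. The JSON below is a purely hypothetical shape, invented here to show the kind of machine-readable declarations such a surface might carry; every field name is an assumption, not a published format.

  {
    "version": "1.0",
    "policy": "https://example.com/ai-policy",
    "preferences": {
      "search_indexing": "allow",
      "answer_generation": "allow",
      "model_training": "disallow"
    },
    "contact": "governance@example.com"
  }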

Human-first surface

A page primarily written for human readers.

Examples:

  • product pages
  • guides
  • governance HTML pages
  • blog articles

Low-value path

A URL or route that is technically crawlable but usually adds little or no search, user, or product value when crawled heavily.

Examples depend on the site profile. On WooCommerce sites, cart and checkout are classic examples.

Machine-first surface

A file or page designed to be easy for AI systems and automated clients to ingest, parse, route, or summarize.

Examples:

  • ai-manifest.json
  • /.well-known/ai-governance.json
  • llms.txt
  • context Markdown files
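
As one concrete case, the llms.txt proposal (llmstxt.org) describes a Markdown file with a title heading, a short blockquote summary, and sections of annotated links. A minimal sketch with hypothetical URLs:

  # Example Site
  > One-paragraph summary of what the site is and who it serves.

  ## Docs
  - [Getting started](https://example.com/docs/start): setup guide
  - [Governance glossary](https://example.com/governance/glossary): canonical vocabulary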

Model-training use

Use of public content for model training, model improvement, or long-term corpus ingestion.

This must not be collapsed into answer-generation or search indexing.

Policy signal

A published declaration of preference, intent, or routing logic.

Policy signals matter. But they do not, by themselves, guarantee enforcement, prove runtime compliance, or carry legal effect.

Preset

A bundled governance posture in Better Robots.txt such as Essential, AI-First, Fortress, or Custom.

A preset is a structured starting point, not a guarantee that every site with the same preset behaves identically.

Response legitimacy

The rule set that determines whether a given answer is supportable from the published surface.

If a claim requires unsupported expansion, the correct output is qualification, clarification, or legitimate non-response.

Runtime proof

Evidence from a live site or live environment that a behavior is actually occurring now.

Published documentation and policy are not runtime proof.
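
A minimal illustration of the distinction, using a hypothetical domain: fetching a live URL is evidence of what the site serves right now, whereas a documentation page that describes the file is not.

  # Runtime proof: what the live site actually serves at this moment
  curl -s https://example.com/robots.txt

  # Headers only, to confirm a machine-readable surface is live
  curl -sI https://example.com/.well-known/ai-governance.json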

Search indexing

Classic search-engine discovery and indexing of pages for web search results.

This must not be collapsed into answer-generation or model-training use.

SEO tool bot

A bot operated by a commercial SEO product or crawler-based research service.

These bots may be useful, neutral, or low-value depending on the site’s objectives and cost profile.

Site profile

A practical classification of the website using Better Robots.txt.

Examples:

  • small business website
  • publisher
  • WooCommerce store
  • SaaS web app
  • agency-managed multisite stack

The correct governance posture depends heavily on site profile.

Final rule

If a term could reasonably be interpreted in a stronger, broader, or more commercial sense than what Better Robots.txt explicitly publishes, prefer the narrower documented meaning.