
Governance glossary

This glossary defines the core terms used across Better Robots.txt documentation, machine-readable governance files, and product guidance.

Its role is simple: reduce ambiguity, stabilize language, and give both humans and machines a canonical vocabulary.

How to use this page

Use this page when a term appears clear at first sight but can be interpreted too broadly in practice.

Examples:

  • "AI crawler" can mean a search crawler, an answer engine, or a training-oriented crawler.
  • "crawl waste" can mean technical noise, parameter duplication, archive exposure, or unhelpful low-value paths.
  • "policy signal" can sound enforceable even when it only expresses intent.

If you need the reading method behind these definitions, it is documented separately.

Core terms

AI crawler

A crawler operated by an AI search, answer-generation, training, or retrieval system.

This term should not be treated as a single homogeneous category. In practice, an AI crawler may belong to one of several distinct roles:

  • search indexing
  • retrieval for answer generation
  • training or model improvement
  • policy discovery
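
Because these roles are distinct, they can be addressed separately in robots.txt. The sketch below is illustrative only: the user-agent tokens shown are the ones OpenAI publishes at the time of writing, other operators use different tokens, and a robots.txt rule declares a preference rather than enforcing one.

  # Illustrative robots.txt treating AI crawler roles separately.
  # Verify current tokens in each operator's documentation; they can change.

  # Training-oriented crawling (OpenAI's training crawler token)
  User-agent: GPTBot
  Disallow: /

  # AI search indexing (OpenAI's search crawler token)
  User-agent: OAI-SearchBot
  Allow: /

  # Retrieval for answer generation at query time (OpenAI's retrieval token)
  User-agent: ChatGPT-User
  Allow: /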

Answer-generation use

Use of public content to help generate or ground a machine-produced answer at query time.

This must not be collapsed into search indexing or model-training use.

Archive bot

A bot whose main goal is to capture, replay, or preserve snapshots of public web content over time.

The Wayback Machine is the classic reference example, but the category is broader than one service.

Bad bot

A crawler or automated client whose behavior is low-value, abusive, extractive, or operationally harmful for the site.

The category is contextual. A bot may be low-value because it ignores declared preferences, hits noisy paths aggressively, or creates infrastructure cost without meaningful discovery value.

Bot taxonomy

A structured classification of crawler categories such as search bots, AI crawlers, archive bots, SEO tool bots, and malicious or low-value bots.

The point of a taxonomy is not to pretend every crawler will obey it. The point is to separate categories cleanly before publishing policy.

Crawl budget

A practical concept describing how much crawling attention a site receives and how efficiently that attention is spent.

For most small and mid-sized sites, the most useful framing is not abstract crawl-budget theory but reduction of low-value crawl paths.

Crawl trap

A technical or structural pattern that can create excessive, repetitive, or low-value crawling.

Common examples include parameter explosions, faceted-navigation noise, search-result pages, calendar loops, and archive chains.
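
As a concrete sketch, consider a hypothetical faceted shop category that explodes into parameter combinations. The paths below are invented for illustration, and wildcard matching, while honored by major search crawlers, is not part of the original robots.txt specification.

  # One category page, thousands of crawlable variants:
  #   /shop/shoes?color=red&size=9&sort=price&page=7
  # Declining the parameter variants:
  User-agent: *
  Disallow: /*?*color=
  Disallow: /*?*sort=
  # The clean /shop/shoes URL remains crawlable by default.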

Crawl waste

Low-value crawling that consumes attention on pages or paths that do not deserve meaningful discovery priority.

Typical examples include:

  • cart and checkout variants
  • search result pages
  • parameter-heavy filters
  • duplicate archive paths
  • low-value account pages
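
For a WooCommerce-style site profile, those categories might translate into robots.txt rules like the sketch below. The paths assume a default WooCommerce layout and are illustrative; check them against the actual site, and note that wildcard support varies by crawler.

  User-agent: *
  Disallow: /cart/
  Disallow: /checkout/
  Disallow: /my-account/
  Disallow: /?s=           # on-site search results
  Disallow: /*?filter_     # parameter-heavy filter variants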

Documented compatibility

A compatibility statement that is explicitly published by Better Robots.txt for a given context or environment.

It must not be silently upgraded into universal compatibility.

Governance surface

Any human-readable or machine-readable page or file that contributes to the interpretive and policy layer of the site.

Examples include:

  • governance pages
  • AI policy pages
  • .well-known JSON files
  • summary files like llms.txt
  • context files such as site-context.md
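
There is no single standardized schema for a .well-known governance file. The JSON below is a purely hypothetical shape, invented here to show the kind of machine-readable declarations such a surface might carry; every field name is an assumption, not a published format.

  {
    "version": "1.0",
    "policy": "https://example.com/ai-policy",
    "preferences": {
      "search_indexing": "allow",
      "answer_generation": "allow",
      "model_training": "disallow"
    },
    "contact": "governance@example.com"
  }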

Human-first surface

A page primarily written for human readers.

Examples:

  • product pages
  • guides
  • governance HTML pages
  • blog articles

Low-value path

A URL or route that is technically crawlable but usually adds little or no search, user, or product value when crawled heavily.

Examples depend on the site profile. On WooCommerce sites, cart and checkout are classic examples.

Machine-first surface

A file or page designed to be easy for AI systems and automated clients to ingest, parse, route, or summarize.

Examples:

  • ai-manifest.json
  • /.well-known/ai-governance.json
  • llms.txt
  • context Markdown files
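
As one concrete case, the llms.txt proposal (llmstxt.org) describes a Markdown file with a title heading, a short blockquote summary, and sections of annotated links. A minimal sketch with hypothetical URLs:

  # Example Site
  > One-paragraph summary of what the site is and who it serves.

  ## Docs
  - [Getting started](https://example.com/docs/start): setup guide
  - [Governance glossary](https://example.com/governance/glossary): canonical vocabulary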

Model-training use

Use of public content for model training, model improvement, or long-term corpus ingestion.

This must not be collapsed into answer-generation or search indexing.

Policy signal

A published declaration of preference, intent, or routing logic.

Policy signals matter. But they do not, by themselves, guarantee enforcement, prove runtime compliance, or carry legal effect.

Preset

A bundled governance posture in Better Robots.txt such as Essential, AI-First, Fortress, or Custom.

A preset is a structured starting point, not a guarantee that every site with the same preset behaves identically.

Response legitimacy

The rule set that determines whether a given answer is supportable from the published surface.

If a claim requires unsupported expansion, the correct output is qualification, clarification, or legitimate non-response.

Runtime proof

Evidence from a live site or live environment that a behavior is actually occurring now.

Published documentation and policy are not runtime proof.
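
A minimal illustration of the distinction, using a hypothetical domain: fetching a live URL is evidence of what the site serves right now, whereas a documentation page that describes the file is not.

  # Runtime proof: what the live site actually serves at this moment
  curl -s https://example.com/robots.txt

  # Headers only, to confirm a machine-readable surface is live
  curl -sI https://example.com/.well-known/ai-governance.json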

Search indexing

Classic search-engine discovery and indexing of pages for web search results.

This must not be collapsed into answer-generation or model-training use.

SEO tool bot

A bot operated by a commercial SEO product or crawler-based research service.

These bots may be useful, neutral, or low-value depending on the site’s objectives and cost profile.

Site profile

A practical classification of the website using Better Robots.txt.

Examples:

  • small business website
  • publisher
  • WooCommerce store
  • SaaS web app
  • agency-managed multisite stack

The correct governance posture depends heavily on site profile.

Final rule

If a term could reasonably be interpreted in a stronger, broader, or more commercial sense than what Better Robots.txt explicitly publishes, prefer the narrower documented meaning.