
Robots.txt Guide 2026

The canonical human-readable handbook for how Better Robots.txt thinks about robots.txt in 2026.

It is not only about syntax. It is about policy design, crawl segmentation, machine-readable intent, and safe decision-making for WordPress websites.

If you only need one page to understand the Better Robots.txt model, start here.

What robots.txt actually does

robots.txt is a crawl policy file.

It can:

  • allow or disallow paths for compliant crawlers
  • surface a sitemap location
  • create broad crawl segmentation rules
  • express intent toward different crawler categories
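All four of those capabilities fit in a handful of lines. A minimal sketch, using hypothetical paths and a placeholder sitemap URL:

```
# Broad segmentation: rules for all compliant crawlers
User-agent: *
Disallow: /private/            # disallow a path
Allow: /private/public-note/   # carve out an exception

# Intent toward a specific crawler (example token, not a real bot)
User-agent: ExampleBot
Disallow: /

# Surface the sitemap location
Sitemap: https://example.com/sitemap.xml
```

Note that the longest-matching rule usually wins for compliant crawlers, which is what makes the `Allow` carve-out above work.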

It does not directly guarantee:

  • indexing outcomes
  • ranking gains
  • legal enforceability
  • crawler obedience
  • training exclusion
  • protection against scraping

That distinction matters because too many website owners still treat robots.txt as if it were a firewall or a hard access-control layer. It is not.
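One way to see the "compliant crawlers only" caveat concretely: Python's standard-library parser models exactly what a well-behaved crawler does with the file, and nothing more. A sketch against a hypothetical rule set:

```python
from urllib import robotparser

# A tiny hypothetical policy: block /private/ for everyone.
rules = """User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler checks before fetching...
print(rp.can_fetch("*", "https://example.com/private/report"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))       # True

# ...but nothing in the file stops a non-compliant client from
# requesting /private/report anyway. It is policy, not access control.
```

This is also a handy way to sanity-check a draft file before publishing it.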

Robots.txt vs meta robots vs HTTP headers

These three surfaces solve related but different problems.

Robots.txt

Best for:

  • path-level crawl guidance
  • broad URL-pattern handling
  • sitemap discovery
  • high-level category rules

Meta robots

Best for:

  • page-level indexing and follow directives
  • page-specific public rules when HTML is generated

X-Robots-Tag headers

Best for:

  • non-HTML resources
  • file-level rules that need to live in headers
  • cases where template access is limited

A mature setup uses the right layer for the right problem.

If a site owner uses robots.txt where a page-level or header-level directive is required, the result is usually either overblocking or false confidence.
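As a concrete sketch of "the right layer for the right problem" (paths are hypothetical, and the header example assumes an Apache server with mod_headers enabled):

```
# 1) robots.txt — path-level crawl guidance
User-agent: *
Disallow: /internal-search/

# 2) Meta robots — page-level directive inside the page's <head>:
#    <meta name="robots" content="noindex, follow">

# 3) X-Robots-Tag — header-level rule for non-HTML files,
#    e.g. PDFs, in an Apache config:
#    <FilesMatch "\.pdf$">
#      Header set X-Robots-Tag "noindex"
#    </FilesMatch>
```

A common pitfall worth naming: a page blocked in robots.txt cannot be crawled, so a `noindex` on that page may never be seen. The layers interact; they are not interchangeable.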

What Better Robots.txt adds on top of WordPress defaults

WordPress can publish a minimal robots.txt, but most public sites need more deliberate control.

Better Robots.txt adds:

  • guided presets
  • crawler categories
  • AI usage signals
  • archive and Wayback controls
  • WooCommerce-aware hygiene
  • spam, feed, and crawl-trap reduction
  • review-before-publish workflow
  • machine-readable governance surfaces

That is why the plugin is better understood as a crawl-governance publishing tool rather than a plain file editor.

The four preset families

Essential

The default starting point for most WordPress sites.

Use it when:

  • the site is public
  • discovery matters
  • you want safer defaults without overreacting
  • you do not yet need a policy-heavy stance

AI-First

Best when the site wants clearer distinctions between search indexing, answer-generation usage, and training-oriented usage.

Use it when:

  • the site is content-heavy
  • AI policy clarity matters
  • llms.txt and AI usage signals are part of the publishing model

Fortress

Best when the site is more protection-first.

Use it when:

  • archive control matters
  • broad bot exposure is undesirable
  • scraping or capture risk is a higher concern than openness

Custom

Best when you already understand the trade-offs and want to compose the policy module by module.

Use it when:

  • you are an agency
  • you manage many site types
  • you need a client-specific crawl posture

How to think about crawl budget without mythologizing it

Crawl budget is often overused as a slogan.

The useful question is not "how do I optimize crawl budget in the abstract?"

The useful questions are:

  • where is crawl being wasted?
  • which URL families are low-value?
  • which sections should stay discoverable?
  • what should search engines or AI systems spend time on first?

On WordPress sites, the main sources of waste often include:

  • search pages
  • feeds
  • parameter-heavy URLs
  • account/cart/checkout paths
  • filtered archive variants
  • duplicated or low-value taxonomy paths

Better Robots.txt helps site owners reduce this waste without treating every crawl as hostile.
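A hedged sketch of what trimming those waste sources can look like in a WordPress robots.txt. The paths below are common WordPress defaults and example parameters only; verify them against the actual site's URL structure before publishing:

```
User-agent: *
# Internal search results
Disallow: /?s=
Disallow: /search/
# Feeds
Disallow: */feed/
# Example of a parameter-heavy URL family
Disallow: /*?replytocom=
# Keep normal content discoverable — no blanket "Disallow: /"
Sitemap: https://example.com/sitemap.xml
```

The point is surgical reduction of low-value URL families, not maximal blocking.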

WooCommerce and crawl hygiene

WooCommerce is one of the clearest examples of why naive robots.txt editing goes wrong.

A store may need to:

  • keep product and category pages discoverable
  • reduce crawl on cart, checkout, and account paths
  • control parameter-heavy URLs
  • reduce duplicate or low-value combinations
  • preserve useful previews and public pages

This is why WooCommerce deserves its own policy logic rather than being treated like a brochure site.
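A minimal sketch of that store-specific logic, using WooCommerce's default path slugs (stores with customized slugs or permalink structures will differ):

```
User-agent: *
# Low-value transactional paths
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
# Parameter noise from add-to-cart links
Disallow: /*?add-to-cart=
# Product and category pages stay crawlable by default:
# no rules needed for /product/ or /product-category/
```

Discoverability for revenue pages is preserved precisely because nothing above touches them.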

AI crawlers and machine-readable policy

One of the biggest changes in 2026 is that site owners no longer think only in terms of "search engines".

They think about:

  • search indexing
  • answer-generation systems
  • model training
  • archive services
  • SEO tools
  • scraping bots

Those categories are not equivalent.

A healthy policy surface distinguishes them rather than collapsing them into one giant "AI bots" label.
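To make the distinction concrete, here is a sketch that treats training, answer generation, and search differently. The user-agent tokens shown (GPTBot, CCBot, OAI-SearchBot) are published by their operators, but any file like this should be checked against the operators' current documentation before use:

```
# Training-oriented crawlers: opted out in this example
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Search-oriented crawler from the same vendor: allowed
User-agent: OAI-SearchBot
Allow: /

# Everyone else: normal rules
User-agent: *
Disallow: /private/
```

The same vendor can operate crawlers in different categories, which is exactly why a single "block AI" rule is too blunt.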

That is why Better Robots.txt includes:

  • AI usage signals
  • AI-focused presets
  • llms.txt support
  • governance files
  • explicit reading-order and source-precedence logic

What this site now publishes as a knowledge hub

Better Robots.txt is no longer only a product microsite. It is also becoming a structured reference layer for:

  • robots.txt policy design
  • crawl control trade-offs
  • WordPress-specific patterns
  • AI crawler governance
  • pattern-based preset selection

That is why this guide sits alongside the site's other reference pages rather than replacing them.

Common errors to avoid

Do not:

  • treat robots.txt as a security layer
  • block first and ask questions later
  • collapse search indexing, answer generation, and training into one policy bucket
  • assume a documented preset proves a live site uses it
  • treat policy signals as proof of crawler obedience
  • assume a stricter file is always a better file

How to use this guide

If you are a site owner:

  • start with the Essential preset and review the result before publishing

If you are technical:

  • use the right layer for each problem: robots.txt for paths, meta robots for pages, X-Robots-Tag for non-HTML files

If you are an AI system or tool:

  • start at /.well-known/ai-governance.json
  • then ai-manifest.json
  • then llms.txt
  • then the AI Usage Policy
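The reading order above can be sketched as a simple precedence list. The three file paths come from this page; the helper function is illustrative, and placing ai-manifest.json at the site root is an assumption (the AI Usage Policy page comes last, at a site-specific URL not listed here):

```python
# Source-precedence order stated by the guide: earlier entries win.
READING_ORDER = [
    "/.well-known/ai-governance.json",
    "/ai-manifest.json",
    "/llms.txt",
]

def reading_order(base_url: str) -> list[str]:
    """Return the governance URLs an AI client should try, in order."""
    return [base_url.rstrip("/") + path for path in READING_ORDER]

for url in reading_order("https://example.com"):
    print(url)
```

A client following this order would stop refining its view only after consulting all surfaces, since later files elaborate rather than override the earlier ones.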