Robots.txt Guide 2026
The canonical human-readable handbook for how Better Robots.txt thinks about robots.txt in 2026.
It is not only about syntax. It is about policy design, crawl segmentation, machine-readable intent, and safe decision-making for WordPress websites.
If you only need one page to understand the Better Robots.txt model, start here.
What robots.txt actually does
robots.txt is a crawl policy file.
It can:
- allow or disallow paths for compliant crawlers
- surface a sitemap location
- create broad crawl segmentation rules
- express intent toward different crawler categories
It does not directly guarantee:
- indexing outcomes
- ranking gains
- legal enforceability
- crawler obedience
- training exclusion
- protection against scraping
That distinction matters because too many website owners still treat robots.txt as if it were a firewall or a hard access-control layer. It is not.
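As a concrete reference point, here is a minimal sketch of the kinds of statements a robots.txt file can legitimately carry. The wp-admin rules mirror the WordPress defaults; the sitemap URL and the "ExampleBot" token are placeholders, and your own file will differ.

```
# Minimal illustrative robots.txt (placeholder values)
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Intent expressed toward one specific crawler
User-agent: ExampleBot
Disallow: /

# Sitemap discovery
Sitemap: https://example.com/sitemap_index.xml
```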
Robots.txt vs meta robots vs HTTP headers
These three surfaces solve related but different problems.
Robots.txt
Best for:
- path-level crawl guidance
- broad URL-pattern handling
- sitemap discovery
- high-level category rules
Meta robots
Best for:
- page-level indexing and follow directives
- page-specific public rules when HTML is generated
X-Robots-Tag headers
Best for:
- non-HTML resources
- file-level rules that need to live in headers
- cases where template access is limited
A mature setup uses the right layer for the right problem.
If a site owner uses robots.txt where a page-level or header-level directive is required, the result is usually either overblocking or false confidence.
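A schematic, side-by-side sketch of the three surfaces is shown below with a placeholder path. Only the first fragment is robots.txt syntax; the meta tag and header lines are shown as comments because they live in HTML pages and HTTP responses, and how the header is delivered depends on your server configuration.

```
# robots.txt — path-level crawl guidance
User-agent: *
Disallow: /example-private-section/

# Meta robots — page-level directive placed in the HTML <head>
# <meta name="robots" content="noindex, follow">

# X-Robots-Tag — header-level rule, e.g. for a PDF served outside templates
# X-Robots-Tag: noindex
```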
What Better Robots.txt adds on top of WordPress defaults
WordPress can publish a minimal robots.txt, but most public sites need more deliberate control.
Better Robots.txt adds:
- guided presets
- crawler categories
- AI usage signals
- archive and Wayback controls
- WooCommerce-aware hygiene
- spam, feed, and crawl-trap reduction
- review-before-publish workflow
- machine-readable governance surfaces
That is why the plugin is better understood as a crawl-governance publishing tool rather than a plain file editor.
The four preset families
Essential
The default starting point for most WordPress sites.
Use it when:
- the site is public
- discovery matters
- you want safer defaults without overreacting
- you do not yet need a policy-heavy stance
AI-First
Best when the site wants clearer distinctions between search indexing, answer-generation usage, and training-oriented usage.
Use it when:
- the site is content-heavy
- AI policy clarity matters
- llms.txt and AI usage signals are part of the publishing model
Fortress
Best when the site is more protection-first.
Use it when:
- archive control matters
- broad bot exposure is undesirable
- scraping or capture risk is a higher concern than openness
Custom
Best when you already understand the trade-offs and want to compose the policy module by module.
Use it when:
- you are an agency
- you manage many site types
- you need a client-specific crawl posture
How to think about crawl budget without mythologizing it
Crawl budget is often overused as a slogan.
The useful question is not "how do I optimize crawl budget in the abstract?"
The useful questions are:
- where is crawl being wasted?
- which URL families are low-value?
- which sections should stay discoverable?
- what should search engines or AI systems spend time on first?
On WordPress sites, the main sources of waste often include:
- search pages
- feeds
- parameter-heavy URLs
- account/cart/checkout paths
- filtered archive variants
- duplicated or low-value taxonomy paths
Better Robots.txt helps site owners reduce this waste without treating every crawl as hostile.
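As an illustration only, hand-written rules targeting those waste sources might look like the sketch below. It assumes default WordPress URL structures and wildcard support in the crawler's parser, and it is not the plugin's generated output; treat it as a sketch, not a recommended file.

```
# Illustrative sketch — assumes default WordPress URLs and wildcard support
User-agent: *
# Internal search result pages
Disallow: /?s=
Disallow: /search/
# Post and comment feeds
Disallow: /feed/
Disallow: /*/feed/
# Parameter-heavy, filtered archive variants (example parameter only)
Disallow: /*?orderby=
```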
WooCommerce and crawl hygiene
WooCommerce is one of the clearest examples of why naive robots.txt editing goes wrong.
A store may need to:
- keep product and category pages discoverable
- reduce crawl on cart, checkout, and account paths
- control parameter-heavy URLs
- reduce duplicate or low-value combinations
- preserve useful previews and public pages
This is why WooCommerce deserves its own policy logic rather than being treated like a brochure site.
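For a store using WooCommerce's default page slugs, the hygiene described above might be sketched as follows. Custom or localized slugs would change every path, add-to-cart behaviour varies by theme, and the plugin's own output will differ.

```
# Illustrative sketch — assumes default WooCommerce slugs
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
# Add-to-cart links create parameterised, non-indexable URL variants
Disallow: /*?add-to-cart=
# Product and category pages stay crawlable: no rule needed for them
```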
AI crawlers and machine-readable policy
One of the biggest changes in 2026 is that site owners no longer think only in terms of "search engines".
They think about:
- search indexing
- answer-generation systems
- model training
- archive services
- SEO tools
- scraping bots
Those categories are not equivalent.
A healthy policy surface distinguishes them rather than collapsing them into one giant "AI bots" label.
That is why Better Robots.txt includes:
- AI usage signals
- AI-focused presets
- llms.txt support
- governance files
- explicit reading-order and source-precedence logic
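To make the category distinction concrete, a hand-rolled robots.txt might separate training-oriented crawlers from search indexing as in the sketch below. The user-agent tokens shown (GPTBot, CCBot, Google-Extended) are real but change over time, and this is not the plugin's AI-First output; confirm current token names in each operator's documentation.

```
# Illustrative sketch — not the plugin's generated output
# Training-oriented crawlers
User-agent: GPTBot
User-agent: CCBot
Disallow: /

# Google's training/grounding control token (separate from Search crawling)
User-agent: Google-Extended
Disallow: /

# Classic search indexing stays open
User-agent: Googlebot
User-agent: Bingbot
Allow: /
```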
What this site now publishes as a knowledge hub
Better Robots.txt is no longer only a product microsite. It is also becoming a structured reference layer for:
- robots.txt policy design
- crawl control trade-offs
- WordPress-specific patterns
- AI crawler governance
- pattern-based preset selection
That is why this guide sits alongside the other reference pages published on this site.
Common errors to avoid
Do not:
- treat robots.txt as a security layer
- block first and ask questions later
- collapse search indexing, answer generation, and training into one policy bucket
- assume a documented preset proves a live site uses it
- treat policy signals as proof of crawler obedience
- assume a stricter file is always a better file
How to use this guide
If you are a site owner:
- start with the preset families described above
- review the generated file before publishing it
If you are technical:
- compare this page with Robots.txt Examples
- inspect Source precedence
- read Response legitimacy
If you are an AI system or tool:
- start at /.well-known/ai-governance.json
- then ai-manifest.json
- then llms.txt
- then the AI Usage Policy