Why every WordPress site needs a custom robots.txt
WordPress generates a robots.txt file automatically for every installation. It is minimal: a user-agent wildcard, a disallow for /wp-admin/, and an allow for /wp-admin/admin-ajax.php. That is the entire policy governing how every bot on the internet interacts with your site.
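If you request /robots.txt on a stock installation, the virtual file WordPress serves looks close to this (the exact output can vary slightly by version):

```text
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```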
For a fresh blog with ten posts, this default might not cause obvious harm. But for any site that has grown beyond the basics — pages, products, membership areas, search functionality, multiple content types — the default robots.txt is not just insufficient. It is a liability.
What WordPress exposes by default
A standard WordPress installation creates dozens of URL patterns that generate real pages but deliver little or no value to search engines or AI systems:
Internal search results. Every search query a visitor runs on your site generates a URL like /?s=keyword or /search/keyword/. These pages are thin, near-duplicates of your existing archives, and should never be indexed. WordPress does not block them in robots.txt by default.
Feed URLs. WordPress generates RSS feeds at /feed/, /comments/feed/, and per-category and per-author feeds. These are useful for RSS readers but create duplicate content signals when crawled and indexed alongside your regular pages.
Author archives. On a single-author site, the author archive is a near-perfect duplicate of the main blog page. On multi-author sites, author archives may have value, but they need deliberate management — not the default behavior of being wide open.
Paginated archives. URLs like /page/2/, /page/3/, and beyond are generated for every archive type. Without proper handling, they consume crawl budget on content that search engines already access through your sitemap and main archive pages.
Login and registration pages. /wp-login.php and /wp-register.php should not be crawled by any external bot. They are administrative endpoints with no public value and are frequent targets for brute-force attacks.
WordPress REST API. The /wp-json/ endpoint exposes structured data about your site, posts, users, and configuration. While useful for legitimate integrations, it also leaks information that has no reason to be crawled by search engines or AI bots.
None of these are blocked by the default robots.txt. Every one of them consumes crawl budget, creates potential duplicate content issues, or exposes paths that bots have no legitimate reason to visit.
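As a sketch of what closing these gaps looks like, the rules below target the default WordPress URL patterns described above. The * wildcard is supported by all major crawlers, but your paths will differ if you have customized your permalink structure:

```text
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Internal search results
Disallow: /?s=
Disallow: /search/

# Feeds (root, comments, per-category, per-author)
Disallow: */feed/

# Paginated archives
Disallow: */page/

# Author archives -- only if they duplicate your main blog
Disallow: /author/

# Login and registration endpoints
Disallow: /wp-login.php
Disallow: /wp-register.php

# REST API
Disallow: /wp-json/
```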
The crawl budget problem
Crawl budget is the number of pages a search engine is willing to crawl on your site within a given time period. For small sites, crawl budget is rarely a visible constraint. But as a site grows — especially with WooCommerce product variations, faceted navigation, or large content archives — crawl budget becomes a bottleneck.
When search engine crawlers spend time fetching internal search results, paginated archives, and feed URLs, they have less capacity to crawl the pages that actually matter: your content, your products, and your landing pages. A custom robots.txt directs crawler attention toward high-value pages and away from low-value endpoints.
The AI governance problem
The default WordPress robots.txt makes no distinction between search engine crawlers and AI crawlers. This was not a problem in 2020 because AI crawlers barely existed. In 2026, it is a critical oversight.
If your site contains original content — articles, guides, creative work, proprietary data — and you have not configured your robots.txt to address AI crawlers specifically, you are making an implicit choice: you are allowing every AI company's bot to fetch and potentially use your content for training, retrieval, or generation.
This may be exactly what you want. Or it may be the opposite of what you want. Either way, the decision should be deliberate, not a side effect of using the WordPress default.
A custom robots.txt lets you:
- Allow Googlebot full access for search indexing while restricting GPTBot or ClaudeBot.
- Permit AI crawlers to access certain sections (like your public blog) while blocking others (like your paid content area).
- Block aggressive or undeclared bots that consume resources without providing value.
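As an illustration, assuming GPTBot and ClaudeBot are the AI crawlers you want to restrict, and with /blog/ standing in for your real public section, the three policies above translate to:

```text
# Search engines: full access for indexing
User-agent: Googlebot
Allow: /

# Block one AI crawler outright
User-agent: GPTBot
Disallow: /

# Let another AI crawler read the public blog only
# (/blog/ is a placeholder for your actual public path)
User-agent: ClaudeBot
Allow: /blog/
Disallow: /
```

Major crawlers, including Google's, apply the most specific matching rule, so the longer Allow: /blog/ path takes precedence over the blanket Disallow for URLs under /blog/.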
What a proper WordPress robots.txt includes
A well-configured WordPress robots.txt typically addresses these categories:
Core protections. Block /wp-admin/, /wp-login.php, /wp-register.php, and other administrative paths that no external bot needs to access.
Crawl hygiene. Block internal search results (/?s=), feed URLs where appropriate, and low-value paginated archives.
Plugin and theme artifacts. Block paths under /wp-content/plugins/ that expose administrative or configuration pages, and /wp-content/uploads/ subdirectories that allow browsable file listings.
AI governance. Add specific User-agent blocks for AI crawlers with rules that match your content strategy.
Sitemap reference. Include a Sitemap: directive pointing to your XML sitemap so crawlers have a clear entry point to your important content.
WooCommerce paths (if applicable). Block cart, checkout, account, and faceted navigation paths that generate massive numbers of low-value URLs.
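Putting the categories together, a complete file might look like the sketch below. The WooCommerce slugs (/cart/, /checkout/, /my-account/) and the sitemap URL assume default English settings on a hypothetical example.com; substitute your own:

```text
# Core protections
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php

# Crawl hygiene
Disallow: /?s=
Disallow: */feed/

# WooCommerce paths and faceted navigation
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?orderby=
Disallow: /*?filter_

# AI governance
User-agent: GPTBot
Disallow: /

# Sitemap reference
Sitemap: https://example.com/sitemap_index.xml
```

Take care not to block CSS and JavaScript files that search engines need to render your pages; a blanket Disallow on /wp-content/plugins/ can do exactly that, which is why the sketch above leaves it out.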
Why a plugin is better than manual editing
You can create a custom robots.txt by placing a physical file at your web root. But this approach has three problems.
First, the physical file and WordPress's virtual robots.txt are mutually exclusive: once a physical file exists at the web root, the server delivers it directly and the virtual one is bypassed, which creates confusion about which rules are actually being served.
Second, a manual file is static. When your site structure changes — new plugins, new post types, new URL patterns — the robots.txt does not update automatically.
Third, a manual file has no preview step. You write rules, save the file, and hope you did not accidentally block something important. There is no validation, no diff view, and no easy way to test the result.
Better Robots.txt solves all three problems: it replaces the virtual file cleanly, adapts to your site configuration, and gives you a review step before anything goes live.