Sitemap XML and robots.txt: how to make them work together

A sitemap and a robots.txt file serve opposite functions. The sitemap says "here is the content I want you to find." The robots.txt says "here is the content I want you to skip." Together, they form a complete crawl policy: an include list and an exclude list that guide every bot visiting your site.

In practice, most WordPress sites treat these two files as unrelated. The SEO plugin generates a sitemap. The robots.txt is either the WordPress default or a manually edited file. Nobody checks whether they agree. And when they disagree, the consequences are silent: wasted crawl budget, delayed indexing, and conflicting signals that search engines resolve on their own terms — not yours.

The fundamental contract

The relationship between sitemap and robots.txt should follow a simple rule: no URL should appear in both your sitemap and your robots.txt disallow list.

If a URL is in your sitemap, you are telling crawlers "this page is important, please crawl and index it." If that same URL matches a disallow rule in robots.txt, you are telling crawlers "do not access this page." These two instructions are contradictory. When a crawler encounters this conflict, it must decide which signal to prioritize.

Google's documentation states that robots.txt takes precedence: if a URL is disallowed, Google will not crawl it, regardless of whether it appears in the sitemap. The sitemap entry is ignored. The page may still appear in search results as a URL-only listing (without a snippet) if other sites link to it, but Google will never fetch the content.
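This precedence rule can be reproduced with Python's standard `urllib.robotparser`. The sketch below is illustrative only: the domain, paths, and rules are invented, with /private/ standing in for any disallowed section that also appears in a sitemap.

```python
from urllib import robotparser

# Hypothetical robots.txt for an example.com site.
rules = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap_index.xml
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Even if the sitemap lists this URL, a compliant crawler will not fetch it:
print(rp.can_fetch("*", "https://example.com/private/report/"))  # → False

# A URL outside the disallowed path remains crawlable:
print(rp.can_fetch("*", "https://example.com/blog/post/"))       # → True
```

A compliant crawler performs exactly this check before every fetch, which is why the robots.txt disallow wins over the sitemap entry.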

Common conflicts in WordPress

WordPress and its ecosystem of plugins create several common sitemap-robots conflicts:

Tag and category archives. Some SEO plugins include tag pages and category archives in the sitemap by default. If you have added Disallow: /tag/ or Disallow: /category/ in your robots.txt (a common recommendation for thin content management), these URLs are simultaneously invited and blocked.

Author archives. On sites with a single author, the author archive duplicates the main blog listing. Some configurations include author archives in the sitemap while also blocking them in robots.txt.

Paginated content. Paginated archive pages (/page/2/, /page/3/) sometimes appear in sitemaps generated by plugins, even when robots.txt blocks the /page/ pattern.

WooCommerce filtered pages. Product attribute filter URLs can generate thousands of sitemap entries while also being blocked by robots.txt rules targeting query parameters or faceted navigation paths.

Each of these conflicts sends a mixed signal. The crawler discovers the URL in the sitemap, checks it against robots.txt, finds it disallowed, and never fetches it. No content is retrieved, the sitemap entry accomplishes nothing, and a sitemap padded with blocked URLs dilutes the signal the file is meant to send.
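To see the contradiction in file form, compare these two hypothetical fragments (domain and slug invented), which block and advertise the same URL:

```
# robots.txt
User-agent: *
Disallow: /tag/
```

```xml
<!-- excerpt from sitemap.xml -->
<url>
  <loc>https://example.com/tag/wordpress/</loc>
</url>
```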

The Sitemap directive in robots.txt

The Sitemap: line in robots.txt is not technically a robots.txt rule. It is an extension defined by the sitemaps.org protocol and supported by Google, Bing, and most other major crawlers. It takes a full absolute URL, can appear anywhere in the file, and is independent of any user-agent group or disallow rule. Its purpose is simply to tell bots where your sitemap lives.

This directive matters more than most site owners realize, for two reasons.

First, it provides a universal discovery path. Google Search Console and Bing Webmaster Tools also accept sitemap submissions, but the robots.txt directive works for any crawler, including AI bots and archive services, without requiring them to check a proprietary tool.

Second, it is the only positive signal in robots.txt. Every other rule in the file restricts access. The Sitemap: directive is the one line that actively points crawlers toward content you want them to find. Omitting it is a missed opportunity.
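A minimal robots.txt carrying the directive might look like this (example.com is a placeholder; the first three lines mirror WordPress's default rules):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap_index.xml
```

Because the directive is independent of the user-agent groups above it, it can sit anywhere in the file; by convention it goes at the end.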

Alignment checklist

To ensure your sitemap and robots.txt are working together instead of against each other, verify these points:

Every URL in your sitemap is crawlable. Run your sitemap through a crawler or use an SEO audit tool to check each URL against your robots.txt. Any URL that returns "blocked by robots.txt" is a conflict that needs resolution.

No disallowed URL appears in the sitemap. Your SEO plugin should exclude URLs that match robots.txt disallow patterns. If it does not do this automatically, configure it manually or switch to a plugin that handles the interaction correctly.

The Sitemap directive points to the correct URL. If your SEO plugin uses a sitemap index (like /sitemap_index.xml), your robots.txt Sitemap: line should point to the index, not to individual child sitemaps.

Sitemap URLs use the canonical domain. If your site redirects from http to https or from www to non-www, your sitemap should use the final canonical URL. A mismatch between the sitemap URL and the canonical URL creates unnecessary redirect chains for crawlers.

Priority and frequency values are either accurate or absent. Google has stated that it ignores both priority and changefreq, so treat them as hints for other sitemap consumers at best. A sitemap where every page has a priority of 0.5 and a changefreq of weekly provides no useful information. If you keep these fields, make priority reflect actual page importance and changefreq reflect actual update patterns; if you cannot maintain accurate values, omit them entirely. An absent signal is better than an inaccurate one.
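The first two checklist points can be automated with a short script. The sketch below uses only the Python standard library; the robots.txt content, sitemap, and domain are invented stand-ins for your live files, which you would fetch over HTTP in practice:

```python
import xml.etree.ElementTree as ET
from urllib import robotparser

# Hypothetical inputs; in practice, fetch these from your live site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tag/
Disallow: /page/
Sitemap: https://example.com/sitemap_index.xml
"""

SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post/</loc></url>
  <url><loc>https://example.com/tag/wordpress/</loc></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def find_conflicts(robots_txt: str, sitemap_xml: str) -> list[str]:
    """Return every sitemap URL that robots.txt blocks for all crawlers."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
    return [url for url in urls if not rp.can_fetch("*", url)]

for url in find_conflicts(ROBOTS_TXT, SITEMAP_XML):
    print("conflict:", url)  # → conflict: https://example.com/tag/wordpress/
```

Any URL the script prints is simultaneously invited by the sitemap and blocked by robots.txt, and should be removed from one file or the other.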

The AI dimension

AI crawlers that respect robots.txt also benefit from a well-structured sitemap, even though most of them do not yet process sitemaps as systematically as search engines do. The emerging llms.txt file serves a similar purpose for AI systems: it points them toward content that is relevant and permitted, much as a sitemap points search engines toward important pages.

Better Robots.txt generates both files in coordination: the robots.txt rules and the sitemap directive are produced from the same configuration, reducing the risk of contradictions between what you block and what you promote.

The goal is a single, coherent crawl policy where the include list and the exclude list never overlap.