
Robots.txt and multilingual sites: crawl budget, hreflang, and common traps

A multilingual WordPress site doubles or triples the number of crawlable URLs. Every page exists in multiple language versions, each with its own URL, its own hreflang annotations, and its own crawl requirements. This creates specific challenges for robots.txt configuration that monolingual sites never face.

The URL multiplication problem

A monolingual site with 100 pages has 100 URLs to crawl. A bilingual site with the same content has 200. A trilingual site has 300. Each language version needs to be crawled, indexed, and maintained in the search engine's index independently.

This multiplication directly affects crawl budget. The same hosting infrastructure serves two or three times more crawler requests. The same content base generates two or three times more pages competing for crawl attention. Low-value URL patterns that are minor annoyances on a monolingual site become significant crawl budget drains when multiplied across languages.

Robots.txt applies to the entire host

Robots.txt is a single file served from the root of each host, so on a subdirectory install there is no per-language robots.txt. Rules are matched as path prefixes: Disallow: /page/ blocks only URLs whose path begins with /page/. To block the equivalent path in every language directory, you need either one rule per language (/en/page/, /fr/page/, /de/page/) or a wildcard pattern such as Disallow: /*/page/, which the major crawlers support.

This means that robots.txt rules must be written with awareness of the full URL structure across all languages. A rule that is correct for the English site may have unintended consequences for the French or German versions if the URL patterns differ.
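As a small sketch of this, Python's standard-library robots.txt parser implements plain prefix matching (no wildcard support), which shows why a rule written for one path does not automatically cover the language-prefixed copies of that path (the domain and paths here are placeholders):

```python
# Sketch: prefix matching means Disallow: /page/ does not cover
# language-prefixed copies of the same path.
from urllib import robotparser

rules = """User-agent: *
Disallow: /page/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# /page/ is matched as a path prefix from the start of the path...
print(rp.can_fetch("*", "https://example.com/page/2/"))     # False
# ...so the copy under a language directory stays crawlable.
print(rp.can_fetch("*", "https://example.com/en/page/2/"))  # True
```

The same prefix logic is what makes per-language rules (or a wildcard, where the crawler supports one) necessary on subdirectory installs.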

Common multilingual URL structures include subdirectory models (/en/, /fr/), subdomain models (en.example.com, fr.example.com), and parameter models (?lang=en, ?lang=fr). Subdirectory models are the most common in WordPress and the most straightforward for robots.txt because all languages share a single file; subdomain models require a separate robots.txt served on each subdomain, since crawlers fetch the file per host.

The hreflang interaction

Hreflang annotations tell search engines which pages are language equivalents of each other. A properly configured hreflang setup ensures that French users see the French version of a page in search results, not the English version.

The interaction with robots.txt creates a subtle trap: if you block one language version of a page in robots.txt but not the other, the hreflang relationship breaks. Google cannot validate a hreflang annotation that points to a blocked page. The result is that both language versions may display incorrectly in search results, or the blocked version may appear as a URL-only listing without a snippet.

The rule is simple: if a page has hreflang annotations, all language versions must be crawlable. Do not block one language version while keeping the other accessible.
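Concretely, the annotations are reciprocal link elements in each page's head. A sketch for the English side of a hypothetical /en/ and /fr/ pair (the URLs are placeholders) looks like this; if robots.txt blocked /fr/pricing/, Google could not fetch it to confirm the return link, and the pair would break:

```html
<!-- In the <head> of https://example.com/en/pricing/ (hypothetical URLs) -->
<link rel="alternate" hreflang="en" href="https://example.com/en/pricing/" />
<link rel="alternate" hreflang="fr" href="https://example.com/fr/pricing/" />
<link rel="alternate" hreflang="x-default" href="https://example.com/en/pricing/" />
<!-- The French page must carry the same set, and both URLs must stay crawlable. -->
```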

What to block on multilingual sites

The same categories of low-value URLs that waste crawl budget on monolingual sites apply to multilingual sites, but with language-aware patterns:

Internal search results should be blocked for all languages. If searches generate URLs like /?s=keyword or /en/?s=keyword, the disallow rule should cover all variants.

Paginated archives should be blocked consistently across languages. A rule blocking /en/page/ but not /fr/page/ creates an asymmetry: crawlers skip English pagination but keep spending budget on its French equivalent.

Plugin-generated paths often appear identically in all language directories. WooCommerce cart and checkout paths, admin endpoints, and feed URLs should be blocked regardless of language prefix.
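Pulling the three categories together, a sketch for a subdirectory install might look like the following. The paths assume WordPress and WooCommerce defaults with /en/ and /fr/ directories, and localized slugs vary by translation setup, so treat every line as an example to adapt rather than a drop-in file:

```text
User-agent: *
# Internal search in every language (wildcards are supported by major crawlers)
Disallow: /*?s=
Disallow: /*&s=
# Paginated archives, symmetrically per language
Disallow: /en/page/
Disallow: /fr/page/
# Plugin paths that repeat under each language prefix
# (translated slugs such as /fr/panier/ depend on the translation plugin)
Disallow: /en/cart/
Disallow: /fr/panier/
Disallow: /en/checkout/
Disallow: /fr/commande/
```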

The sitemap coordination

On a multilingual site, the sitemap and robots.txt alignment is doubly important. Each language version of each page should appear in the sitemap with correct hreflang annotations, and no sitemap entry should match a robots.txt disallow rule.
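In an XML sitemap, those hreflang annotations ride along as xhtml:link alternates on each URL entry. A sketch of one bilingual pair (hypothetical URLs) shows the shape; note that every loc and every href here must remain crawlable under robots.txt:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/pricing/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/pricing/"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/pricing/"/>
  </url>
  <url>
    <loc>https://example.com/fr/pricing/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/pricing/"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/pricing/"/>
  </url>
</urlset>
```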

Better Robots.txt applies its rules consistently across the language structure of the site. When you configure a disallow for a URL pattern, it applies to all language directories. The sitemap directive in the generated robots.txt points to the sitemap index, which contains all language versions.