Robots.txt for publishers and news sites: speed, protection, and AI control

Publishing sites operate under constraints that most WordPress installations never face. Content freshness is measured in minutes. Crawl frequency determines how quickly articles appear in Google News and Discover. And the value of original reporting is directly threatened by AI systems that can summarize an article before a reader ever visits the source.

A publisher's robots.txt is not just a crawl configuration. It is an editorial policy expressed in machine syntax.

The speed imperative

For news sites, the interval between publishing an article and having it indexed by search engines is a competitive metric. A breaking news story that takes 30 minutes to appear in Google News loses to a competitor that gets indexed in 5 minutes.

Robots.txt affects indexing speed indirectly by shaping how efficiently crawlers use their limited visit time. Every request a crawler spends on a low-value URL is a request it did not spend on fresh content. For publishers, crawl budget optimization is not about saving server resources — it is about accelerating discovery of new articles.

The practical implications are aggressive: block everything that is not editorial content. Internal search results, author archives on single-author sites, date-based archives that duplicate the main feed, tag pages with single-digit article counts, and feed URLs that replicate the sitemap should all be candidates for disallow rules.

At the same time, category archives, topic pages, and author pages on multi-author sites often serve as landing pages for ongoing coverage areas. Blocking these removes entry points that both readers and crawlers use to discover content depth.
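As a sketch, the cleanup described above might look like the following for a typical WordPress-style URL layout (the specific paths are illustrative, not universal — audit the site's actual URL structure before blocking anything):

```
User-agent: *
# Internal search results and admin pages
Disallow: /?s=
Disallow: /search/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Per-post feed URLs that duplicate the sitemap
Disallow: /*/feed/
# Date-based archives that duplicate the main feed (adjust to the site's structure)
Disallow: /2023/
Disallow: /2024/
# Category, topic, and multi-author pages stay crawlable by default; no rule needed
```

Note that the wildcard `*` in path patterns is an extension supported by Google and Bing, not part of the original robots exclusion protocol.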

The AMP and syndication factor

Publishers that use AMP (Accelerated Mobile Pages) or syndicate content through Google News, Apple News, or Flipboard need to ensure that their robots.txt does not interfere with the specialized crawlers these platforms use. Google News relies on Googlebot-News, a distinct user-agent token; Apple's crawling runs through Applebot. Blocking broad patterns without considering these agents can remove content from distribution channels that drive significant traffic.

The safest approach is explicit whitelisting: use specific User-agent blocks for distribution-critical crawlers with Allow rules that guarantee access to editorial content, regardless of what the general User-agent: * rules restrict.
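Because major crawlers obey only the single most specific User-agent group that matches them, a dedicated group takes full control for that bot. A sketch (the general restrictions shown are illustrative):

```
# Distribution-critical crawlers: explicit, permissive groups
User-agent: Googlebot-News
Allow: /

User-agent: Applebot
Allow: /

# Everyone else inherits the general restrictions
User-agent: *
Disallow: /search/
Disallow: /wp-admin/
```

A crawler that matches a named group ignores the `User-agent: *` group entirely, so the `Allow: /` lines above guarantee unrestricted access regardless of what the general rules say.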

Content protection in the AI era

No industry has felt the impact of AI content extraction more acutely than publishing. A news article that took hours to research and write can be summarized by an AI system in seconds, delivering the essential information to a user who then has no reason to click through to the source.

For publishers, the robots.txt configuration around AI crawlers is not a theoretical governance exercise. It is a direct business decision with measurable revenue implications.

The spectrum of approaches runs from fully open to fully closed:

Full access: Allow all AI crawlers to access all content. This maximizes AI visibility (articles may be cited or linked in AI-generated answers) but provides no protection against extraction.

Selective access: Allow AI crawlers to access headlines, category pages, and metadata while blocking full article content. Because robots.txt rules match URL patterns, this only works when article bodies live under paths that can be distinguished from headline and category pages. It maintains visibility in AI search results while protecting the body text that represents editorial value.

Training opt-out: Allow AI crawlers to access content for real-time retrieval (answering user questions with attribution) while blocking use for model training. This is implemented by allowing bots like ChatGPT-User while blocking GPTBot, or by allowing Googlebot while blocking Google-Extended (a control token that Googlebot honors for AI training use, not a separate crawler).

Full block: Disallow all AI crawlers from all editorial content. This is the maximum protection position but removes the site from AI-mediated discovery entirely.
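The selective-access option depends entirely on URL structure. A sketch for one AI crawler, assuming a hypothetical layout where category pages live under /category/ and everything else holds full article bodies:

```
User-agent: GPTBot
# Allow metadata-rich entry points (paths are hypothetical)
Allow: /category/
# Allow the homepage only ($ anchors the end of the URL)
Allow: /$
# Block everything else, including article bodies
Disallow: /
```

The longest matching rule wins under Google's and Bing's precedence rules, so `Allow: /category/` overrides `Disallow: /`; the `$` anchor is likewise an extension those crawlers support, not part of the original protocol.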

Most publishers are settling on a hybrid approach: allowing search engine crawlers full access, permitting AI retrieval with attribution expectations, and blocking training-specific crawlers. Better Robots.txt's AI governance module was designed with exactly this use case in mind, letting publishers toggle each AI crawler category independently.
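The hybrid policy described above can be sketched directly in robots.txt (the crawler tokens are the publicly documented ones; whether to block each is the editorial decision):

```
# Search engines: full access — no rules needed beyond the general group

# AI retrieval with attribution: allowed
User-agent: ChatGPT-User
Allow: /

# AI training crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```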

Paywall and metered content

Publishers with subscription models face an additional complexity. Content behind a paywall or metered access should generally remain crawlable for indexing purposes (otherwise it disappears from search results) but may need different robots.txt treatment for AI crawlers.

A common configuration: allow Googlebot to access paywalled content (combining the Googlebot user-agent with paywall structured data such as isAccessibleForFree, so Google does not treat the paywall as cloaking) while blocking AI crawlers from the same paths. This ensures articles remain indexed for search while preventing AI extraction of premium content.
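A sketch of that split, assuming premium articles live under a hypothetical /premium/ path:

```
# Googlebot may crawl everything, including paywalled paths
# (pair with isAccessibleForFree paywall markup in the page's structured data)
User-agent: Googlebot
Allow: /

# AI crawlers are blocked from the premium paths
User-agent: GPTBot
Disallow: /premium/

User-agent: ClaudeBot
Disallow: /premium/

User-agent: Google-Extended
Disallow: /premium/
```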

The practical configuration

A publisher's robots.txt typically needs these elements:

Aggressive cleanup of non-editorial URLs: search results, administrative pages, asset directories, plugin endpoints, and low-value archives.

Explicit allow rules for editorial content paths, ensuring that no disallow pattern accidentally blocks articles.

Dedicated user-agent blocks for Google-News, Applebot, and other distribution crawlers with permissive access to editorial content.

Separate user-agent blocks for AI crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) with rules that reflect the publisher's AI access policy.

A sitemap directive pointing to the news sitemap, which updates in real time and tells crawlers exactly where to find fresh content.
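Assembled, the five elements above might look like this minimal skeleton (all paths and the sitemap URL are placeholders to adapt, not recommendations):

```
# 1. Cleanup of non-editorial URLs
User-agent: *
Disallow: /?s=
Disallow: /search/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# 2. Distribution crawlers: unrestricted
User-agent: Googlebot-News
Allow: /

User-agent: Applebot
Allow: /

# 3. AI policy: retrieval allowed, training blocked
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# 4. News sitemap (placeholder URL)
Sitemap: https://example.com/news-sitemap.xml
```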

Better Robots.txt's preset system addresses this through the AI-First preset, which is designed for publishers who want AI governance without sacrificing search engine access. The preset establishes clear boundaries between search indexing, AI retrieval, and AI training, with each category configurable independently.