Google-Extended vs Googlebot: how to block AI training without losing search indexing

Google exposes two robots.txt identities that site owners need to distinguish. Googlebot is the crawler that indexes pages for Google Search. Google-Extended is a control token that determines whether crawled content may be used to train Google's Gemini AI models. Requests for both come from the same Google crawler infrastructure and IP ranges, but the two tokens serve completely different purposes, and blocking the wrong one has very different consequences.

What Googlebot does

Googlebot is Google's primary web crawler. It fetches pages, renders JavaScript, reads structured data, and builds the search index that powers Google Search, Google News, and Google Discover. When someone finds your site through a Google search, it is because Googlebot crawled and indexed your content.

Blocking Googlebot removes your site from Google Search entirely. There is no partial effect: a Disallow: / rule for Googlebot means Google cannot crawl any page, which means no page appears in search results. This is almost never what a site owner intends.
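To make that consequence concrete, here is a minimal sketch using Python's standard urllib.robotparser (the example.com URLs are placeholders): a Disallow: / group for Googlebot denies every path, not just some of them.

```python
from urllib import robotparser

# A robots.txt that blocks Googlebot entirely. This removes the site
# from Google Search; it does not merely limit AI training.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Every path is denied to Googlebot, so nothing can be indexed.
print(rp.can_fetch("Googlebot", "https://example.com/"))          # False
print(rp.can_fetch("Googlebot", "https://example.com/any/page"))  # False
```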

What Google-Extended does

Google-Extended is a standalone product token that Google introduced in September 2023 specifically for AI-related data collection. It has no user agent string of its own: Google's existing crawlers fetch the pages, and the token controls whether that content may be used to train and improve Gemini, Google's family of large language models.

The critical distinction: blocking Google-Extended does not affect your Google Search indexing. Your pages continue to appear in search results, your rankings are unaffected, and Googlebot continues to crawl normally. The only thing that changes is that your content stops being used for Gemini model training.

This separation was a direct response to publisher demand. Before Google-Extended existed, there was no way to opt out of AI training without also opting out of search indexing. The two-agent model gives site owners granular control.

How to configure each in robots.txt

The configuration is straightforward:

To allow search indexing and block AI training, add a specific block for Google-Extended while leaving Googlebot unrestricted:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

To allow both search indexing and AI training, simply do not add any Google-Extended rules. The default is open access.

To block everything from Google (search and AI), block Googlebot. Content that Googlebot cannot crawl is never collected in the first place, so a Googlebot block implicitly covers Gemini training as well.
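Assuming a crawler interprets these groups the way Python's standard urllib.robotparser does, the "search yes, training no" configuration above can be checked directly (the URL is a placeholder):

```python
from urllib import robotparser

# The "allow search indexing, block AI training" policy from above.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"
print(rp.can_fetch("Googlebot", url))        # True  -> still indexed
print(rp.can_fetch("Google-Extended", url))  # False -> no Gemini training
```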

Common mistakes

The most dangerous mistake is confusing the two agents. A site owner who adds Disallow: / under User-agent: Googlebot thinking they are blocking AI training will remove their entire site from Google Search.

Another common error is blocking Google-Extended with a wildcard user-agent rule. If your User-agent: * block contains Disallow: /, this applies to Google-Extended but also to every other crawler without a specific block — including AI crawlers you may want to treat differently.
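The wildcard pitfall can also be demonstrated with urllib.robotparser: a User-agent: * group applies to every agent that has no named group of its own, and only an explicit per-agent group overrides it. This is a sketch with placeholder URLs, not a full model of how every crawler resolves precedence.

```python
from urllib import robotparser

# A wildcard block with no per-agent exceptions.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

# The same wildcard block, plus an explicit group for Googlebot.
WITH_EXCEPTION = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

def allowed(robots_txt: str, agent: str,
            url: str = "https://example.com/page") -> bool:
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# The wildcard catches every agent without a group of its own...
print(allowed(WILDCARD_ONLY, "Google-Extended"))   # False
print(allowed(WILDCARD_ONLY, "Googlebot"))         # False
# ...while a named group overrides it for that agent only.
print(allowed(WITH_EXCEPTION, "Googlebot"))        # True
print(allowed(WITH_EXCEPTION, "Google-Extended"))  # False
```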

The safest approach is explicit per-agent rules. Better Robots.txt organizes this through its AI governance module, where Google-Extended has its own toggle separate from Googlebot.

Applebot-Extended follows the same pattern

Apple introduced the same separation: Applebot crawls for Siri and Spotlight, while the Applebot-Extended token controls whether that crawled content may be used to train Apple Intelligence models. Blocking Applebot-Extended does not affect Siri or Spotlight results.
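The robots.txt rules mirror the Google pair. A sketch of the equivalent "discovery yes, training no" policy for Apple's agents:

```
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /
```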

This trend toward agent differentiation, which began with Google-Extended in 2023 and continued with Applebot-Extended in 2024, is the most significant structural change in the AI crawler landscape in years. It gives site owners the ability to say yes to discovery while saying no to training, a distinction that robots.txt alone could not express before these dedicated tokens existed.