
Applebot vs Applebot-Extended: search crawl, downstream use, and the line between them

Apple is one of the clearest examples of why a modern machine-access policy needs more than one mental bucket.

There is a search crawler. And there is a secondary control surface for downstream data use.

Those two surfaces are:

  • Applebot
  • Applebot-Extended

If you confuse them, you can accidentally block Apple visibility when your real objective was only to limit Apple’s generative AI training use.

The short version

Here is the cleanest way to think about the two Apple surfaces.

Surface: Applebot
What it is for: search crawl and rendering for Apple search-related surfaces
Main question: do I want this site discoverable in Apple search experiences like Spotlight, Siri, and Safari?

Surface: Applebot-Extended
What it is for: downstream data-use control for Apple foundation models
Main question: do I allow Apple to use crawled site content to train generative foundation models?

That separation is the point.

What Applebot actually controls

Applebot is Apple’s main web crawler.

Apple says the data crawled by Applebot is used to power various search-related features across Apple’s ecosystem, including Spotlight, Siri, and Safari. It also notes that enabling Applebot in robots.txt allows website content to appear in search results for Apple users in those products.

So when the business question is:

  • do we want Apple users to discover this content;
  • do we want Apple search surfaces to crawl and index the site;
  • do we want Apple to render and understand the page correctly;

the relevant surface is Applebot.

Apple also documents several important operational points:

  • Applebot generally respects standard robots.txt directives targeted at Applebot;
  • if robots.txt rules do not mention Applebot but do mention Googlebot, Applebot follows the Googlebot rules;
  • Applebot may render the website in a browser-like environment, which means JavaScript, CSS, and other required resources should not be blocked by mistake.

This last point matters more than many teams think.

If Applebot cannot access the resources required to render the page, Apple warns that it may not be able to understand the page properly.
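Here is a minimal robots.txt sketch (the paths are placeholders) that targets Applebot explicitly, so the Googlebot fallback never applies:

    # Explicit group for Apple's search crawler.
    # If this group is absent, Applebot follows any Googlebot rules instead.
    User-agent: Applebot
    Disallow: /private/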

What Applebot-Extended actually controls

Applebot-Extended is not the search crawler.

It is Apple’s secondary user agent for controlling data usage.

Apple says publishers can use Applebot-Extended to opt out of having their website content used to train Apple foundation models that power generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.

That makes it the right surface when the real question is:

  • do we want to refuse Apple’s model-training use;
  • do we want to stay visible in Apple search surfaces while limiting downstream AI use;
  • do we want a cleaner split between discoverability and generative reuse?

The most important detail is this:

Applebot-Extended does not crawl webpages.

Apple explicitly says pages that disallow Applebot-Extended can still be included in search results. Applebot-Extended is only used to determine how Apple uses the data already crawled by Applebot.

That makes Apple one of the easiest ecosystems to explain when teaching the distinction between crawl and downstream use.

Which Apple control should you use?

Use the following decision path.

Goal: remain visible in Apple search experiences

Keep Applebot open.

If you block Apple’s search crawler, you should expect search discoverability across those Apple surfaces to suffer.

Goal: refuse Apple model-training use while staying discoverable

Disallow Applebot-Extended, not Applebot.

That is the clean Apple equivalent of "keep search, refuse training".
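In robots.txt terms, that split is just two user-agent groups (a minimal sketch; the Applebot group is only needed if you want the permission stated explicitly):

    # Search crawl stays open: content remains discoverable in Apple search surfaces.
    User-agent: Applebot
    Allow: /

    # Downstream use refused: crawled content is not used to train Apple foundation models.
    User-agent: Applebot-Extended
    Disallow: /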

Goal: control indexing or snippet behavior

Remember that crawl access is not the only layer.

Applebot also supports robots meta tags such as noindex, nosnippet, nofollow, and none for page-level indexing and preview behavior.
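For example, this page-level tag leaves crawl access open while refusing indexing and previews:

    <!-- Page can still be crawled, but should not be indexed or shown with a snippet. -->
    <meta name="robots" content="noindex, nosnippet">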

Goal: ensure Apple understands the page correctly

Do not accidentally block the JavaScript, CSS, XHR, or other assets the page requires to render cleanly.
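A quick self-check against the robots.txt, with placeholder paths: if required assets live under a disallowed tree, re-allow them explicitly.

    User-agent: Applebot
    Disallow: /internal/
    # Rendering resources under the disallowed tree must stay reachable,
    # or Applebot may crawl the page without understanding it.
    Allow: /internal/css/
    Allow: /internal/js/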

Three common mistakes

1. Blocking Applebot when the real goal was only to refuse AI training

This is the classic category error.

If discoverability still matters, do not use the search crawler as a substitute for the downstream-use control.
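The two policies look almost identical in robots.txt but have very different outcomes (placeholder rules):

    # The mistake: removes the site from Apple search surfaces entirely.
    User-agent: Applebot
    Disallow: /

    # The intent: stays discoverable, refuses training use.
    User-agent: Applebot-Extended
    Disallow: /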

2. Forgetting that Applebot renders pages

When required assets are blocked, the page may be crawled but not understood well.

That weakens the outcome even when the main crawler remains allowed.

3. Expecting Applebot-Extended to control search visibility

It does not.

Apple says it does not crawl webpages and that blocked pages can still appear in search results.

Where Better Robots.txt fits

Better Robots.txt helps on the policy side of this distinction.

It helps teams:

  • separate search crawl from downstream training-use control;
  • avoid destroying discoverability by overblocking the wrong Apple surface;
  • keep that distinction coherent with the rest of your machine-access policy.

It does not replace page-level indexing controls, nor does it replace infrastructure enforcement.

But it does help you publish a much clearer decision.

The correct mental model

The safest Apple mental model is this:

  • Applebot = search crawl and rendering
  • Applebot-Extended = downstream data-use control for Apple generative AI training

If those are separated clearly, Apple becomes one of the easiest ecosystems to govern without contradiction.