How to verify AI agents in your logs without fooling yourself
A claimed user-agent string is not the same thing as a verified agent.
That sounds obvious. But it is still where many teams go wrong.
They see Google-Agent, ChatGPT-User, Claude-SearchBot, or Applebot in the logs and immediately assume identity, purpose, and compliance.
That is too much confidence for too little evidence.
The correct rule is simple:
first classify the request claim, then verify what can actually be verified.
The short version
A practical verification workflow has five steps:
- identify the claimed user-agent string;
- verify with vendor-supported methods when those exist;
- use reverse and forward DNS or published IP ranges where the vendor documents them;
- use CDN or WAF verified-bot signals when available;
- if verification is weak or impossible, say so.
That last step matters as much as the others.
Why user-agent strings are not enough
Major operators themselves warn about this.
Google explicitly cautions that the user-agent string can be spoofed.
That means a log line containing Google-Agent or Googlebot is not proof by itself.
The same principle should be applied to every other operator:
- a claimed string is evidence of an identity claim;
- it is not full proof of authenticity.
This is also why "How to read crawl logs and identify unwanted bots" and this article belong together but are not the same article.
Log reading tells you what the request claims to be. Verification tells you how much you can trust that claim.
What can be verified well
Some operators provide strong public verification methods.
Google
Google documents both user-triggered fetchers and request verification.
For user-triggered fetchers, Google publishes JSON files of the relevant IP ranges and explains that reverse DNS for these requests resolves to hostnames under google.com or gae.googleusercontent.com, depending on the fetcher class.
That gives you a meaningful workflow:
- extract the request IP;
- check whether it belongs to the published range set;
- perform reverse DNS and then forward-confirm the hostname if needed;
- only then classify it as verified Google traffic.
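A minimal sketch of that reverse-then-forward check in Python. The hostname suffixes below are the two discussed above; adjust them to the fetcher class you are verifying, and note that the forward-confirmation call shown here is IPv4-only.

```python
import socket

# Hostname suffixes discussed above; adjust them to the fetcher class you are
# verifying, following Google's documentation.
GOOGLE_SUFFIXES = (".google.com", ".gae.googleusercontent.com")

def verify_by_rdns(ip: str, suffixes: tuple = GOOGLE_SUFFIXES) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS (PTR lookup)
    except socket.herror:
        return False  # no PTR record: cannot verify
    if not hostname.endswith(suffixes):
        return False  # claims an identity, but the PTR is in the wrong domain
    try:
        # Forward-confirm: the hostname must resolve back to the original IP.
        # Note: gethostbyname_ex is IPv4-only; use getaddrinfo for IPv6.
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    return ip in forward_ips
```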
OpenAI
OpenAI publishes IP address JSON files for:
- OAI-SearchBot
- GPTBot
- ChatGPT-User
That means those three surfaces are not limited to a claimed user-agent string. They also have public IP-range references you can use during verification.
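A minimal sketch of checking a request IP against a downloaded range file. The layout assumed here, a top-level "prefixes" list of ipv4Prefix or ipv6Prefix entries, mirrors Google's range files; confirm it against the file you actually download from OpenAI's documentation.

```python
import ipaddress
import json

def load_prefixes(path: str) -> list:
    """Parse a downloaded range file into network objects.

    Assumes a top-level "prefixes" list of {"ipv4Prefix": ...} /
    {"ipv6Prefix": ...} entries; confirm against the real file.
    """
    with open(path) as f:
        data = json.load(f)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def ip_in_published_ranges(ip: str, networks: list) -> bool:
    """True if the request IP falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Usage sketch, with a file downloaded from OpenAI's documentation:
# nets = load_prefixes("searchbot.json")
# ip_in_published_ranges("203.0.113.7", nets)
```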
OpenAI separately documents ChatGPT agent as a signed agent that can be allowlisted in edge ecosystems such as Akamai, Cloudflare, and HUMAN.
That means part of OpenAI verification may happen through your infrastructure provider rather than through origin-log inspection alone.
Apple
Apple says Applebot traffic is generally identified through reverse DNS in the *.applebot.apple.com domain.
It also publishes CIDR prefixes in a JSON file.
That gives Apple a reasonably strong public verification story:
- reverse DNS;
- published prefixes;
- consistent Applebot naming.
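A short sketch of combining both signals, reusing the helpers from the two sketches above. If Apple's prefix file uses a different layout than the loader assumes, adapt the loader first.

```python
def verify_applebot(ip: str, apple_networks: list) -> str:
    """Combine reverse DNS and published-prefix evidence into one verdict."""
    rdns_ok = verify_by_rdns(ip, suffixes=(".applebot.apple.com",))
    range_ok = ip_in_published_ranges(ip, apple_networks)
    if rdns_ok and range_ok:
        return "verified Applebot (rDNS + published prefix)"
    if rdns_ok or range_ok:
        return "partially verified Applebot (one signal only)"
    return "claimed Applebot, unverified"
```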
What is harder to verify
Not every operator publishes the same level of evidence.
Anthropic
Anthropic explicitly says it does not currently publish IP ranges because it uses its service providers' public IPs.
It also warns that IP-based blocking may not work reliably or guarantee a persistent opt-out, because it can prevent Anthropic's crawlers from reading robots.txt.
That means Anthropic traffic is harder to verify strongly through public IP-range methods than Google, OpenAI, or Apple traffic.
So the honest workflow is different:
- read the claimed user-agent string;
- assess request behavior and consistency;
- rely on published policy controls such as robots.txt;
- avoid pretending you have strong signature-level proof when you do not.
This is a place where honesty is more valuable than fake precision.
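Because there is no strong public signature to check, the most you can automate is honest labelling plus weak behavioural signals. A sketch of that follows; the robots.txt check is only an illustrative consistency heuristic, not an Anthropic-documented verification method, and the user-agent token list is illustrative too.

```python
# Illustrative token list; confirm the current user-agent names in Anthropic's docs.
CLAUDE_TOKENS = ("ClaudeBot", "Claude-User", "Claude-SearchBot")

def classify_claimed_anthropic(user_agent: str, paths_seen_from_ip: set) -> str:
    """Label the claim honestly; nothing here amounts to identity verification."""
    if not any(token in user_agent for token in CLAUDE_TOKENS):
        return "not an Anthropic claim"
    # Weak consistency signal only: a well-behaved crawler normally fetches robots.txt.
    if "/robots.txt" in paths_seen_from_ip:
        return "claimed Anthropic agent (behaviour consistent, identity unverified)"
    return "claimed Anthropic agent (unverified)"
```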
Why edge signals matter now
Verification is no longer only an origin-log problem.
Cloudflare’s verified bot system lets you segment traffic by cf.verified_bot_category, including categories such as:
- AI Assistant
- AI Crawler
- AI Search
That is useful because it lets you move from:
"the request says it is something"
to:
"the edge has classified this request in a verified bot framework we can act on".
That does not replace all vendor-specific checks. But it can be one of the cleanest operational signals you have.
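If you export edge logs, you can act on that classification in bulk. A sketch follows, assuming a newline-delimited JSON export in which the edge's verified-bot category appears as a field; the field name used below is an assumption, so confirm it against your export schema.

```python
import json
from collections import Counter

def summarize_verified_bot_categories(log_path: str) -> Counter:
    """Count requests per edge-assigned verified-bot category in an NDJSON export."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            # Field name is an assumption; check your export's schema.
            category = record.get("VerifiedBotCategory") or "unverified / not classified"
            counts[category] += 1
    return counts
```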
It is also one reason "Robots.txt vs signed agent allowlisting" is now a core article rather than a niche technical note.
A practical verification workflow
Use the following sequence.
Step 1. Capture the basics
For each suspicious or important machine request, capture:
- timestamp;
- IP address;
- full user-agent string;
- requested path;
- request rate and pattern;
- response code.
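A minimal sketch of extracting those fields from a combined-format access log. The regex targets the common Apache and Nginx combined format; adjust it if your logs differ.

```python
import re

# Common "combined" layout: IP, timestamp, request line, status, bytes,
# referer, user-agent. Adjust the pattern to your own log format.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<user_agent>[^"]*)"'
)

def parse_access_log_line(line: str):
    """Extract the captured fields from one combined-format log line, or None."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None

# Request rate and pattern come from aggregating these records per IP and
# user-agent over time, not from any single line.
```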
Step 2. Classify the claimed role
Before worrying about authenticity, classify the request claim by role:
- search crawler
- training crawler
- user-triggered fetcher
- answer or retrieval system
- signed or verified agent
- unknown or abusive bot
That makes the next verification step much cleaner.
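A sketch of that first-pass classification by claimed user-agent token. The token-to-role map below is illustrative and deliberately incomplete; extend it from the operators' own documentation.

```python
# Illustrative substring-to-role map; extend it from vendor documentation.
CLAIMED_ROLES = {
    "Googlebot": "search crawler",
    "GPTBot": "training crawler",
    "ChatGPT-User": "user-triggered fetcher",
    "OAI-SearchBot": "answer or retrieval system",
    "Claude-SearchBot": "answer or retrieval system",
    "Applebot": "search crawler",
}

def classify_claimed_role(user_agent: str) -> str:
    """Classify the claim only; this says nothing about authenticity.

    Assumes ordinary browser traffic has already been filtered out.
    """
    for token, role in CLAIMED_ROLES.items():
        if token in user_agent:
            return role
    return "unknown or abusive bot"
```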
Step 3. Apply vendor-supported verification
- Google: published IP-range objects and documented hostname patterns
- OpenAI: published IP JSON files for OAI-SearchBot, GPTBot, and ChatGPT-User
- Apple: reverse DNS and Apple-published CIDRs
- Anthropic: no public IP ranges, so stronger identity verification is limited
Step 4. Check your infrastructure metadata
If you run Cloudflare, Akamai, or another capable edge stack, check whether the request is being surfaced as a verified or trusted bot there.
That may be more reliable than relying on origin logs alone.
Step 5. Be explicit about uncertainty
If a request cannot be strongly verified, do not write policy notes that speak as if it has been fully authenticated.
Say:
- "claimed Anthropic agent"
- "unverified AI-style user-agent"
- "Google request verified by IP and DNS"
- "OpenAI search crawler verified by published range"
That language keeps decisions honest.
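One way to keep that language consistent is to record the verification status as an explicit field rather than free text. A small sketch:

```python
from dataclasses import dataclass

@dataclass
class VerificationVerdict:
    claimed_operator: str  # e.g. "OpenAI", "Anthropic", "unknown"
    claimed_role: str      # output of the claimed-role classification
    status: str            # "verified", "partially verified", or "claimed only"
    evidence: str          # e.g. "IP in published range + rDNS", "user-agent string only"

    def note(self) -> str:
        return f"{self.claimed_operator} / {self.claimed_role}: {self.status} ({self.evidence})"

# VerificationVerdict("Anthropic", "answer or retrieval system",
#                     "claimed only", "user-agent string only").note()
```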
Three common mistakes
1. Treating a user-agent string as full proof
It is not.
It is only the start of verification.
2. Assuming every vendor publishes the same kind of verification data
They do not.
Google, OpenAI, Apple, and Anthropic currently expose different public verification models.
3. Confusing agent authenticity with allowed purpose
A request can be authentic and still be disallowed by your policy.
Verification and permission are separate questions.
Where Better Robots.txt fits
Better Robots.txt does not replace log analysis or edge verification.
What it does do is help you classify requests correctly before turning them into policy.
That matters because the quality of the published policy depends on the quality of the classification behind it.
If you misclassify a user-triggered fetcher as a training crawler, or a signed agent as a normal crawler, the policy will be wrong before it is ever published.
The correct mental model
The safest verification model is this:
- user-agent string = claim
- IP and DNS = stronger evidence when published by the operator
- verified bot metadata at the edge = stronger operational classification
- no public verification method = explicit uncertainty, not invented confidence
That is how you read logs without fooling yourself.