How to verify AI agents in your logs without fooling yourself
A claimed user-agent string is not the same thing as a verified agent.
That sounds obvious. But it is still where many teams go wrong.
They see Google-Agent, ChatGPT-User, Claude-SearchBot, or Applebot in the logs and immediately assume identity, purpose, and compliance.
That is too much confidence for too little evidence.
The correct rule is simple:
first classify the request claim, then verify what can actually be verified.
The short version
A practical verification workflow has five steps:
- identify the claimed user-agent string;
- verify with vendor-supported methods when those exist;
- use reverse and forward DNS or published IP ranges where the vendor documents them;
- use CDN or WAF verified-bot signals when available;
- if verification is weak or impossible, say so.
That last step matters as much as the others.
Why user-agent strings are not enough
Major operators themselves warn about this.
Google explicitly cautions that the user-agent string can be spoofed.
That means a log line containing Google-Agent or Googlebot is not proof by itself.
The same principle should be applied to every other operator:
- a claimed string is evidence of an identity claim;
- it is not full proof of authenticity.
This is also why "How to read crawl logs and identify unwanted bots" and this article belong together but are not the same article.
Log reading tells you what the request claims to be. Verification tells you how much you can trust that claim.
What can be verified well
Some operators provide strong public verification methods.
Google
Google documents both user-triggered fetchers and request verification.
For user-triggered fetchers, Google publishes JSON files of the relevant IP ranges and explains that reverse DNS for these requests resolves to hostnames under google.com or gae.googleusercontent.com, depending on the fetcher class.
That gives you a meaningful workflow:
- extract the request IP;
- check whether it belongs to the published range set;
- perform reverse DNS and then forward-confirm the hostname if needed;
- only then classify it as verified Google traffic.
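A minimal sketch of that reverse-then-forward check in Python. The hostname suffixes below are the two discussed above; adjust them to the fetcher class you are verifying, and note that the forward-confirmation call shown here is IPv4-only.

```python
import socket

# Hostname suffixes discussed above; adjust them to the fetcher class you are
# verifying, following Google's documentation.
GOOGLE_SUFFIXES = (".google.com", ".gae.googleusercontent.com")

def verify_by_rdns(ip: str, suffixes: tuple = GOOGLE_SUFFIXES) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS (PTR lookup)
    except socket.herror:
        return False  # no PTR record: cannot verify
    if not hostname.endswith(suffixes):
        return False  # claims an identity, but the PTR is in the wrong domain
    try:
        # Forward-confirm: the hostname must resolve back to the original IP.
        # Note: gethostbyname_ex is IPv4-only; use getaddrinfo for IPv6.
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    return ip in forward_ips
```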
OpenAI
OpenAI publishes IP address JSON files for:
- OAI-SearchBot
- GPTBot
- ChatGPT-User
That means those three surfaces are not limited to a claimed user-agent string. They also have public IP-range references you can use during verification.
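A minimal sketch of checking a request IP against a downloaded range file. The layout assumed here, a top-level "prefixes" list of ipv4Prefix or ipv6Prefix entries, mirrors Google's range files; confirm it against the file you actually download from OpenAI's documentation.

```python
import ipaddress
import json

def load_prefixes(path: str) -> list:
    """Parse a downloaded range file into network objects.

    Assumes a top-level "prefixes" list of {"ipv4Prefix": ...} /
    {"ipv6Prefix": ...} entries; confirm against the real file.
    """
    with open(path) as f:
        data = json.load(f)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def ip_in_published_ranges(ip: str, networks: list) -> bool:
    """True if the request IP falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Usage sketch, with a file downloaded from OpenAI's documentation:
# nets = load_prefixes("searchbot.json")
# ip_in_published_ranges("203.0.113.7", nets)
```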
OpenAI separately documents ChatGPT agent as a signed agent that can be allowlisted in edge ecosystems such as Akamai, Cloudflare, and HUMAN.
That means part of OpenAI verification may happen through your infrastructure provider rather than through origin-log inspection alone.
Apple
Apple says Applebot traffic is generally identified through reverse DNS in the *.applebot.apple.com domain.
It also publishes CIDR prefixes in a JSON file.
That gives Apple a reasonably strong public verification story:
- reverse DNS;
- published prefixes;
- consistent Applebot naming.
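A short sketch of combining both signals, reusing the helpers from the two sketches above. If Apple's prefix file uses a different layout than the loader assumes, adapt the loader first.

```python
def verify_applebot(ip: str, apple_networks: list) -> str:
    """Combine reverse DNS and published-prefix evidence into one verdict."""
    rdns_ok = verify_by_rdns(ip, suffixes=(".applebot.apple.com",))
    range_ok = ip_in_published_ranges(ip, apple_networks)
    if rdns_ok and range_ok:
        return "verified Applebot (rDNS + published prefix)"
    if rdns_ok or range_ok:
        return "partially verified Applebot (one signal only)"
    return "claimed Applebot, unverified"
```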
What is harder to verify
Not every operator publishes the same level of evidence.
Anthropic
Anthropic explicitly says it does not currently publish IP ranges because it uses its service providers' public IPs.
It also warns that IP-based blocking may not work reliably or guarantee a persistent opt-out, because it can prevent Anthropic's crawlers from reading robots.txt.
That means Anthropic traffic is harder to verify strongly through public IP-range methods than Google, OpenAI, or Apple traffic.
So the honest workflow is different:
- read the claimed user-agent string;
- assess request behavior and consistency;
- rely on published policy controls such as robots.txt;
- avoid pretending you have strong signature-level proof when you do not.
This is a place where honesty is more valuable than fake precision.
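Because there is no strong public signature to check, the most you can automate is honest labelling plus weak behavioural signals. A sketch of that follows; the robots.txt check is only an illustrative consistency heuristic, not an Anthropic-documented verification method, and the user-agent token list is illustrative too.

```python
# Illustrative token list; confirm the current user-agent names in Anthropic's docs.
CLAUDE_TOKENS = ("ClaudeBot", "Claude-User", "Claude-SearchBot")

def classify_claimed_anthropic(user_agent: str, paths_seen_from_ip: set) -> str:
    """Label the claim honestly; nothing here amounts to identity verification."""
    if not any(token in user_agent for token in CLAUDE_TOKENS):
        return "not an Anthropic claim"
    # Weak consistency signal only: a well-behaved crawler normally fetches robots.txt.
    if "/robots.txt" in paths_seen_from_ip:
        return "claimed Anthropic agent (behaviour consistent, identity unverified)"
    return "claimed Anthropic agent (unverified)"
```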
Why edge signals matter now
Verification is no longer only an origin-log problem.
Cloudflare’s verified bot system lets you segment traffic by cf.verified_bot_category, including categories such as:
- AI Assistant
- AI Crawler
- AI Search
That is useful because it lets you move from:
"the request says it is something"
to:
"the edge has classified this request in a verified bot framework we can act on".
That does not replace all vendor-specific checks. But it can be one of the cleanest operational signals you have.
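If you export edge logs, you can act on that classification in bulk. A sketch follows, assuming a newline-delimited JSON export in which the edge's verified-bot category appears as a field; the field name used below is an assumption, so confirm it against your export schema.

```python
import json
from collections import Counter

def summarize_verified_bot_categories(log_path: str) -> Counter:
    """Count requests per edge-assigned verified-bot category in an NDJSON export."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            # Field name is an assumption; check your export's schema.
            category = record.get("VerifiedBotCategory") or "unverified / not classified"
            counts[category] += 1
    return counts
```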
It is also one reason "Robots.txt vs signed agent allowlisting" is now a core article rather than a niche technical note.
A practical verification workflow
Use the following sequence.
Step 1. Capture the basics
For each suspicious or important machine request, capture:
- timestamp;
- IP address;
- full user-agent string;
- requested path;
- request rate and pattern;
- response code.
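A minimal sketch of extracting those fields from a combined-format access log. The regex targets the common Apache and Nginx combined format; adjust it if your logs differ.

```python
import re

# Common "combined" layout: IP, timestamp, request line, status, bytes,
# referer, user-agent. Adjust the pattern to your own log format.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<user_agent>[^"]*)"'
)

def parse_access_log_line(line: str):
    """Extract the captured fields from one combined-format log line, or None."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None

# Request rate and pattern come from aggregating these records per IP and
# user-agent over time, not from any single line.
```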
Step 2. Classify the claimed role
Before worrying about authenticity, classify the request claim by role:
- search crawler
- training crawler
- user-triggered fetcher
- answer or retrieval system
- signed or verified agent
- unknown or abusive bot
That makes the next verification step much cleaner.
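A sketch of that first-pass classification by claimed user-agent token. The token-to-role map below is illustrative and deliberately incomplete; extend it from the operators' own documentation.

```python
# Illustrative substring-to-role map; extend it from vendor documentation.
CLAIMED_ROLES = {
    "Googlebot": "search crawler",
    "GPTBot": "training crawler",
    "ChatGPT-User": "user-triggered fetcher",
    "OAI-SearchBot": "answer or retrieval system",
    "Claude-SearchBot": "answer or retrieval system",
    "Applebot": "search crawler",
}

def classify_claimed_role(user_agent: str) -> str:
    """Classify the claim only; this says nothing about authenticity.

    Assumes ordinary browser traffic has already been filtered out.
    """
    for token, role in CLAIMED_ROLES.items():
        if token in user_agent:
            return role
    return "unknown or abusive bot"
```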
Step 3. Apply vendor-supported verification
- Google: published IP-range objects and documented hostname patterns
- OpenAI: published IP JSON files for OAI-SearchBot, GPTBot, and ChatGPT-User
- Apple: reverse DNS and Apple-published CIDRs
- Anthropic: no public IP ranges, so stronger identity verification is limited
Step 4. Check your infrastructure metadata
If you run Cloudflare, Akamai, or another capable edge stack, check whether the request is being surfaced as a verified or trusted bot there.
That may be more reliable than relying on origin logs alone.
Step 5. Be explicit about uncertainty
If a request cannot be strongly verified, do not write policy notes that speak as if it has been fully authenticated.
Say:
- "claimed Anthropic agent"
- "unverified AI-style user-agent"
- "Google request verified by IP and DNS"
- "OpenAI search crawler verified by published range"
That language keeps decisions honest.
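One way to keep that language consistent is to record the verification status as an explicit field rather than free text. A small sketch:

```python
from dataclasses import dataclass

@dataclass
class VerificationVerdict:
    claimed_operator: str  # e.g. "OpenAI", "Anthropic", "unknown"
    claimed_role: str      # output of the claimed-role classification
    status: str            # "verified", "partially verified", or "claimed only"
    evidence: str          # e.g. "IP in published range + rDNS", "user-agent string only"

    def note(self) -> str:
        return f"{self.claimed_operator} / {self.claimed_role}: {self.status} ({self.evidence})"

# VerificationVerdict("Anthropic", "answer or retrieval system",
#                     "claimed only", "user-agent string only").note()
```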
Three common mistakes
1. Treating a user-agent string as full proof
It is not.
It is only the start of verification.
2. Assuming every vendor publishes the same kind of verification data
They do not.
Google, OpenAI, Apple, and Anthropic currently expose different public verification models.
3. Confusing agent authenticity with allowed purpose
A request can be authentic and still be disallowed by your policy.
Verification and permission are separate questions.
Where Better Robots.txt fits
Better Robots.txt does not replace log analysis or edge verification.
What it does do is help you classify requests correctly before turning them into policy.
That matters because the quality of the published policy depends on the quality of the classification behind it.
If you misclassify a user-triggered fetcher as a training crawler, or a signed agent as a normal crawler, the policy will be wrong before it is ever published.
The correct mental model
The safest verification model is this:
- user-agent string = claim
- IP and DNS = stronger evidence when published by the operator
- verified bot metadata at the edge = stronger operational classification
- no public verification method = explicit uncertainty, not invented confidence
That is how you read logs without fooling yourself.