How to read crawl logs and identify unwanted bots
Your robots.txt can only block bots you know about. And the only reliable way to know which bots visit your site is to read your server logs. Everything else — analytics dashboards, plugin reports, traffic estimates — is a filtered, incomplete view. The logs are the ground truth.
Where to find your logs
On most WordPress hosting environments, access logs are stored in a standard location. On Apache servers, the file is typically at /var/log/apache2/access.log or in your hosting control panel under "Raw Access Logs." On Nginx servers, it is usually at /var/log/nginx/access.log. Managed WordPress hosts like WP Engine, Kinsta, or Cloudways provide log access through their dashboards or via SFTP.
If you use Cloudflare, Sucuri, or another CDN or security layer, be aware that some bot traffic may be filtered before it reaches your origin server. In that case, the CDN's analytics or logs provide a more complete picture than your server logs alone.
Anatomy of a log line
A typical Apache access log entry looks like this:
66.249.68.42 - - [15/Mar/2026:08:23:17 +0000] "GET /blog/ai-crawlers/ HTTP/1.1" 200 15234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The fields that matter for bot identification are:
The IP address (66.249.68.42) identifies the requesting server. For known bots, this can be cross-referenced against published IP ranges to verify authenticity.
The request path (/blog/ai-crawlers/) shows which page was requested. Patterns in requested paths reveal which sections of your site attract the most bot attention.
The status code (200) shows whether the request succeeded. A high number of 404 responses from a specific bot indicates it is following outdated links or probing for paths that do not exist.
The user-agent string (the last quoted field) is how the bot identifies itself. This is the most important field for bot identification, though it can be spoofed.
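The fields above can be pulled apart with standard tools. A minimal sketch using awk on the sample entry shown earlier (field positions assume the common combined log format; adjust if your server logs a different format):

```shell
# The sample log entry from above, stored in a shell variable.
line='66.249.68.42 - - [15/Mar/2026:08:23:17 +0000] "GET /blog/ai-crawlers/ HTTP/1.1" 200 15234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

# Whitespace-separated fields: $1 = IP, $7 = request path, $9 = status code.
echo "$line" | awk '{print "ip:     " $1; print "path:   " $7; print "status: " $9}'

# The user-agent is easier to isolate by splitting on double quotes:
# it is then field 6.
echo "$line" | awk -F'"' '{print "agent:  " $6}'
```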
Identifying bot categories
By filtering your logs for the user-agent field, you can categorize all bot traffic into actionable groups:
Search engine crawlers identify themselves clearly: Googlebot, Bingbot, YandexBot, Baiduspider. These are the bots you almost always want to allow. Their IP ranges are publicly documented, which means you can verify that a request claiming to be Googlebot actually comes from Google.
AI crawlers use specific agent strings: GPTBot, ClaudeBot, CCBot, Bytespider, Google-Extended, PerplexityBot, Applebot-Extended, FacebookBot, Amazonbot, Meta-ExternalAgent. Each represents a distinct operator with different crawl behaviors and different reasons for accessing your content.
SEO tools and competitive scrapers include AhrefsBot, SemrushBot, DotBot, MJ12bot, and dozens of others. These bots index your site for their respective platforms' competitive analysis databases. Whether to allow them is a business decision — they provide no direct benefit to your site.
Archive bots include ia_archiver (Internet Archive), archive.org_bot, and similar. They create historical snapshots of your site for the Wayback Machine and similar services.

Unidentified or generic agents are the most concerning category. Bots that use a generic browser user-agent (pretending to be Chrome or Firefox) or that provide no identifying information are either poorly configured, deliberately deceptive, or malicious. Legitimate bots identify themselves. Bots that hide their identity rarely have good reasons for doing so.
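The five categories above can be turned into a small classifier. A sketch in shell; the pattern lists are illustrative subsets of each category, not exhaustive:

```shell
# Map a user-agent string to one of the categories discussed above.
# Patterns are examples only; extend them with the bots you see in your logs.
classify() {
  case "$1" in
    *Googlebot*|*Bingbot*|*YandexBot*|*Baiduspider*)              echo "search"  ;;
    *GPTBot*|*ClaudeBot*|*CCBot*|*Bytespider*|*PerplexityBot*)    echo "ai"      ;;
    *AhrefsBot*|*SemrushBot*|*DotBot*|*MJ12bot*)                  echo "seo"     ;;
    *ia_archiver*|*archive.org_bot*)                              echo "archive" ;;
    *)                                                            echo "unknown" ;;
  esac
}

classify "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"    # ai
classify "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"  # seo
classify "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"                          # unknown
```

Anything that falls into "unknown" despite generating sustained traffic deserves a closer look at its IPs and request patterns.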
Practical log analysis
You do not need specialized tools for a basic bot audit. On a Linux server, a few commands extract the information you need:
To see all unique user agents in your log: sort the user-agent field, remove duplicates, and count occurrences. This gives you a ranked list of every bot that visited your site, ordered by frequency.
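As a concrete sketch of that pipeline (the sample log entries are hypothetical; point the last command at your real access log instead):

```shell
# Build a tiny sample log for demonstration.
cat > /tmp/agents_sample.log <<'EOF'
198.51.100.1 - - [15/Mar/2026:08:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"
198.51.100.1 - - [15/Mar/2026:08:00:02 +0000] "GET /blog/ HTTP/1.1" 200 900 "-" "GPTBot/1.0"
203.0.113.8 - - [15/Mar/2026:08:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "AhrefsBot/7.0"
EOF

# Field 6 (split on double quotes) is the user-agent; rank by frequency.
awk -F'"' '{print $6}' /tmp/agents_sample.log | sort | uniq -c | sort -rn
```

The output lists each agent with its request count, most frequent first.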
To see which paths a specific bot requested: filter the log for that user-agent string and extract the request path. This reveals whether the bot is focused on your important content or is crawling low-value pages.
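A sketch of that filter, again on hypothetical sample data (GPTBot is used as the example agent):

```shell
# Sample log: two GPTBot requests to one path, one to another, plus noise.
cat > /tmp/paths_sample.log <<'EOF'
198.51.100.1 - - [15/Mar/2026:08:00:01 +0000] "GET /blog/ai-crawlers/ HTTP/1.1" 200 900 "-" "GPTBot/1.0"
198.51.100.1 - - [15/Mar/2026:08:00:02 +0000] "GET /blog/ai-crawlers/ HTTP/1.1" 200 900 "-" "GPTBot/1.0"
198.51.100.1 - - [15/Mar/2026:08:00:03 +0000] "GET /tag/misc/ HTTP/1.1" 200 400 "-" "GPTBot/1.0"
203.0.113.8 - - [15/Mar/2026:08:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "AhrefsBot/7.0"
EOF

# Filter for the agent, extract the request path (field 7), rank by frequency.
grep 'GPTBot' /tmp/paths_sample.log | awk '{print $7}' | sort | uniq -c | sort -rn
```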
To see the request volume over time: count requests per hour or per day for each bot. A sudden spike in requests from a specific agent is worth investigating — it may indicate a new training run, a misconfigured crawler, or a scraping attack.
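Hourly counts can be produced from the timestamp field directly, since the hour is a fixed-width prefix of it. A sketch on sample data:

```shell
# Sample log: two GPTBot requests in hour 08, one in hour 09.
cat > /tmp/hours_sample.log <<'EOF'
198.51.100.1 - - [15/Mar/2026:08:10:01 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"
198.51.100.1 - - [15/Mar/2026:08:45:02 +0000] "GET /blog/ HTTP/1.1" 200 900 "-" "GPTBot/1.0"
198.51.100.1 - - [15/Mar/2026:09:05:03 +0000] "GET /about/ HTTP/1.1" 200 700 "-" "GPTBot/1.0"
EOF

# Field 4 is "[15/Mar/2026:08:10:01"; characters 2-15 are the date and hour.
grep 'GPTBot' /tmp/hours_sample.log | awk '{print substr($4, 2, 14)}' | sort | uniq -c
```

A sudden jump in one hour's count is the spike worth investigating.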
To detect agent spoofing: cross-reference the IP address of a request claiming to be Googlebot against Google's published IP ranges. If the IP does not match, the request is not from Googlebot, regardless of what the user-agent string says.
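The first step, collecting every IP that claims a given identity, is a one-liner; the verification itself is a DNS lookup per IP. A sketch (the 203.0.113.99 entry is a hypothetical spoofer; the lookup commands are shown as comments because they need network access):

```shell
# Sample log: one genuine-looking Googlebot IP and one suspicious claimant.
cat > /tmp/spoof_sample.log <<'EOF'
66.249.68.42 - - [15/Mar/2026:08:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.99 - - [15/Mar/2026:08:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# List every unique IP whose user-agent claims to be Googlebot.
awk -F'"' '$6 ~ /Googlebot/' /tmp/spoof_sample.log | awk '{print $1}' | sort -u

# Verify each IP with a reverse lookup, then confirm the forward lookup:
#   host 66.249.68.42        # should resolve to a *.googlebot.com hostname
#   host <returned hostname> # should resolve back to the same IP
# An IP that fails either step is not Googlebot, whatever the agent string says.
```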
Turning data into rules
The goal of log analysis is to produce actionable robots.txt rules. The process is:
- Identify bots that visit your site regularly.
- Categorize each bot: search engine, AI crawler, SEO tool, archive service, or unknown.
- For each category, decide whether to allow or restrict access.
- Add specific User-agent blocks in robots.txt for bots you want to restrict.
- For bots that ignore robots.txt or spoof their identity, escalate to server-level blocks (IP bans, rate limiting, or WAF rules).
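The result of steps 3 and 4 might look like this in robots.txt. The policy shown (restrict AI training crawlers and SEO tools, leave search engines untouched) is only an example; which bots you restrict is your own business decision:

```
# Restrict AI training crawlers (example policy)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Restrict SEO tools (example policy)
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Search engine crawlers need no rule: anything not named above stays allowed.
```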
Better Robots.txt includes a curated database of known bots organized by category, which maps directly to the categories you see in your logs. This makes the translation from log analysis to robots.txt configuration straightforward: identify the bot in your logs, find it in the plugin's bot list, and set the appropriate policy.
Establishing a routine
A single log audit is useful. A regular audit is transformative. New bots appear constantly. Existing bots change their behavior. Crawl volumes shift as AI companies launch new training cycles or new products.
The minimum cadence is quarterly. For high-traffic sites or sites with valuable original content, monthly is better. Each audit takes 15 to 30 minutes and produces direct, measurable improvements to your crawl governance posture.
The sites that control their bot traffic are the sites that know their bot traffic. Everything starts with reading the logs.