Roughly 30-50% of all internet traffic in 2026 is bots — search engine crawlers, monitoring tools, scrapers, API clients, attackers, and a long tail of automation. For most application developers, “is this a bot?” is a constant background question for logs, analytics, rate limiting, and security.
This post covers the practical techniques for bot detection: the easy signals (User-Agent), the medium-strength signals (IP/ASN, behavior), and the hard-to-defeat techniques (fingerprinting, attestation). It also covers the trade-offs of each and how to combine them.
Why Detect Bots
Different bots warrant different responses:
- Search engines (Googlebot, Bingbot) — Serve normally. They drive your SEO.
- Monitoring (Pingdom, UptimeRobot, Datadog) — Allow. They’re your friends.
- Aggressive scrapers — Rate-limit or block. Cost real money to serve.
- Vulnerability scanners — Block. Probing for weaknesses.
- Credential stuffers — Block hard. Definitely malicious.
- API consumers — Allow if authenticated; rate-limit otherwise.
- AI training crawlers — Increasingly controversial; some block, some allow, some negotiate.
The decision isn’t binary. It’s “what kind of bot, and what’s the appropriate response.”
Layer 1: User-Agent
The most basic signal. Bots typically identify themselves in the User-Agent string:
User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)
User-Agent: Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm)
User-Agent: Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
User-Agent: python-requests/2.31.0
User-Agent: curl/8.4.0
Pattern matching catches the obvious cases:
const BOT_PATTERNS = /bot|crawler|spider|wget|curl|python-requests|httpclient|java\//i
function isLikelyBot(ua: string): boolean {
  return BOT_PATTERNS.test(ua)
}
Pros:
- Easy.
- Catches the cooperative bots that identify themselves honestly.
Cons:
- Trivially spoofable. A malicious bot sends Mozilla/5.0 (Windows...) and looks human.
- Misses headless browsers running real Chrome.
- High false-positive rate for legitimate API SDKs.
For the cooperative-bot use case (search engines, monitoring), UA is enough. For everything else, it’s the start of a longer pipeline.
Layer 2: Verify Self-Identifying Bots
Major search engines provide ways to verify their bots:
Reverse DNS for Googlebot
host 66.249.66.1
# Output: 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
The reverse DNS must end in googlebot.com or google.com. Verify forward DNS also points back to the original IP. This is the standard verification flow Google publishes.
Bingbot
Similar pattern with search.msn.com.
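The reverse-then-forward check described above can be sketched as follows. This is a minimal, IPv4-only sketch under stated assumptions: the `DnsResolver` interface, `CRAWLER_SUFFIXES`, and `verifyCrawlerIp` are hypothetical names, and the suffix list only contains the domains mentioned in this post. The resolver is passed in so the logic can be exercised without network access.

```typescript
// Minimal resolver shape (an assumption for this sketch).
// Node's `dns.promises` object provides `reverse` and `resolve4` with these signatures.
interface DnsResolver {
  reverse(ip: string): Promise<string[]>;
  resolve4(hostname: string): Promise<string[]>;
}

// Suffixes taken from the text above; check each crawler's own documentation before relying on them.
const CRAWLER_SUFFIXES: Record<string, string[]> = {
  googlebot: [".googlebot.com", ".google.com"],
  bingbot: [".search.msn.com"],
};

async function verifyCrawlerIp(
  ip: string,
  crawler: string,
  dns: DnsResolver,
): Promise<boolean> {
  const suffixes = CRAWLER_SUFFIXES[crawler];
  if (!suffixes) return false;
  try {
    // Step 1: the reverse (PTR) lookup must land in one of the crawler's domains.
    const hostnames = await dns.reverse(ip);
    for (const raw of hostnames) {
      const host = raw.toLowerCase().replace(/\.$/, "");
      if (!suffixes.some((suffix) => host.endsWith(suffix))) continue;
      // Step 2: the forward lookup of that hostname must include the original IP.
      // Without this step, anyone who controls their own PTR record could pass.
      const addresses = await dns.resolve4(host);
      if (addresses.includes(ip)) return true;
    }
  } catch {
    // Lookup failures (NXDOMAIN, timeouts) are treated as "not verified".
  }
  return false;
}
```

In a Node service, something like `verifyCrawlerIp(req.ip, "googlebot", dns.promises)` should work. DNS lookups are slow relative to request handling, so caching the result per IP is probably worthwhile.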
Token-based for newer crawlers
Some newer crawlers (OpenAI’s GPTBot, Anthropic’s ClaudeBot, etc.) publish their IP ranges or use signed requests.
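When a crawler publishes IP ranges, verification reduces to a CIDR membership test. Below is a minimal IPv4-only sketch; the function names are hypothetical, and the ranges in the usage note are documentation placeholders (TEST-NET addresses), not any crawler's real ranges. Real lists should be fetched from the operator's published source and refreshed periodically.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside the IPv4 block written as "a.b.c.d/prefix".
function inCidr(ip: string, cidr: string): boolean {
  const [base, prefixText] = cidr.split("/");
  const prefix = /^\d{1,2}$/.test(prefixText ?? "") ? Number(prefixText) : NaN;
  if (!Number.isInteger(prefix) || prefix > 32) return false;
  const ipInt = ipv4ToInt(ip);
  const baseInt = base === undefined ? null : ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  // Dividing by the block size discards the host bits, leaving only the network prefix.
  const blockSize = 2 ** (32 - prefix);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

function isInPublishedRanges(ip: string, ranges: string[]): boolean {
  return ranges.some((cidr) => inCidr(ip, cidr));
}
```

For example, `isInPublishedRanges("192.0.2.55", ["192.0.2.0/24", "198.51.100.0/24"])` is true, while an address outside both blocks is not. Arithmetic division is used instead of bit shifts because JavaScript bitwise operators work on signed 32-bit values.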
The verified-bot list is what you allow without rate limiting; everyone else falls through to deeper detection.
Layer 3: IP and ASN
Beyond UA, the IP and ASN say a lot:
Hosting ASNs
Real users mostly browse from residential and mobile ISPs. Traffic from AWS, GCP, Azure, OVH, or Hetzner is very likely automated, although VPNs and corporate proxies also egress from hosting networks. The Ip2Geo API returns ASN classification inline.
const result = await convertIP(ip)
if (result.data.asn.type === 'hosting') {
  // Hosting traffic. Bot probability: very high.
}
Tor exits
Traffic from Tor exit nodes is disproportionately abusive or evasive, though some Tor users are legitimate. See proxy types explained for the full landscape.
Known scraper providers
Residential-proxy networks (Bright Data, Smartproxy) deliberately use residential ASNs. Detection is harder; see residential proxies explained.
Geographic anomalies
Your service is US-only. Suddenly a flood of traffic from a country you don’t serve. Even if the UA looks legitimate, the geographic mismatch is a signal.
Layer 4: Behavior
Bots act differently from humans, even if they spoof everything else:
Request rate
A human reads pages. A bot fetches them. 100 requests per second from a single IP isn’t a human.
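One way to measure request rate per IP is a sliding-window counter. The sketch below is in-memory and single-process; the class name, the one-second window, and the threshold of 20 requests per second are illustrative assumptions rather than recommended values. Timestamps are passed in explicitly so the logic is easy to test.

```typescript
class SlidingWindowCounter {
  private readonly hits = new Map<string, number[]>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record one request for `key` and return how many requests that key
  // has made within the last `windowMs` milliseconds (including this one).
  record(key: string, nowMs: number): number {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.hits.set(key, recent);
    return recent.length;
  }
}

const perIpCounter = new SlidingWindowCounter(1000);

// Hypothetical helper: flags an IP once it exceeds `maxPerSecond` requests in one second.
function exceedsRate(ip: string, nowMs: number, maxPerSecond = 20): boolean {
  return perIpCounter.record(ip, nowMs) > maxPerSecond;
}
```

This version never evicts idle keys, so a long-running service would need periodic cleanup or a shared store such as Redis when running on multiple instances.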
Request patterns
Humans browse in tree-like patterns (home → category → product). Bots often crawl linearly (page 1, 2, 3, 4…) or do depth-first traversal.
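The linear-crawl pattern can be approximated by checking how often consecutive requests in a session step from page N to page N+1. This is a rough sketch: it assumes pagination appears as `?page=N` or `/page/N` in the path, which will not hold for every site, and the function names are hypothetical.

```typescript
// Extract a page number from paths like "/list?page=3" or "/blog/page/3"; null if absent.
function pageNumber(path: string): number | null {
  const match = /(?:[?&]page=|\/page\/)(\d+)/.exec(path);
  return match ? Number(match[1]) : null;
}

// Fraction of consecutive request pairs where the page number increases by exactly 1.
// Values near 1 suggest a linear crawl; values near 0 suggest ordinary browsing.
function sequentialStepRatio(paths: string[]): number {
  let steps = 0;
  let sequential = 0;
  let prev: number | null = null;
  let first = true;
  for (const path of paths) {
    const curr = pageNumber(path);
    if (!first) {
      steps++;
      if (prev !== null && curr !== null && curr === prev + 1) sequential++;
    }
    prev = curr;
    first = false;
  }
  return steps === 0 ? 0 : sequential / steps;
}
```

A session of `/list?page=1` through `/list?page=4` scores 1, while a home, category, product sequence scores 0. The ratio only means something once a session has enough requests, so it is best combined with a minimum request count.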
Time-of-day
Human traffic follows time zones. Bot traffic is steady through the day.
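One way to quantify "steady through the day" is the coefficient of variation (standard deviation divided by mean) of hourly request counts. The sketch below is illustrative only: the function names are hypothetical, hours are bucketed in UTC, and any threshold for "too flat" would have to come from your own traffic.

```typescript
// Bucket request timestamps (milliseconds since epoch) into 24 UTC-hour counts.
function hourlyCounts(timestampsMs: number[]): number[] {
  const counts = new Array<number>(24).fill(0);
  for (const t of timestampsMs) {
    const hour = new Date(t).getUTCHours();
    counts[hour] = (counts[hour] ?? 0) + 1;
  }
  return counts;
}

// Standard deviation divided by mean. Near 0 means flat, machine-like traffic;
// larger values mean a pronounced daily rhythm.
function coefficientOfVariation(counts: number[]): number {
  if (counts.length === 0) return 0;
  const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
  if (mean === 0) return 0;
  const variance =
    counts.reduce((acc, c) => acc + (c - mean) ** 2, 0) / counts.length;
  return Math.sqrt(variance) / mean;
}
```

A client that sends exactly one request every hour scores 0. A client active only during a nine-hour working day scores well above 1. Sparse data also looks flat, so this signal is only meaningful for clients with substantial volume over at least a full day.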
Path distribution
A scraper hits product pages disproportionately. A human samples the site more broadly.
Sessionless requests
No cookies, no consistent session ID, every request stateless. Probably automated.
Interaction signals
Real users scroll, click, hover. Bots typically don’t generate mouse-move events.
For behavior, you need to instrument your application. A few requests’ data tells you little; an hour’s data per session tells you a lot.
Layer 5: JavaScript Challenges
The next escalation: require the client to execute JavaScript.
Naive challenge
Serve a page that requires JS to display the real content. Bots that don’t run JS show empty pages.
Defeated by: any headless browser (for example, headless Chromium driven by Puppeteer or Playwright). Any serious bot runs JS in 2026.
Cryptographic proof of work
The server gives the client a small puzzle (hash with N leading zeros). The client must compute it before getting the real content. Adds CPU cost to the client.
Slows down massive scraping but doesn’t stop it.
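A minimal sketch of the puzzle, assuming SHA-256 and a difficulty expressed in leading zero bits. It uses Node's built-in `crypto` module for both the verifier and a reference solver; a browser client would compute the same hash with Web Crypto instead. The function names and the `challenge:nonce` encoding are assumptions made for this example.

```typescript
import { createHash } from "node:crypto";

// Count the leading zero bits of a digest.
function leadingZeroBits(digest: Uint8Array): number {
  let bits = 0;
  for (let i = 0; i < digest.length; i++) {
    const byte = digest[i] ?? 0;
    if (byte === 0) {
      bits += 8;
      continue;
    }
    // Math.clz32 counts zeros over 32 bits; a byte occupies only the low 8.
    bits += Math.clz32(byte) - 24;
    break;
  }
  return bits;
}

// Server side: a single hash checks the client's answer.
function verifyPow(challenge: string, nonce: number, difficultyBits: number): boolean {
  const digest = createHash("sha256").update(`${challenge}:${nonce}`).digest();
  return leadingZeroBits(digest) >= difficultyBits;
}

// Client side: brute-force search, roughly 2^difficultyBits hashes on average.
function solvePow(challenge: string, difficultyBits: number): number {
  let nonce = 0;
  while (!verifyPow(challenge, nonce, difficultyBits)) nonce++;
  return nonce;
}
```

The asymmetry is the point: solving costs thousands of hashes while verifying costs one. In practice the challenge should be random, server-issued, single-use, and short-lived, otherwise a bot can solve once and replay the answer.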
Browser fingerprinting
JS collects browser characteristics — canvas rendering, fonts, WebGL, audio context, navigator properties — and sends a fingerprint to the server. Genuine browsers have consistent fingerprints; headless or modified browsers have telltale differences.
This is what Cloudflare’s “checking your browser” page does, with a sophisticated fingerprinting + heuristic stack. Effective against most bots.
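Commercial stacks are far more elaborate, but the idea of checking a fingerprint for internal inconsistencies can be sketched on the server side. The `ClientSignals` payload and its field names are hypothetical: they stand for values a client-side script might read from `navigator` and post back. The checks are a handful of commonly cited tells, not a complete or robust list.

```typescript
// Hypothetical payload reported by a client-side script; field names are assumptions.
interface ClientSignals {
  userAgent: string; // navigator.userAgent
  webdriver: boolean; // navigator.webdriver
  languages: string[]; // navigator.languages
  platform: string; // navigator.platform
  headerUserAgent: string; // User-Agent header the server received
}

// Return human-readable reasons the reported signals look automated or inconsistent.
function fingerprintRedFlags(s: ClientSignals): string[] {
  const flags: string[] = [];
  if (s.webdriver) flags.push("navigator.webdriver is true");
  if (/HeadlessChrome/i.test(s.userAgent)) flags.push("headless Chrome user agent");
  if (s.languages.length === 0) flags.push("navigator.languages is empty");
  if (s.userAgent !== s.headerUserAgent) {
    flags.push("JS user agent differs from header user agent");
  }
  if (/Windows/i.test(s.userAgent) && !/^Win/i.test(s.platform)) {
    flags.push("user agent claims Windows but platform does not");
  }
  return flags;
}
```

Every value here is reported by the client, so a determined bot can forge all of it. The flags are better treated as inputs to a score than as grounds for blocking on their own.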
CAPTCHA
The classic. Show an image puzzle, an audio puzzle, or a click challenge. Real users solve them; bots (mostly) can’t.
Modern reality: CAPTCHA is increasingly defeated by AI. The friction for human users is real. Modern services use invisible challenges (reCAPTCHA v3, Cloudflare Turnstile) that only fall back to interactive challenges when other signals are weak.
Layer 6: Hardware Attestation
The frontier: prove the client is real hardware, not an emulator.
App Attest (iOS) and Play Integrity (Android)
Mobile apps can request a token from Apple’s or Google’s services that attests “this is a real iPhone / Android phone, this app is the real signed binary, this device is not jailbroken.” The server validates the token against Apple/Google’s public keys.
Bots can’t forge these tokens without compromising Apple’s or Google’s signing infrastructure, which is not a realistic attack.
Trusted Platform Module (TPM)
Desktop equivalent via TPM-backed attestation. Less widely deployed for web use cases as of 2026.
Browser-side: WebAuthn / Passkeys
Not quite attestation but adjacent — proves a cryptographic key bound to a hardware authenticator.
For mobile apps with high-value workflows (financial, identity), attestation is becoming standard. For web, it’s not there yet at scale.
Combining Signals
A real production bot-detection pipeline:
async function classifyTraffic(req: Request): Promise<BotClassification> {
  const ua = req.headers['user-agent'] ?? ''
  // 1. Verified bots (allow normally)
  if (await isVerifiedSearchBot(req.ip, ua)) return 'verified-bot'
  // 2. Self-identified bots (rate-limit appropriately)
  if (BOT_PATTERNS.test(ua)) return 'declared-bot'
  // 3. Hosting/Tor traffic (high bot probability)
  const geo = await convertIP(req.ip)
  if (geo.data?.asn?.type === 'hosting') return 'likely-bot'
  if (geo.data?.asn?.type === 'tor') return 'tor-traffic'
  // 4. Behavioral signals from session
  const session = await getSessionStats(req.sessionId)
  if (session && session.requestsPerMinute > 60) return 'abusive'
  if (session && session.noInteractionEvents) return 'likely-bot'
  // 5. Default: probably human
  return 'human'
}
Each layer adds confidence; none is sufficient alone.
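The pipeline above returns at the first matching rule. An alternative that reflects "each layer adds confidence" more literally is an additive score, sketched below. The `Signals` shape and every weight are illustrative assumptions, not tuned values; only the 60 requests-per-minute threshold is borrowed from the pipeline above.

```typescript
// Hypothetical per-request signal summary gathered by the earlier layers.
interface Signals {
  declaredBotUa: boolean;
  hostingAsn: boolean;
  requestsPerMinute: number;
  hasInteractionEvents: boolean;
  hasSessionCookie: boolean;
}

// Sum integer points per signal and cap at 100. Weights are placeholders
// that would need calibrating against labeled traffic.
function botScore(s: Signals): number {
  let score = 0;
  if (s.declaredBotUa) score += 50;
  if (s.hostingAsn) score += 30;
  if (s.requestsPerMinute > 60) score += 30;
  if (!s.hasInteractionEvents) score += 20;
  if (!s.hasSessionCookie) score += 10;
  return Math.min(100, score);
}
```

With these placeholder weights, a hosting-ASN client with no interaction events and no cookie scores 60, while no single weak signal pushes a request past 50. Scoring trades the simplicity of early returns for the ability to set separate thresholds for rate-limiting, challenging, and blocking.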
Handling Different Classifications
Once you’ve classified traffic, the response depends on the kind:
- verified-bot: Serve normally. Don’t rate-limit search engines.
- declared-bot: Allow but rate-limit. Respect robots.txt.
- likely-bot: Rate-limit aggressively. Maybe challenge.
- abusive: Block. Log for analysis.
- tor-traffic: Challenge or rate-limit; some users are legitimate. See how Tor works.
- human: Serve normally.
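The mapping above can be encoded as a lookup table so that every classification is guaranteed to have a response. In this sketch the `BotClassification` union mirrors the labels used in this post, while the `Action` type, `RESPONSE_POLICY`, and `actionFor` are hypothetical names. Where the text offers two options, one was picked arbitrarily.

```typescript
type BotClassification =
  | "verified-bot"
  | "declared-bot"
  | "likely-bot"
  | "abusive"
  | "tor-traffic"
  | "human";

type Action = "serve" | "rate-limit" | "challenge" | "block";

// Record<BotClassification, Action> makes the compiler reject a missing label.
const RESPONSE_POLICY: Record<BotClassification, Action> = {
  "verified-bot": "serve",
  "declared-bot": "rate-limit",
  "likely-bot": "rate-limit", // the text also allows a challenge here
  abusive: "block",
  "tor-traffic": "challenge", // the text also allows rate-limiting here
  human: "serve",
};

function actionFor(classification: BotClassification): Action {
  return RESPONSE_POLICY[classification];
}
```

Keeping the policy in data rather than in branching code makes it easier to review and to change per route, for example a stricter table for login endpoints than for public pages.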
Robots.txt and Bot Etiquette
A side note: well-behaved bots respect robots.txt:
User-agent: *
Crawl-delay: 10
Disallow: /private/
User-agent: GPTBot
Disallow: /
Major search engines honor it. Many AI training bots claim to honor it (with varying compliance). Malicious scrapers ignore it. robots.txt isn’t a defense — it’s a request that good bots will follow.
The Cat-and-Mouse Reality
Bot detection is a permanent arms race:
- Better detection emerges.
- Sophisticated bots learn to defeat it.
- Detection vendors update.
- Bots iterate.
In 2026, the leaders in commercial bot detection are Cloudflare, Akamai, PerimeterX (now part of HUMAN Security), DataDome, and Imperva. They invest heavily in this arms race; your in-house solution can’t match their scale of cross-customer data.
For most production sites, “use a commercial bot management product” is the realistic answer for the hard cases. Build your own detection for the easy cases (UA, ASN, basic behavior); buy for the hard ones.
TL;DR
- 30-50% of internet traffic is bots. Detection is constant work.
- Layered signals: User-Agent → ASN/IP → behavior → JS challenges → attestation.
- Verified bots (Googlebot, Bingbot) — allow via reverse-DNS verification.
- Hosting ASNs — almost always bots; rate-limit.
- Behavior — request rate, patterns, interaction events.
- JS challenges filter low-effort bots; headless browsers defeat naive challenges.
- Fingerprinting and hardware attestation are the strongest tiers.
- Commercial bot management is the practical choice for high-stakes services.
- Different responses for different bot classes — don’t block monitoring or search engines.
Bot detection is one of those areas where naive solutions don’t scale and sophisticated solutions are commercial products. For most sites, the right architecture is “easy cases handled in-house, hard cases via Cloudflare/Akamai/etc.” The IP/ASN layer is one of the cheap building blocks; the Ip2Geo API returns ASN classification inline. For related topics, see IP-based fraud detection and IP reputation scores.