Not all AI crawlers are equal. Training bots (GPTBot, CCBot) scrape for model data with zero attribution. Retrieval bots (ChatGPT-User, Claude-Web) fetch in real time and cite you. Here's the per-bot policy that maximizes AI visibility.

In 2026, the right AI crawler policy is no longer "allow all" or "block all". It's segmented. Block training scrapers, because your content ends up inside competitors' models with zero attribution back. Allow retrieval bots, because they fetch your pages in real time when a user asks about you and cite you in AI Overviews, ChatGPT responses, Claude answers, and Perplexity citations.

Training bots — recommended BLOCK

These scrape content for model training. Once your content enters a training corpus, you cannot realistically pull it back out, and it earns you no attribution.

GPTBot — OpenAI training (ChatGPT models)
Google-Extended — Google Gemini (formerly Bard) training; a robots.txt control token, separate from Googlebot search
CCBot — Common Crawl (used to train many LLMs)
anthropic-ai — Anthropic training
ClaudeBot — Anthropic training (current)
Bytespider — ByteDance / TikTok training
Amazonbot — Amazon AI training
Meta-ExternalAgent — Meta Llama training
Applebot-Extended — Apple Intelligence training opt-out

Retrieval bots — recommended ALLOW

These fetch on-demand when a user asks the AI assistant about your topic. Blocking = invisibility in AI search.

ChatGPT-User — OpenAI ChatGPT real-time retrieval (browse-the-web feature)
OAI-SearchBot — OpenAI SearchGPT crawler
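Claude-Web — Anthropic Claude retrieval (web_fetch tool); an older token, and Anthropic's current documentation lists Claude-User for user-initiated fetches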
PerplexityBot + Perplexity-User — Perplexity with citations
DuckAssistBot — DuckDuckGo AI Assistant
MistralAI-User — Mistral Le Chat retrieval
YouBot — You.com search assistant
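
The two lists above amount to a lookup table from user-agent token to policy. Here is a minimal Python sketch of that table. The function name `policy_for` and the case-insensitive substring matching are assumptions made for illustration; real crawlers may format their User-Agent headers differently, and some tokens may only ever appear in robots.txt rather than in request headers.

```python
# Tokens copied from the two lists above. The "block"/"allow" labels
# are this article's recommendation, not a standard.
TRAINING_BOTS = (
    "GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot",
    "Bytespider", "Amazonbot", "Meta-ExternalAgent", "Applebot-Extended",
)
RETRIEVAL_BOTS = (
    "ChatGPT-User", "OAI-SearchBot", "Claude-Web", "PerplexityBot",
    "Perplexity-User", "DuckAssistBot", "MistralAI-User", "YouBot",
)


def policy_for(user_agent: str) -> str:
    """Return "block", "allow", or "default" for a User-Agent string.

    Assumption: a case-insensitive substring test is good enough to
    spot a token inside a full header value.
    """
    ua = user_agent.lower()
    if any(token.lower() in ua for token in TRAINING_BOTS):
        return "block"
    if any(token.lower() in ua for token in RETRIEVAL_BOTS):
        return "allow"
    return "default"


if __name__ == "__main__":
    # Hypothetical header values, for illustration only.
    for header in ("Mozilla/5.0 (compatible; GPTBot/1.2)", "ChatGPT-User/1.0"):
        print(header, "->", policy_for(header))
```

Anything that matches neither list falls through to "default", which mirrors the catch-all group a robots.txt file normally ends with.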

Concrete robots.txt template

# Training bots — block
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /

# Retrieval bots — allow
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /

# Everyone else (search engines, normal crawlers)
User-agent: *
Allow: /
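
Before deploying a file like this, it's worth checking that it parses the way you expect. The sketch below uses Python's standard-library `urllib.robotparser` on a shortened excerpt of the template. The helper name `is_allowed` and the `https://example.com/pricing` URL are placeholders. Note that this parser applies its own matching rules (first matching rule wins, user-agent tokens matched as substrings), so treat the result as a sanity check rather than a guarantee of how each crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the template, enough to exercise each section.
ROBOTS_TXT = """\
# Training bots: block
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /

# Retrieval bots: allow
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /

# Everyone else
User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/pricing"  # placeholder URL
    for bot in ("GPTBot", "ClaudeBot", "ChatGPT-User", "PerplexityBot", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, url) else "blocked"
        print(f"{bot}: {verdict}")
```

With this excerpt, the training bots should come back blocked, while the retrieval bots and ordinary search crawlers such as Googlebot should come back allowed.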

Trade-off (be honest with yourself)

If your business model depends on maximum AI exposure (e.g., publisher with ad revenue from AI Overview clicks), allow training too. The cost of being absent from AI memory might exceed the IP-leakage cost. There's no universal right answer.

For most businesses (SaaS, agencies, consultancies, e-commerce), block training = win. Your content trains competitors' models otherwise.

How AuditOPE scores this

Our P2.6 phase (shipped in v0.19.3) emits separate findings: geo-ai-retrieval-blocked (HIGH severity: you're losing AI citations) and geo-ai-training-allowed (LOW severity: your content is being scraped for free). 17 bots tracked. Run a free check at auditope.com.
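
To make the two findings concrete, here is a rough Python sketch of how a check of this kind could work. It is not AuditOPE's actual implementation: the `audit` function, the shape of the finding dictionaries, and the use of `urllib.robotparser` are all assumptions made for illustration. Only the finding IDs and severities come from the description above, and the bot lists are the ones from this article.

```python
from urllib.robotparser import RobotFileParser

# Bot lists taken from this article; the real tool's list may differ.
TRAINING_BOTS = [
    "GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot",
    "Bytespider", "Amazonbot", "Meta-ExternalAgent", "Applebot-Extended",
]
RETRIEVAL_BOTS = [
    "ChatGPT-User", "OAI-SearchBot", "Claude-Web", "PerplexityBot",
    "Perplexity-User", "DuckAssistBot", "MistralAI-User", "YouBot",
]


def audit(robots_txt: str, site_root: str = "https://example.com/") -> list[dict]:
    """Return one hypothetical finding per misconfigured bot.

    A retrieval bot that cannot fetch the site root is reported as HIGH;
    a training bot that can fetch it is reported as LOW.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    findings = []
    for bot in RETRIEVAL_BOTS:
        if not parser.can_fetch(bot, site_root):
            findings.append(
                {"id": "geo-ai-retrieval-blocked", "severity": "HIGH", "bot": bot}
            )
    for bot in TRAINING_BOTS:
        if parser.can_fetch(bot, site_root):
            findings.append(
                {"id": "geo-ai-training-allowed", "severity": "LOW", "bot": bot}
            )
    return findings


if __name__ == "__main__":
    # An "allow all" file: every training bot should be flagged as LOW.
    for finding in audit("User-agent: *\nAllow: /\n"):
        print(finding)
```

The asymmetry in severity follows the article's argument: a blocked retrieval bot costs you citations right now, while an allowed training bot is a slower, less visible cost.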

AI crawler robots.txt: block training, allow retrieval (2026 best practice)
