AI crawler robots.txt: block training, allow retrieval (2026 best practice)
Not all AI crawlers are equal. Training bots (GPTBot, CCBot) scrape content for model training data with zero attribution. Retrieval bots (ChatGPT-User, Claude-Web) fetch in real time and cite you. Here's the per-bot policy that maximizes AI visibility.
In 2026, the right AI crawler policy is no longer "allow all" or "block all". It's segmented: block training scrapers (your content becomes competitor model IP with zero attribution back), allow retrieval bots (they fetch in real time when a user asks about you and cite you in AI Overviews, ChatGPT responses, Claude answers, Perplexity citations).
Training bots — recommended BLOCK
These scrape content for model training. Once your content enters a training corpus, you can't pull it back out, and the models trained on it won't credit you.
- GPTBot: OpenAI training (ChatGPT models)
- Google-Extended: Google Bard/Gemini training (separate from Googlebot search)
- CCBot: Common Crawl (used to train many LLMs)
- anthropic-ai: Anthropic training
- ClaudeBot: Anthropic training (current)
- Bytespider: ByteDance / TikTok training
- Amazonbot: Amazon AI training
- Meta-ExternalAgent: Meta Llama training
- Applebot-Extended: Apple Intelligence training opt-out
Retrieval bots — recommended ALLOW
These fetch on-demand when a user asks the AI assistant about your topic. Blocking = invisibility in AI search.
- ChatGPT-User: OpenAI ChatGPT real-time retrieval (browse-the-web feature)
- OAI-SearchBot: OpenAI SearchGPT crawler
- Claude-Web: Anthropic Claude retrieval (web_fetch tool)
- PerplexityBot + Perplexity-User: Perplexity with citations
- DuckAssistBot: DuckDuckGo AI Assistant
- MistralAI-User: Mistral Le Chat retrieval
- YouBot: You.com search assistant
Concrete robots.txt template
# Training bots: block
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Retrieval bots: allow
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Everyone else (search engines, normal crawlers)
User-agent: *
Allow: /
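Before deploying a template like this, it can help to check what it actually permits. The sketch below parses an abbreviated copy of the policy with Python's standard-library `urllib.robotparser` and reports which user agents may fetch the site root. The helper name `allowed_agents` and the shortened `SAMPLE` text are assumptions made for this example, and real crawlers may match user-agent tokens differently from this parser, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the template (a few bots per category, for illustration only).
SAMPLE = """\
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str = "/") -> dict[str, bool]:
    """Return {agent: may_fetch} for each user agent under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "ChatGPT-User", "PerplexityBot", "Googlebot"]
    for agent, ok in allowed_agents(SAMPLE, agents).items():
        print(f"{agent:15} {'allowed' if ok else 'blocked'}")
```

Running the same check against the full file on a staging host is a cheap way to catch a typo in a user-agent name before it silently blocks a retrieval bot.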
Trade-off (be honest with yourself)
If your business model depends on maximum AI exposure (e.g., publisher with ad revenue from AI Overview clicks), allow training too. The cost of being absent from AI memory might exceed the IP-leakage cost. There's no universal right answer.
For most businesses (SaaS, agencies, consultancies, e-commerce), block training = win. Your content trains competitors' models otherwise.
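In practice this trade-off comes down to a single switch: the same bot lists, with training bots either blocked or allowed. A minimal sketch of generating both variants follows. The function name `build_robots_txt` and its `block_training` parameter are hypothetical, and the bot lists simply mirror the template in this article.

```python
# Bot names copied from the template in this article.
TRAINING_BOTS = [
    "GPTBot", "Google-Extended", "CCBot", "ClaudeBot", "anthropic-ai",
    "Bytespider", "Amazonbot", "Meta-ExternalAgent", "Applebot-Extended",
]
RETRIEVAL_BOTS = [
    "ChatGPT-User", "OAI-SearchBot", "Claude-Web", "PerplexityBot", "Perplexity-User",
]


def build_robots_txt(block_training: bool = True) -> str:
    """Render a robots.txt body; training bots are blocked unless block_training is False."""
    training_rule = "Disallow: /" if block_training else "Allow: /"
    lines: list[str] = []
    for bot in TRAINING_BOTS:
        lines += [f"User-agent: {bot}", training_rule, ""]
    for bot in RETRIEVAL_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    # Catch-all group for search engines and other crawlers.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(block_training=True))
```

Calling `build_robots_txt(block_training=False)` produces the maximum-exposure variant for publishers who decide the visibility is worth the cost.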
How AuditOPE scores this
Our P2.6 phase (shipped in v0.19.3) emits separate findings: geo-ai-retrieval-blocked (HIGH severity: you're losing AI citations) and geo-ai-training-allowed (LOW severity: your content is being scraped for free). 17 bots tracked. Run a free check at auditope.com.
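To make the two findings concrete, here is a simplified illustration of the idea. It is not the shipped implementation: the function name `ai_crawler_findings` is hypothetical, the bot lists are short subsets rather than the 17 tracked bots, and it relies on Python's standard-library `urllib.robotparser`, whose matching rules may differ from those of real crawlers.

```python
from urllib.robotparser import RobotFileParser

# Short illustrative subsets, not the full tracked list.
TRAINING_BOTS = ["GPTBot", "CCBot", "ClaudeBot"]
RETRIEVAL_BOTS = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]


def ai_crawler_findings(robots_txt: str) -> list[tuple[str, str, str]]:
    """Return (finding_id, severity, bot) tuples for a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    findings = []
    # A blocked retrieval bot means lost citations: high severity.
    for bot in RETRIEVAL_BOTS:
        if not parser.can_fetch(bot, "/"):
            findings.append(("geo-ai-retrieval-blocked", "HIGH", bot))
    # An allowed training bot means free scraping: low severity.
    for bot in TRAINING_BOTS:
        if parser.can_fetch(bot, "/"):
            findings.append(("geo-ai-training-allowed", "LOW", bot))
    return findings


if __name__ == "__main__":
    sample = "User-agent: ChatGPT-User\nDisallow: /\nUser-agent: GPTBot\nDisallow: /\n"
    for finding in ai_crawler_findings(sample):
        print(finding)
```

In the sample, ChatGPT-User is blocked and so raises the HIGH finding, while CCBot and ClaudeBot have no rule at all, are allowed by default, and each raise the LOW finding.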