robots.txt for AI bots: block training, allow retrieval (2026 best practice)
Not all AI crawlers are alike. Training bots (GPTBot, CCBot) scrape content as model training data, with no attribution. Retrieval bots (ChatGPT-User, Claude-Web) fetch pages in real time and cite you. Here is a per-bot policy that maximizes your AI visibility.
In 2026, the right AI crawler policy is no longer "allow all" or "block all". It is segmented. Block training scrapers, because your content ends up inside a competitor's model with zero attribution back to you. Allow retrieval bots, because they fetch your pages in real time when a user asks about you and cite you in AI Overviews, ChatGPT responses, Claude answers, and Perplexity citations.
Training bots — recommended BLOCK
These scrape content for model training. Once your content enters a training corpus, you cannot pull it back out.
GPTBot: OpenAI training (ChatGPT models)
Google-Extended: Google Bard/Gemini training (separate from Googlebot search)
CCBot: Common Crawl (used to train many LLMs)
anthropic-ai: Anthropic training
ClaudeBot: Anthropic training (current)
Bytespider: ByteDance / TikTok training
Amazonbot: Amazon AI training
Meta-ExternalAgent: Meta Llama training
Applebot-Extended: Apple Intelligence training opt-out
Retrieval bots — recommended ALLOW
These fetch pages on demand when a user asks the AI assistant about your topic. Blocking them makes you invisible in AI search.
ChatGPT-User: OpenAI ChatGPT real-time retrieval (browse-the-web feature)
OAI-SearchBot: OpenAI SearchGPT crawler
Claude-Web: Anthropic Claude retrieval (web_fetch tool)
PerplexityBot + Perplexity-User: Perplexity with citations
DuckAssistBot: DuckDuckGo AI Assistant
MistralAI-User: Mistral Le Chat retrieval
YouBot: You.com search assistant
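To see which of these bots actually hit your site, you can bucket the User-Agent headers from your access logs. The sketch below is one way to do that in Python. The function name `classify_user_agent` and the case-insensitive substring match are assumptions made for illustration, and the sample header strings are illustrative rather than exact copies of what each vendor sends.

```python
# Tokens copied from the two lists in this article.
TRAINING_BOTS = (
    "GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot",
    "Bytespider", "Amazonbot", "Meta-ExternalAgent", "Applebot-Extended",
)
RETRIEVAL_BOTS = (
    "ChatGPT-User", "OAI-SearchBot", "Claude-Web", "PerplexityBot",
    "Perplexity-User", "DuckAssistBot", "MistralAI-User", "YouBot",
)


def classify_user_agent(user_agent: str) -> str:
    """Return "training", "retrieval" or "other" for a User-Agent header.

    Assumption: a case-insensitive substring match on the bot token is
    good enough for a first pass over access logs.
    """
    ua = user_agent.lower()
    if any(token.lower() in ua for token in TRAINING_BOTS):
        return "training"
    if any(token.lower() in ua for token in RETRIEVAL_BOTS):
        return "retrieval"
    return "other"


if __name__ == "__main__":
    # Illustrative header values, not verbatim vendor strings.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0",
    ]
    for sample in samples:
        print(classify_user_agent(sample), "<-", sample)
```

Keep in mind that Google-Extended and Applebot-Extended are robots.txt control tokens rather than crawler identities, so they will not show up in request headers; they only matter inside robots.txt itself.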
Concrete robots.txt template
# Training bots — block
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Retrieval bots — allow
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# Everyone else (search engines, normal crawlers)
User-agent: *
Allow: /
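Before deploying a file like this, it is worth checking that it parses the way you expect. The following is a minimal sketch using Python's standard `urllib.robotparser`. The shortened `ROBOTS_TXT` sample, the helper name `is_allowed`, and the example.com URL are placeholders for illustration. Python's parser is only one interpretation of the rules, so each real crawler may still behave differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the template above, enough to exercise each case.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, bot: str,
               url: str = "https://example.com/pricing") -> bool:
    """Parse robots.txt text and ask whether `bot` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "CCBot", "ChatGPT-User", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot) else "blocked"
        print(f"{bot}: {verdict}")
```

Googlebot is included to confirm that the catch-all `User-agent: *` group still lets ordinary search crawlers through.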
Trade-off (be honest with yourself)
If your business model depends on maximum AI exposure (e.g., publisher with ad revenue from AI Overview clicks), allow training too. The cost of being absent from AI memory might exceed the IP-leakage cost. There's no universal right answer.
For most businesses (SaaS, agencies, consultancies, e-commerce), blocking training is the better default. Otherwise your content trains competitors' models.
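Because the decision comes down to a single switch, one option is to generate the file from the bot lists instead of editing it by hand. This is a rough sketch, not a prescribed tool: `build_robots_txt` and its `block_training` flag are hypothetical names, and the lists are the ones given earlier in this article.

```python
TRAINING_BOTS = (
    "GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot",
    "Bytespider", "Amazonbot", "Meta-ExternalAgent", "Applebot-Extended",
)
RETRIEVAL_BOTS = (
    "ChatGPT-User", "OAI-SearchBot", "Claude-Web", "PerplexityBot",
    "Perplexity-User", "DuckAssistBot", "MistralAI-User", "YouBot",
)


def build_robots_txt(block_training: bool = True) -> str:
    """Render one robots.txt group per bot.

    block_training=True  -> training bots get "Disallow: /"
    block_training=False -> training bots get "Allow: /" (maximum exposure)
    Retrieval bots and the catch-all group are always allowed.
    """
    training_rule = "Disallow: /" if block_training else "Allow: /"
    groups = [f"User-agent: {bot}\n{training_rule}" for bot in TRAINING_BOTS]
    groups += [f"User-agent: {bot}\nAllow: /" for bot in RETRIEVAL_BOTS]
    groups.append("User-agent: *\nAllow: /")
    # Blank line between groups, trailing newline at end of file.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(block_training=True))
```

A publisher who decides that exposure outweighs leakage would call `build_robots_txt(block_training=False)`; everyone else keeps the default.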
How AuditOPE scores this
Our P2.6 phase (shipped in v0.19.3) emits two separate findings: geo-ai-retrieval-blocked (HIGH severity: you're losing AI citations) and geo-ai-training-allowed (LOW severity: your content is being scraped for free). 17 bots tracked. Run a free check at auditope.com.
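To make the two findings concrete, here is a hypothetical sketch of a check of this kind. It is not AuditOPE's actual implementation: the function `audit_robots_txt`, the shape of the finding dicts, and the use of Python's `urllib.robotparser` are all assumptions. Only the finding IDs, the severities, and the bot lists come from this article.

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOTS = (
    "GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot",
    "Bytespider", "Amazonbot", "Meta-ExternalAgent", "Applebot-Extended",
)
RETRIEVAL_BOTS = (
    "ChatGPT-User", "OAI-SearchBot", "Claude-Web", "PerplexityBot",
    "Perplexity-User", "DuckAssistBot", "MistralAI-User", "YouBot",
)


def audit_robots_txt(robots_txt: str,
                     url: str = "https://example.com/") -> list[dict]:
    """Return one finding per bot whose access contradicts the policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    findings = []
    for bot in RETRIEVAL_BOTS:
        if not parser.can_fetch(bot, url):
            findings.append({"id": "geo-ai-retrieval-blocked",
                             "severity": "HIGH", "bot": bot})
    for bot in TRAINING_BOTS:
        if parser.can_fetch(bot, url):
            findings.append({"id": "geo-ai-training-allowed",
                             "severity": "LOW", "bot": bot})
    return findings


if __name__ == "__main__":
    # A site that blocks everything loses every retrieval bot.
    for finding in audit_robots_txt("User-agent: *\nDisallow: /\n"):
        print(finding)
```

An empty robots.txt produces the opposite result: no retrieval bot is blocked, but every training bot is allowed.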
Want a similar analysis of your site?
Run a free audit →