
Block AI Training Crawlers With robots.txt: GPTBot, ClaudeBot, and More

2026-06-04 5 min read

AI companies crawl websites to train their models. With robots.txt, you can allow or block specific crawlers by name. Here is how to manage AI crawler access.

Since 2023, a growing number of AI companies have been crawling the web to gather training data for their large language models. If you'd rather your content not be used to train AI systems without your permission, robots.txt is the most practical tool you have right now.

How AI crawlers identify themselves

Web crawlers announce themselves with a user-agent string. AI companies have started publishing the names of their bots so site owners can block them via robots.txt. The major ones in 2026:

  • GPTBot: OpenAI's training crawler
  • ChatGPT-User: OpenAI's browsing bot (live queries, not training)
  • Google-Extended: Google's AI training token (a robots.txt control separate from Googlebot, not a crawler with its own user-agent string)
  • CCBot: Common Crawl, the source for many open datasets
  • anthropic-ai: Anthropic's crawler (an older token; its current training crawler identifies as ClaudeBot)
  • cohere-ai: Cohere's crawler
  • FacebookBot: Meta's AI training crawler
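
As a rough illustration of how these names show up in practice, the sketch below checks a request's User-Agent header for the tokens listed above. The function name `identify_ai_crawler` and the sample header strings are made up for this example, and real user-agent strings vary, so treat it as a starting point rather than a reliable detector.

```python
from typing import Optional

# Tokens from the list above. Google-Extended is left out on purpose: it is a
# robots.txt control token and is not sent as a request user-agent.
AI_CRAWLER_TOKENS = (
    "GPTBot",
    "ChatGPT-User",
    "CCBot",
    "anthropic-ai",
    "cohere-ai",
    "FacebookBot",
)


def identify_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the first known AI crawler token found in the header, else None."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        # Case-insensitive substring match against the full header value.
        if token.lower() in ua:
            return token
    return None


# Hypothetical header values, for illustration only.
print(identify_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))       # GPTBot
print(identify_ai_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # None
```

A substring match like this is easy to spoof in either direction, which is one reason the user-agent name alone is a weak signal.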

The robots.txt rules to add

# Block OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Block Google AI training (keep regular Googlebot allowed)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block Anthropic (older and current crawler names)
User-agent: anthropic-ai
User-agent: ClaudeBot
Disallow: /

# Block Cohere
User-agent: cohere-ai
Disallow: /
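
Because every block follows the same two-line pattern, the rules can also be generated from a list of bot names. The following is a minimal sketch: the helper name `build_block_rules` is hypothetical, and the token list simply mirrors the rules above.

```python
def build_block_rules(tokens):
    """Return robots.txt text that disallows the whole site for each token."""
    groups = []
    for token in tokens:
        # One group per bot: a User-agent line followed by a site-wide Disallow.
        groups.append(f"User-agent: {token}\nDisallow: /\n")
    # Joining with a newline leaves a blank line between groups.
    return "\n".join(groups)


BLOCKED = ["GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "cohere-ai"]
print(build_block_rules(BLOCKED))
```

Keeping the names in one list makes it easier to add or drop a bot later without hand-editing the file.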

An important distinction

Note that GPTBot (training) and ChatGPT-User (live browsing) are different bots. Blocking GPTBot tells OpenAI's training crawler to stay away, so pages it would otherwise fetch are not collected for training data. It doesn't prevent ChatGPT from browsing your site when a user actively asks it to visit a URL. If you want to block both, add a separate rule for ChatGPT-User as well.

Google-Extended is a separate robots.txt token from Googlebot. Blocking Google-Extended tells Google not to use your content for Gemini (formerly Bard) training, without affecting your Google Search rankings. You probably want Googlebot to keep crawling.
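
One way to sanity-check both distinctions is to feed a set of rules to Python's standard-library `urllib.robotparser` and ask what each user-agent may fetch. This is a sketch under assumptions: `example.com` is a placeholder, and the parser's matching only approximates how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# A trimmed-down rule set: block OpenAI training and Google AI training only.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

URL = "https://example.com/some-article"  # placeholder URL
for agent in ("GPTBot", "ChatGPT-User", "Google-Extended", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, URL) else "blocked"
    print(f"{agent}: {verdict}")
```

With these rules, GPTBot and Google-Extended come back blocked while ChatGPT-User and Googlebot stay allowed. Adding a `User-agent: ChatGPT-User` group with `Disallow: /` to `RULES` should flip that bot to blocked as well.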

The honest limitation

Robots.txt is a voluntary protocol. Legitimate companies like OpenAI and Google have stated they respect it. Smaller, less reputable AI scrapers may ignore it entirely. There's no technical enforcement.

Use the Robots.txt Generator to build a clean robots.txt file with the specific bot blocks you need, then host it at yoursite.com/robots.txt.

