
AI Crawler Rules Builder & robots.txt Generator

Generate SEO-safe robots.txt rules for AI crawlers, search bots, and archive bots. Block training bots without accidentally blocking search visibility.

Choose a strategy

  • SEO: full traditional search visibility
  • AI Search: AI answer and search visibility preserved
  • Training: opted out of AI model training


Search Engine — blocking impacts SEO

  • Googlebot: Google's primary web crawler for indexing content in Google Search. Default: Allow

AI Search / AI Answers — blocking impacts AI visibility

  • OAI-SearchBot: OpenAI's crawler for indexing content for ChatGPT search features. Default: Allow
  • Claude-SearchBot: Anthropic's crawler for indexing content for Claude's web search and answer features. Default: Allow
  • PerplexityBot: Perplexity AI's crawler for indexing content to use in Perplexity answers and search. Default: Allow

AI Training — safe to block

  • Google-Extended: a separate robots.txt control token for certain Google AI training and Bard/Gemini grounding uses, independent of Google Search. Default: Block
  • GPTBot: OpenAI's crawler for collecting training data for GPT models. Default: Block
  • ClaudeBot: Anthropic's crawler for collecting training and model grounding data for Claude. Default: Block

User-Triggered Fetch — advisory only

  • ChatGPT-User: user-triggered fetcher used when a ChatGPT user requests a live URL to be fetched. Default: Allow
  • Claude-User: user-triggered fetcher used when a Claude user causes a live URL to be fetched in a session. Default: Allow
  • Perplexity-User: user-triggered fetcher used when a Perplexity user causes a live URL fetch. Default: Allow

Archive / Dataset — safe to block

  • CCBot (Common Crawl): Common Crawl's open web crawler that builds public datasets used by many AI labs for training. Default: Block
Advanced settings (paths, wildcard group, merge mode)

  • Block scope
  • Add section comments: adds explanatory comments to the generated robots.txt
  • Add User-agent: * group: adds a wildcard group covering unlisted crawlers
  • Existing robots.txt merge: paste your current robots.txt to append or replace

robots.txt

# AI Training
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Archive / Dataset
User-agent: CCBot
Disallow: /

# All other crawlers not listed above are allowed by default

Minimal AI section only (paste into existing robots.txt)

# AI Training
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Archive / Dataset
User-agent: CCBot
Disallow: /

This tool runs entirely in your browser. No live robots.txt files are fetched, no URLs are visited, and no configuration is sent to a server. All output is generated locally from the registry of documented crawlers.

Overview

A few years ago, managing a robots.txt file was straightforward: allow Googlebot, maybe allow Bingbot, and optionally block a handful of known scrapers. Today, the decision is far more nuanced. Search engines, AI answer engines, AI training pipelines, user-triggered AI fetchers, and open-web archive crawlers all have distinct user-agents — and blocking the wrong one can silently remove your content from search results, AI citations, or both.

The most common mistake is treating "block AI bots" as a single action. In practice, there are at least four separate categories of crawler with AI-related user-agent tokens: traditional search (Googlebot), AI search/answer indexing (OAI-SearchBot, PerplexityBot, Claude-SearchBot), AI model training (GPTBot, ClaudeBot, Google-Extended, CCBot), and user-triggered retrieval (ChatGPT-User, Claude-User, Perplexity-User). Each has different implications for SEO, discoverability, and content governance.

Beyond robots.txt, enforcement is not uniform. Well-behaved crawlers respect Disallow directives. Some crawlers — particularly user-triggered ones — may not check robots.txt consistently before fetching a live URL on behalf of a human user. For those cases, Cloudflare AI Crawl Control or WAF rules provide a stronger enforcement layer. This tool helps you build the right robots.txt policy as a starting point, then explains where to layer on additional enforcement if needed.
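As a concrete illustration of how a compliant crawler evaluates these directives, Python's standard-library robots.txt parser can be run against a generated ruleset. The rules below are a trimmed sketch of this tool's training-block output, not its exact file:

```python
from urllib.robotparser import RobotFileParser

# A trimmed "block training, allow everything else" ruleset.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A listed training bot is denied; an unlisted bot such as Googlebot
# falls through to the implicit allow-all.
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

This also demonstrates why a wildcard group matters: without a `User-agent: *` entry, every crawler you did not name is allowed by default.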

Use cases

When to use it

  • Blocking AI training crawlers while preserving search visibility: use the 'Search + No Training' preset to block GPTBot, ClaudeBot, Google-Extended, and CCBot while keeping Googlebot and AI search bots active.
  • Deciding whether AI answer engines should access your content: OAI-SearchBot, Claude-SearchBot, and PerplexityBot are the crawlers that feed AI search and answer experiences. The tool lets you allow or block each independently.
  • Understanding the difference between Googlebot and Google-Extended: Googlebot controls Google Search indexing. Google-Extended is a separate robots.txt token for certain AI training and grounding uses. They are completely independent — blocking Google-Extended does not affect Google Search.
  • Generating clean robots.txt for deployment: copy or download the generated robots.txt. Use the minimal AI section to append just the AI rules to an existing file without replacing your current configuration.
  • Understanding Cloudflare AI Crawl Control for stronger enforcement: the Deployment tab generates Cloudflare guidance and optional Nginx/Apache snippets for cases where robots.txt alone may not be sufficient.
  • Configuring path-level blocking for mixed-content sites: use the custom path option to block specific folders (e.g. /private-docs/) rather than the whole site, which is useful for documentation sites and media platforms.
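For instance, a path-level rule keeps a training bot out of one folder while leaving the rest of the site crawlable (the folder name here is illustrative):

```
User-agent: GPTBot
Disallow: /private-docs/
```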

When it's not enough

  • Using this to block private or sensitive content: robots.txt is a declaration to well-behaved crawlers. It cannot prevent non-compliant bots, scrapers, or direct URL requests. Do not use it as a security measure for private content.
  • Blocking Googlebot when the goal was only AI training opt-out: blocking Googlebot removes your content from Google Search. If your goal is to opt out of Google AI training only, block Google-Extended instead. They are completely separate tokens.
  • Expecting user-triggered bots to reliably respect robots.txt: ChatGPT-User, Claude-User, and Perplexity-User fire in response to human requests and may not check robots.txt before fetching. For these cases, WAF or Cloudflare enforcement is more reliable.
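One server-side fallback for those fetchers is plain User-Agent substring matching before a response is served. A minimal sketch follows; the `should_block` helper and token list are illustrative, not a specific framework's API:

```python
# Tokens whose requests we refuse at the application layer.
# robots.txt still declares the policy; this enforces it.
BLOCKED_UA_TOKENS = ("ChatGPT-User", "Claude-User", "Perplexity-User")

def should_block(user_agent: str) -> bool:
    """Return True if the request's User-Agent contains a blocked token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_UA_TOKENS)

# In real middleware, a True result would map to an HTTP 403 response.
print(should_block("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))  # True
print(should_block("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))   # False
```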

How to use it

  1. Choose a preset or custom strategy

     'Search + No Training' is the recommended starting point for most publishers. It keeps traditional SEO and AI answer visibility intact while opting out of training and archive collection.

  2. Review which crawlers are allowed and blocked

     Expand any bot row to see its purpose, operator, and the effect of blocking it. Toggle individual bots to fine-tune the policy beyond the preset defaults.

  3. Check warnings about SEO and discoverability consequences

     The Warnings tab surfaces critical issues (like Googlebot being accidentally blocked) alongside informational notes about advisory-only robots.txt enforcement.

  4. Add your sitemap URL in Advanced Settings

     The Sitemap: line in robots.txt helps all crawlers discover your full content.

  5. Copy or download the robots.txt

     Publish the file at https://yourdomain.com/robots.txt. Note that subdomains need their own robots.txt files.

  6. Add Cloudflare AI Crawl Control for stronger enforcement

     For user-triggered bots and non-compliant crawlers, Cloudflare AI Crawl Control provides verified-bot blocking at the edge — stronger and more reliable than robots.txt alone.

Common errors and fixes

Blocking Googlebot when the goal was to block AI training

Googlebot controls Google Search indexing. If you want to opt out of Google AI training, block Google-Extended instead. These are completely separate tokens. Disabling Google-Extended does not affect your Google Search presence.

Assuming Google-Extended controls AI Overviews or Gemini answers

Google-Extended controls certain AI training and model grounding uses but does not currently control whether your content appears in AI Overviews or Gemini search results. Its scope is specific to training pipelines as documented by Google.

Thinking GPTBot and OAI-SearchBot do the same thing

GPTBot is a training crawler — blocking it opts your content out of OpenAI model training. OAI-SearchBot is a search/discoverability crawler — blocking it reduces your content's presence in ChatGPT search answers. They are independent and serve different purposes.

Blocking all AI bots with User-agent: * Disallow: / and then wondering why SEO disappeared

A wildcard Disallow: / blocks all crawlers not otherwise explicitly listed, including Googlebot. Always add an explicit User-agent: Googlebot / Allow: / section before deploying a wildcard block if you need search visibility.
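A deliberate wildcard lockdown that still preserves Google Search would pair the groups like this (sketch; compliant parsers match each crawler to its most specific User-agent group, so Googlebot ignores the wildcard):

```
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
```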

Relying on robots.txt alone to block user-triggered fetchers

ChatGPT-User, Claude-User, and Perplexity-User fire in response to human user actions and may not consistently check robots.txt before fetching. For reliable enforcement, use Cloudflare AI Crawl Control, WAF custom rules, or server-side user-agent matching.
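Where server-side matching is the chosen fallback, a minimal Nginx sketch looks like this (token list illustrative; place inside the relevant server block):

```nginx
# Return 403 to user-triggered AI fetchers by User-Agent match.
# ~* is a case-insensitive regex match; extend the list as needed.
if ($http_user_agent ~* "(ChatGPT-User|Claude-User|Perplexity-User)") {
    return 403;
}
```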

Forgetting the Sitemap: line

Adding Sitemap: https://example.com/sitemap.xml to robots.txt helps all crawlers discover your full content index. This is a low-effort improvement that benefits both traditional search and AI search crawlers.

Treating robots.txt as a privacy control

robots.txt is a public declaration readable by anyone. Non-compliant bots can ignore it entirely. For private content, use authentication, access controls, and server-level restrictions.

Frequently asked questions

How do I block AI training bots without hurting my SEO?

The key is treating AI training crawlers and search crawlers as completely separate. The safe approach for most publishers is to explicitly block training-specific tokens — GPTBot, ClaudeBot, Google-Extended, and CCBot — while keeping Googlebot and traditional search crawlers allowed. AI search bots (OAI-SearchBot, PerplexityBot, Claude-SearchBot) are a separate decision: blocking them reduces your presence in AI answer engines but does not affect traditional search rankings.

The "Search + No Training" preset in this tool applies exactly this strategy and is the recommended starting point for most publishers.

What is the difference between Googlebot and Google-Extended?

Googlebot is Google's primary web crawling user-agent. It indexes content for Google Search. Blocking Googlebot removes your pages from Google Search results — almost never the right choice for sites that want search visibility.

Google-Extended is a separate robots.txt control token — not a crawling user-agent in the traditional HTTP sense. It lets publishers opt out of certain Google AI training and model grounding uses without affecting Googlebot or Google Search. Blocking Google-Extended does not affect your Google Search ranking, visibility, or appearance in AI Overviews. The two tokens are completely independent and must be addressed separately.

GPTBot vs OAI-SearchBot vs ChatGPT-User

| Token | Category | Purpose | Effect of blocking |
|---|---|---|---|
| GPTBot | AI Training | Collects data for OpenAI model training | Opts out of OpenAI model training — no effect on ChatGPT search |
| OAI-SearchBot | AI Search | Indexes content for ChatGPT search and answer features | Reduces/removes content from ChatGPT search answers — no effect on training |
| ChatGPT-User | User-Triggered Fetch | Fetches URLs when a human user requests them in ChatGPT | May prevent ChatGPT from fetching URLs in user sessions — advisory only |

robots.txt for AI crawler control: what it can and cannot do

  • Good for declaring your crawling preferences to well-behaved bots
  • Effective at opting out of training and archiving by compliant crawlers
  • Not a security control — cannot prevent non-compliant bots or scrapers
  • Not reliable for private content — any bot can read and ignore the file
  • Inconsistently enforced for user-triggered fetchers (ChatGPT-User, etc.)
  • WAF, Cloudflare AI Crawl Control, or server rules are needed for stronger enforcement

AI crawler cheat sheet

| Bot / Token | Category | Main purpose | Reason to allow | Reason to block | Consequence of blocking |
|---|---|---|---|---|---|
| Googlebot | Search Engine | Google Search indexing | Essential for search visibility | Almost never — staging environments only | Content removed from Google Search |
| Google-Extended | AI Training | Google AI training opt-out token | Allow Google AI grounding uses | Opt out of Google AI training | Opted out of certain Google AI training — no SEO impact |
| GPTBot | AI Training | OpenAI model training | Consent to OpenAI training data use | Opt out of OpenAI model training | No more training use — ChatGPT search unaffected |
| OAI-SearchBot | AI Search | ChatGPT search indexing | Appear in ChatGPT search answers | Block ChatGPT search presence | Reduced visibility in ChatGPT answers |
| ClaudeBot | AI Training | Anthropic model training | Consent to Anthropic training data use | Opt out of Anthropic training | No more training use — Claude search unaffected |
| Claude-SearchBot | AI Search | Claude answer engine indexing | Appear in Claude's search answers | Block Claude search presence | Reduced visibility in Claude answers |
| Claude-User | User-Triggered | User-initiated live URL fetch in Claude | Allow per-user content retrieval | Limit live retrieval in Claude sessions | Advisory — WAF needed for reliable enforcement |
| PerplexityBot | AI Search | Perplexity search indexing | Appear in Perplexity answers | Block Perplexity presence | Reduced visibility in Perplexity results |
| Perplexity-User | User-Triggered | User-initiated fetch in Perplexity | Allow per-user retrieval | Limit live retrieval in Perplexity | Advisory — WAF needed for reliable enforcement |
| CCBot | Archive/Dataset | Common Crawl dataset building | Inclusion in academic/research archives | Opt out of Common Crawl and downstream AI training | Excluded from future Common Crawl snapshots |

Deployment guidance

Where robots.txt must live: the file must be at the root of your domain — https://example.com/robots.txt. Each subdomain requires its own robots.txt file. A robots.txt at example.com does not apply to docs.example.com.

Cloudflare AI Crawl Control: available in the Cloudflare dashboard under Security → Bots → AI Scrapers & Crawlers. This provides verified-bot enforcement at the edge — stronger and more reliable than robots.txt alone, and especially useful for user-triggered fetchers that may not check robots.txt. Cloudflare can also track and log robots.txt violations from listed crawlers.

# Minimal robots.txt with sitemap (copy template)
# Generated by CodeAva AI Crawler Rules Builder

# AI Training — block training, keep search
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers (including Googlebot) remain allowed

Sitemap: https://example.com/sitemap.xml