AI Crawler Rules Builder & robots.txt Generator
Generate SEO-safe robots.txt rules for AI crawlers, search bots, and archive bots. Block training bots without accidentally blocking search visibility.
Choose a strategy
- Traditional search
- AI search visible
- Training blocked
- Archive blocked
- User agents blocked
Search Engine — blocking impacts SEO
- Googlebot: Google's primary web crawler for indexing content in Google Search.

AI Search / AI Answers — blocking impacts AI visibility
- OAI-SearchBot: OpenAI's crawler for indexing content for ChatGPT search features.
- Claude-SearchBot: Anthropic's crawler for indexing content for Claude's web search and answer features.
- PerplexityBot: Perplexity AI's crawler for indexing content to use in Perplexity answers and search.

AI Training — safe to block
- Google-Extended: a separate robots.txt control token for certain Google AI training and Bard/Gemini grounding uses, independent of Google Search.
- GPTBot: OpenAI's crawler for collecting training data for GPT models.
- ClaudeBot: Anthropic's crawler for collecting training and model grounding data for Claude.

User-Triggered Fetch — advisory only
- ChatGPT-User: user-triggered fetcher used when a ChatGPT user requests a live URL to be fetched.
- Claude-User: user-triggered fetcher used when a Claude user causes a live URL to be fetched in a session.
- Perplexity-User: user-triggered fetcher used when a Perplexity user causes a live URL fetch.

Archive / Dataset — safe to block
- CCBot: Common Crawl's open web crawler that builds public datasets used by many AI labs for training.
Advanced settings (paths, wildcard group, merge mode)
- Block scope
- Add section comments: adds explanatory comments to the generated robots.txt
- Add User-agent: * group: adds a wildcard group covering unlisted crawlers
- Existing robots.txt merge: paste your current robots.txt to append or replace
robots.txt
```
# AI Training
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Archive / Dataset
User-agent: CCBot
Disallow: /

# All other crawlers not listed above are allowed by default
```
Minimal AI section only (paste into existing robots.txt)
```
# AI Training
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Archive / Dataset
User-agent: CCBot
Disallow: /
```
This tool runs entirely in your browser. No live robots.txt files are fetched, no URLs are visited, and no configuration is sent to a server. All output is generated locally from the registry of documented crawlers.
Overview
A few years ago, managing a robots.txt file was straightforward: allow Googlebot, maybe allow Bingbot, and optionally block a handful of known scrapers. Today, the decision is far more nuanced. Search engines, AI answer engines, AI training pipelines, user-triggered AI fetchers, and open-web archive crawlers all have distinct user-agents — and blocking the wrong one can silently remove your content from search results, AI citations, or both.
The most common mistake is treating "block AI bots" as a single action. In practice, there are at least four separate categories of crawler with AI-related user-agent tokens: traditional search (Googlebot), AI search/answer indexing (OAI-SearchBot, PerplexityBot, Claude-SearchBot), AI model training (GPTBot, ClaudeBot, Google-Extended, CCBot), and user-triggered retrieval (ChatGPT-User, Claude-User, Perplexity-User). Each has different implications for SEO, discoverability, and content governance.
Beyond robots.txt, enforcement is not uniform. Well-behaved crawlers respect Disallow directives. Some crawlers — particularly user-triggered ones — may not check robots.txt consistently before fetching a live URL on behalf of a human user. For those cases, Cloudflare AI Crawl Control or WAF rules provide a stronger enforcement layer. This tool helps you build the right robots.txt policy as a starting point, then explains where to layer on additional enforcement if needed.
Use cases
When to use it
- Blocking AI training crawlers while preserving search visibility: use the 'Search + No Training' preset to block GPTBot, ClaudeBot, Google-Extended, and CCBot while keeping Googlebot and AI search bots active.
- Deciding whether AI answer engines should access your content: OAI-SearchBot, Claude-SearchBot, and PerplexityBot are the crawlers that feed AI search and answer experiences. The tool lets you allow or block each independently.
- Understanding the difference between Googlebot and Google-Extended: Googlebot controls Google Search indexing. Google-Extended is a separate robots.txt token for certain AI training and grounding uses. They are completely independent — blocking Google-Extended does not affect Google Search.
- Generating clean robots.txt for deployment: copy or download the generated robots.txt. Use the minimal AI section to append just the AI rules to an existing file without replacing your current configuration.
- Understanding Cloudflare AI Crawl Control for stronger enforcement: the Deployment tab generates Cloudflare guidance and optional Nginx/Apache snippets for cases where robots.txt alone may not be sufficient.
- Configuring path-level blocking for mixed-content sites: use the custom path option to block specific folders (e.g. /private-docs/) rather than the whole site, which is useful for documentation sites and media platforms.
When it's not enough
- Using this to block private or sensitive content: robots.txt is a declaration to well-behaved crawlers. It cannot prevent non-compliant bots, scrapers, or direct URL requests. Do not use it as a security measure for private content.
- Blocking Googlebot when the goal was only AI training opt-out: blocking Googlebot removes your content from Google Search. If your goal is to opt out of Google AI training only, block Google-Extended instead. They are completely separate tokens.
- Expecting user-triggered bots to reliably respect robots.txt: ChatGPT-User, Claude-User, and Perplexity-User fire in response to human requests and may not check robots.txt before fetching. For these cases, WAF or Cloudflare enforcement is more reliable.
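Where robots.txt is only advisory, a server-level rule is one fallback. The following is a minimal Nginx sketch, not a vetted production config: the `map` block belongs in the `http` context, and the user-agent tokens should be checked against each vendor's current documentation before use.

```nginx
# Flag user-triggered AI fetchers by User-Agent substring (case-insensitive).
map $http_user_agent $ai_user_fetcher {
    default            0;
    ~*ChatGPT-User     1;
    ~*Claude-User      1;
    ~*Perplexity-User  1;
}

server {
    # ... existing server configuration ...

    # Deny flagged fetchers outright; legitimate crawlers are unaffected.
    if ($ai_user_fetcher) {
        return 403;
    }
}
```

User-Agent strings are trivially spoofable, so this blocks honest fetchers only; IP- or verified-bot-based enforcement is stricter.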
How to use it
1. Choose a preset or custom strategy: 'Search + No Training' is the recommended starting point for most publishers. It keeps traditional SEO and AI answer visibility intact while opting out of training and archive collection.
2. Review which crawlers are allowed and blocked: expand any bot row to see its purpose, operator, and the effect of blocking it. Toggle individual bots to fine-tune the policy beyond the preset defaults.
3. Check warnings about SEO and discoverability consequences: the Warnings tab surfaces critical issues (like Googlebot being accidentally blocked) alongside informational notes about advisory-only robots.txt enforcement.
4. Add your sitemap URL in the Advanced Settings panel: the Sitemap: line in robots.txt helps all crawlers discover your full content.
5. Copy or download the robots.txt: publish the file at https://yourdomain.com/robots.txt. Note that subdomains need their own robots.txt files.
6. Add Cloudflare AI Crawl Control for stronger enforcement: for user-triggered bots and non-compliant crawlers, Cloudflare AI Crawl Control provides verified-bot blocking at the edge — stronger and more reliable than robots.txt alone.
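Before publishing, you can sanity-check a generated file with Python's standard-library robots.txt parser. The policy and URL below are illustrative; note that `urllib.robotparser` implements the original exclusion protocol, so real crawlers may differ in edge-case precedence handling.

```python
from urllib.robotparser import RobotFileParser

# A generated policy: block GPTBot for training, keep Googlebot for search.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = RobotFileParser()
rp.parse(policy.splitlines())

# The training crawler is denied; the search crawler is allowed.
print(rp.can_fetch("GPTBot", "https://example.com/page"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True
```

Crawlers with no matching group fall through to the default (allowed) when no `User-agent: *` group exists, which matches the "all other crawlers remain allowed" behavior of the generated file.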
Common errors and fixes
Blocking Googlebot when the goal was to block AI training
Googlebot controls Google Search indexing. If you want to opt out of Google AI training, block Google-Extended instead. These are completely separate tokens, and blocking Google-Extended does not affect your Google Search presence.
Assuming Google-Extended controls AI Overviews or Gemini answers
Google-Extended controls certain AI training and model grounding uses but does not currently control whether your content appears in AI Overviews or Gemini search results. Its scope is specific to training pipelines as documented by Google.
Thinking GPTBot and OAI-SearchBot do the same thing
GPTBot is a training crawler — blocking it opts your content out of OpenAI model training. OAI-SearchBot is a search/discoverability crawler — blocking it reduces your content's presence in ChatGPT search answers. They are independent and serve different purposes.
Blocking all AI bots with User-agent: * Disallow: / and then wondering why SEO disappeared
A wildcard Disallow: / blocks all crawlers not otherwise explicitly listed, including Googlebot. Always add an explicit User-agent: Googlebot / Allow: / section before deploying a wildcard block if you need search visibility.
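A safe wildcard pattern looks like the sketch below. It assumes you only rely on Googlebot and Bingbot; keep every crawler you depend on in its own explicit group.

```
# Named groups take precedence over the wildcard for matching crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Everything not named above is blocked
User-agent: *
Disallow: /
```

Under the Robots Exclusion Protocol, a crawler follows the most specific group matching its token, so the explicit `Allow: /` groups shield those bots from the wildcard block regardless of ordering.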
Relying on robots.txt alone to block user-triggered fetchers
ChatGPT-User, Claude-User, and Perplexity-User fire in response to human user actions and may not consistently check robots.txt before fetching. For reliable enforcement, use Cloudflare AI Crawl Control, WAF custom rules, or server-side user-agent matching.
Forgetting the Sitemap: line
Adding Sitemap: https://example.com/sitemap.xml to robots.txt helps all crawlers discover your full content index. This is a low-effort improvement that benefits both traditional search and AI search crawlers.
Treating robots.txt as a privacy control
robots.txt is a public declaration readable by anyone. Non-compliant bots can ignore it entirely. For private content, use authentication, access controls, and server-level restrictions.
Frequently asked questions
How do I block AI training bots without hurting my SEO?
The key is treating AI training crawlers and search crawlers as completely separate. The safe approach for most publishers is to explicitly block training-specific tokens — GPTBot, ClaudeBot, Google-Extended, and CCBot — while keeping Googlebot and traditional search crawlers allowed. AI search bots (OAI-SearchBot, PerplexityBot, Claude-SearchBot) are a separate decision: blocking them reduces your presence in AI answer engines but does not affect traditional search rankings.
The "Search + No Training" preset in this tool applies exactly this strategy and is the recommended starting point for most publishers.
What is the difference between Googlebot and Google-Extended?
Googlebot is Google's primary web crawling user-agent. It indexes content for Google Search. Blocking Googlebot removes your pages from Google Search results — almost never the right choice for sites that want search visibility.
Google-Extended is a separate robots.txt control token — not a crawling user-agent in the traditional HTTP sense. It lets publishers opt out of certain Google AI training and model grounding uses without affecting Googlebot or Google Search. Blocking Google-Extended does not affect your Google Search ranking, visibility, or appearance in AI Overviews. The two tokens are completely independent and must be addressed separately.
GPTBot vs OAI-SearchBot vs ChatGPT-User
| Token | Category | Purpose | Effect of blocking |
|---|---|---|---|
| GPTBot | AI Training | Collects data for OpenAI model training | Opts out of OpenAI model training — no effect on ChatGPT search |
| OAI-SearchBot | AI Search | Indexes content for ChatGPT search and answer features | Reduces/removes content from ChatGPT search answers — no effect on training |
| ChatGPT-User | User-Triggered Fetch | Fetches URLs when a human user requests them in ChatGPT | May prevent ChatGPT from fetching URLs in user sessions — advisory only |
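The distinctions in the table above map to a simple robots.txt pattern. This sketch opts out of OpenAI training while keeping ChatGPT search visibility:

```
# Opt out of OpenAI model training
User-agent: GPTBot
Disallow: /

# Stay indexed for ChatGPT search answers
User-agent: OAI-SearchBot
Allow: /

# Advisory only; pair with WAF rules if enforcement matters
User-agent: ChatGPT-User
Disallow: /
```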
robots.txt for AI crawler control: what it can and cannot do
- ✓ Good for declaring your crawling preferences to well-behaved bots
- ✓ Effective at opting out of training and archiving by compliant crawlers
- ✗ Not a security control — cannot prevent non-compliant bots or scrapers
- ✗ Not reliable for private content — any bot can read and ignore the file
- ~ Inconsistently enforced for user-triggered fetchers (ChatGPT-User, etc.)
- ~ WAF, Cloudflare AI Crawl Control, or server rules needed for stronger enforcement
AI crawler cheat sheet
| Bot / Token | Category | Main purpose | Reason to allow | Reason to block | Consequence of blocking |
|---|---|---|---|---|---|
| Googlebot | Search Engine | Google Search indexing | Essential for search visibility | Almost never — staging environments only | Content removed from Google Search |
| Google-Extended | AI Training | Google AI training opt-out token | Allow Google AI grounding uses | Opt out of Google AI training | Opted out of certain Google AI training — no SEO impact |
| GPTBot | AI Training | OpenAI model training | Consent to OpenAI training data use | Opt out of OpenAI model training | No more training use — ChatGPT search unaffected |
| OAI-SearchBot | AI Search | ChatGPT search indexing | Appear in ChatGPT search answers | Block ChatGPT search presence | Reduced visibility in ChatGPT answers |
| ClaudeBot | AI Training | Anthropic model training | Consent to Anthropic training data use | Opt out of Anthropic training | No more training use — Claude search unaffected |
| Claude-SearchBot | AI Search | Claude answer engine indexing | Appear in Claude's search answers | Block Claude search presence | Reduced visibility in Claude answers |
| Claude-User | User-Triggered | User-initiated live URL fetch in Claude | Allow per-user content retrieval | Limit live retrieval in Claude sessions | Advisory — WAF needed for reliable enforcement |
| PerplexityBot | AI Search | Perplexity search indexing | Appear in Perplexity answers | Block Perplexity presence | Reduced visibility in Perplexity results |
| Perplexity-User | User-Triggered | User-initiated fetch in Perplexity | Allow per-user retrieval | Limit live retrieval in Perplexity | Advisory — WAF needed for reliable enforcement |
| CCBot | Archive/Dataset | Common Crawl dataset building | Inclusion in academic/research archives | Opt out of Common Crawl and downstream AI training | Excluded from future Common Crawl snapshots |
Deployment guidance
Where robots.txt must live: the file must be at the root of your domain — https://example.com/robots.txt. Each subdomain requires its own robots.txt file. A robots.txt at example.com does not apply to docs.example.com.
Cloudflare AI Crawl Control: available in the Cloudflare dashboard under Security → Bots → AI Scrapers & Crawlers. This provides verified-bot enforcement at the edge — stronger and more reliable than robots.txt alone, and especially useful for user-triggered fetchers that may not check robots.txt. Cloudflare can also track and log robots.txt violations from listed crawlers.
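As a rough illustration, a WAF custom rule with a Block action might use an expression like the one below. This is a sketch in Cloudflare's Rules language; verify the tokens and current dashboard options against Cloudflare's documentation before deploying.

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "ChatGPT-User")
or (http.user_agent contains "Claude-User")
or (http.user_agent contains "Perplexity-User")
```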
Minimal robots.txt with sitemap (copy template):

```
# Generated by CodeAva AI Crawler Rules Builder
# AI Training — block training, keep search
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers (including Googlebot) remain allowed
Sitemap: https://example.com/sitemap.xml
```