Technical GEO · 12 min read

robots.txt for AI Crawlers: The Complete Guide (+ Ready-to-Paste File)

Every AI crawler user-agent, what each one does, and the exact robots.txt file you should deploy. Covers GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and 12 more.

By Frederik Smits · Online Marketing Expert

Your robots.txt file is one of the simplest pieces of configuration on your website, and one of the most consequential for AI visibility. Get it right and every major AI assistant can crawl, cite, and recommend you. Get it wrong — or leave it out entirely — and you're invisible to the fastest-growing discovery channel on the internet.

This guide covers every AI crawler user-agent that currently exists, what each one does, whether you should allow it, and the exact robots.txt file you can deploy today.

TL;DR: If your goal is maximum AI visibility (most businesses), allow every AI crawler by default and disallow only sensitive paths (admin, checkout, private dashboards). The ready-to-paste file is below.

What robots.txt does (and doesn't do)

robots.txt is a plain text file at yourdomain.com/robots.txt that tells crawlers — by user-agent string — which paths they may or may not access. It works on the honor system: well-behaved crawlers read it and comply; malicious scrapers ignore it. Every major AI company (OpenAI, Anthropic, Google, Microsoft, Perplexity, Apple, Meta, Mistral) has publicly committed to respecting robots.txt for their AI training and retrieval bots.

robots.txt controls crawling, not indexing. A page blocked in robots.txt can still appear in search results if another page links to it — the crawler just won't fetch and parse the content. To fully exclude a page from AI citation, use a noindex meta tag or X-Robots-Tag header (which crawlers can only read if the page isn't also blocked from crawling), or put the content behind authentication.
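
For reference, the indexing signal itself is not part of robots.txt; it lives on the page or in the HTTP response. A minimal sketch (the page it applies to is a placeholder):

<!-- in the page's <head> -->
<meta name="robots" content="noindex">

# or, equivalently, as an HTTP response header
X-Robots-Tag: noindex

Crawlers can only obey these if they're allowed to fetch the page, which is why a noindex directive and a robots.txt Disallow on the same path work against each other.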

There are two categories of AI crawler that matter:

  1. Training bots — crawl the web to build training datasets for future AI models. Blocking these doesn't affect real-time AI responses about your business, but over time it removes your content from the data AI models are trained on.
  2. Retrieval bots — fetch your pages in real time when a user asks an AI assistant a question that triggers web search. Blocking these means AI assistants cannot surface or cite your content in their answers. For a business trying to appear in AI search results, this is the more impactful category to get right.

A few numbers for context:

  - 21+ distinct AI crawler user-agents are currently in the wild
  - 73% of high-traffic sites have no AI crawler rules at all
  - 15-40% of websites actively block at least one AI crawler
  - 0 paid options exist to boost AI crawler priority

The complete AI crawler user-agent list

Here is every AI-related user-agent you'll encounter, grouped by parent company, with what each bot actually does and whether we recommend allowing it for a business that wants maximum AI visibility.

OpenAI (ChatGPT)

🧠 GPTBot
Crawls for training data used in future GPT model versions. Allowing it means your content may help shape future model knowledge about your brand.
🔍 OAI-SearchBot
Real-time retrieval for ChatGPT Search. Blocking this means ChatGPT Search cannot cite your pages. Allow.
👤 ChatGPT-User
Fetches pages when a ChatGPT user clicks "browse the web" or pastes a URL. Represents actual user activity. Allow.

Anthropic (Claude)

🧠 ClaudeBot
Training crawler for Claude model development. Allowing it ensures future Claude versions know about your brand.
🌐 Claude-Web
Real-time retrieval when Claude fetches pages during conversations. Allow for AI visibility.
🕸️ anthropic-ai
Legacy, general-purpose Anthropic crawler. Allow unless you have a specific reason to restrict it.

Google (AI Overviews, Gemini)

🔬 Google-Extended
A control token rather than a separate crawler: it governs whether Google may use your content for training and grounding Gemini (formerly Bard). Blocking it pulls your content out of Gemini; AI Overviews eligibility follows standard Googlebot indexing rather than Google-Extended.
🤖 GoogleOther
Secondary Google crawler for product-specific research. Allow.
🕷️ Googlebot
Standard search crawler. Not technically an AI bot, but required for general Google indexing, which is the prerequisite for AI Overviews inclusion.

Perplexity

🔍 PerplexityBot
Primary crawler for Perplexity's search engine. Blocking this means Perplexity cannot cite you. Perplexity is the highest-citation-intent AI platform — allow.
👤 Perplexity-User
Real-time, user-triggered fetches. Allow.

Microsoft (Bing Copilot)

🌐 Bingbot
Standard Bing crawler. Powers Bing Copilot answers. Allow — Bing Copilot has growing share, and ChatGPT Search uses Bing's index under the hood.

Apple (Apple Intelligence)

🍎 Applebot
Standard Apple crawler (Siri suggestions, Spotlight, etc.). Usually already allowed.
🧠 Applebot-Extended
Controls Apple Intelligence training use. Apple's AI efforts are growing; allow for future-proofing.

Meta (Meta AI)

💬 FacebookBot
Meta's long-standing crawler, used to improve Meta's language and speech models; link-sharing previews on Facebook and Instagram come from the separate facebookexternalhit agent. Allow.
🧠 Meta-ExternalAgent
Meta AI's training and retrieval crawler. Allow for Meta AI visibility across Facebook, Instagram, and WhatsApp surfaces.

Others (Mistral, DuckDuckGo, Common Crawl)

🇫🇷 MistralAI-User
User-triggered retrieval for Mistral's AI products. Small share but growing, especially in the EU.
🦆 DuckAssistBot
DuckDuckGo's AI assistant crawler. Allow.
🌍 CCBot
Common Crawl — the open web dataset that many LLMs are trained on, including GPT and Llama variants. Blocking CCBot removes you from a large share of LLM training data. Allow.

The ready-to-paste robots.txt

Here's a complete robots.txt file you can drop in at yourdomain.com/robots.txt today. It allows every major AI crawler, disallows common sensitive paths, and references your sitemap.

# robots.txt — AI-friendly, GEO-optimized
# Last reviewed: [DATE]

# Default rules for unnamed crawlers
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /auth/
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/

# OpenAI
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /

# Google
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Microsoft / Bing Copilot
User-agent: Bingbot
Allow: /

# Apple Intelligence
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

# Meta AI
User-agent: FacebookBot
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

# DuckDuckGo
User-agent: DuckAssistBot
Allow: /

# Mistral
User-agent: MistralAI-User
Allow: /

# Common Crawl (feeds many LLM training datasets)
User-agent: CCBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Replace yourdomain.com with your actual domain and update the disallow list to match your site structure. Don't leave the placeholder in.

Next.js apps: use robots.ts instead

If you're on Next.js (App Router), the idiomatic approach is to create app/robots.ts and let the framework generate the file at build time. Here's the equivalent:

import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/api/', '/admin/', '/dashboard/', '/auth/'],
      },
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'ChatGPT-User', allow: '/' },
      { userAgent: 'OAI-SearchBot', allow: '/' },
      { userAgent: 'ClaudeBot', allow: '/' },
      { userAgent: 'Claude-Web', allow: '/' },
      { userAgent: 'anthropic-ai', allow: '/' },
      { userAgent: 'Google-Extended', allow: '/' },
      { userAgent: 'GoogleOther', allow: '/' },
      { userAgent: 'PerplexityBot', allow: '/' },
      { userAgent: 'Perplexity-User', allow: '/' },
      { userAgent: 'Bingbot', allow: '/' },
      { userAgent: 'Applebot-Extended', allow: '/' },
      { userAgent: 'Meta-ExternalAgent', allow: '/' },
      { userAgent: 'CCBot', allow: '/' },
      { userAgent: 'MistralAI-User', allow: '/' },
      { userAgent: 'DuckAssistBot', allow: '/' },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
    host: 'https://yourdomain.com',
  };
}

When you should block AI crawlers

The default recommendation is “allow everything.” But there are legitimate reasons to block specific bots or paths.

Block training bots if your content is paid or proprietary

If you publish paywalled content, premium research, or commercial courses, you may want to block training crawlers (GPTBot, ClaudeBot, CCBot, Applebot-Extended) while allowing retrieval bots that cite sources (OAI-SearchBot, Perplexity-User, ChatGPT-User, Claude-Web). This lets AI assistants point users to your paywall rather than regurgitate your content for free.
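
As a sketch, that split looks like this in robots.txt terms (adjust the bot list to match your own policy on which crawlers count as training vs. retrieval):

# Training crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Retrieval bots that cite sources: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Perplexity-User
Allow: /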

Block everything if you're legally required to

Healthcare providers with HIPAA-sensitive patient-facing pages, financial services with compliance-restricted content, or sites under NDA/private-beta constraints may have contractual or regulatory reasons to block. In those cases, block crawling (robots.txt) and indexing (noindex headers) for the affected paths, and put genuinely sensitive content behind authentication; robots.txt is a request, not an access control.

Block specific paths, not whole bots

In most cases you don't need to block a bot entirely — just keep it out of the paths you don't want in training data or AI summaries. Use the same Disallow: directives you'd use for normal SEO (checkout, account pages, search result pages).

Blocking the whole bot

User-agent: GPTBot
Disallow: /

Blocking only the paths you need to

User-agent: GPTBot
Allow: /
Disallow: /premium-research/
Disallow: /paid-courses/

Common mistakes

Blocking Google-Extended but not Googlebot

Many sites try to block AI training while keeping classical search. Good instinct, easy to get wrong. Blocking Google-Extended pulls your content out of Gemini training and grounding, one of the fastest-growing AI surfaces Google has, while blocking Googlebot would drop you from Search and AI Overviews entirely. For most businesses, blocking either is net negative.

Using robots.txt to hide broken or low-quality pages

robots.txt is not the right tool for “I don't want people to see this.” Use noindex meta tags or authentication. Blocked pages can still appear in search results via link discovery — they just show without snippets, which looks worse than being absent.

Forgetting to reference sitemap.xml

The Sitemap: directive tells crawlers where to find your complete URL inventory. Without it, crawlers rely on link discovery, which can miss deep pages.

Not having a robots.txt at all

No robots.txt means crawlers fall back to their defaults. Most AI crawlers default to “allow everything”, but a few are more conservative, and an explicit file removes that ambiguity. It also demonstrates that the site is actively maintained.

Case-sensitivity errors

User-agent names are case-insensitive in practice, but paths are case-sensitive. Disallow: /Admin/ won't block /admin/. Match the exact case of your URL structure.

Is your robots.txt actually allowing AI crawlers?

LynxAudit checks your robots.txt, HTTP headers, and AI crawler access status in seconds — and flags every bot that can't reach you.

Run Free Audit

How to test that it's working

After deploying, verify three things:

  1. The file loads. Visit yourdomain.com/robots.txt and confirm you see the expected text. If you get a 404 or the wrong content, the file isn't deployed correctly.
  2. The syntax is valid. Use the robots.txt report in Google Search Console or a free tool like technicalseo.com's robots.txt validator. Syntax errors can silently disable entire sections.
  3. AI crawlers can actually fetch your pages. Test by User-Agent. From a terminal: curl -A "GPTBot" https://yourdomain.com/sample-page. You should get a 200 response with real HTML, not a 403 or a login wall. A batch version of this check is sketched just below.
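
A quick way to run that check for several bots at once, assuming a Unix shell with curl available (swap in a real page on your site):

for bot in GPTBot ClaudeBot PerplexityBot OAI-SearchBot Bingbot; do
  echo -n "$bot: "
  curl -s -o /dev/null -w "%{http_code}\n" -A "$bot" https://yourdomain.com/sample-page
done

Anything other than 200 here usually points to a firewall, CDN, or bot-management rule rather than robots.txt itself.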

What about llms.txt?

llms.txt is a proposed standard for a different purpose: giving AI systems a concise, curated overview of your site's structure and key resources. It's complementary to robots.txt, not a replacement. If you want to go beyond crawler permissions and actively help AI assistants understand your site, publish both.

A typical llms.txt file at yourdomain.com/llms.txt lists your pillar pages, main product/service descriptions, key documentation, and other high-value content with short summaries, in a format LLMs can parse efficiently. Adoption is early but growing — it's an easy low-effort win if you already have good site structure.
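
A minimal sketch of what such a file can look like, following the proposed llms.txt convention of a markdown title, a one-line summary, and annotated link lists; the business, pages, and descriptions below are placeholders:

# Example Co
> Example Co builds scheduling software for dental clinics. The links below are the pages that best explain what we do.

## Products
- [Scheduling platform](https://yourdomain.com/scheduling): Features, pricing tiers, and integrations
- [Patient reminders](https://yourdomain.com/reminders): How automated reminders work

## Docs
- [Getting started](https://yourdomain.com/docs/getting-started): Setup guide for new accounts

## Company
- [About](https://yourdomain.com/about): Team, history, and contact details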

Frequently asked questions

Does robots.txt block AI from knowing about me at all?

No. Even if you block every AI crawler, models already trained on older data still “know” about you to the extent they learned about you before you blocked. And third parties citing you (on Reddit, Wikipedia, directories) still feed signals into AI systems indirectly. robots.txt primarily controls new crawling, not what models already contain.

Will AI crawlers respect my robots.txt if it says disallow?

The major named crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bingbot, Applebot-Extended, Meta-ExternalAgent) have publicly committed to compliance. Most independent verifications suggest compliance is broadly real. Small or unnamed bots may ignore robots.txt — but those aren't the ones moving the needle on AI visibility.

How do I monitor which bots are visiting?

Check your access logs. Every request includes a User-Agent header. Grep for GPTBot, ClaudeBot, etc. to see crawl frequency. If a bot you expect to see isn't showing up, check that your robots.txt allows it and that your firewall / CDN isn't rate-limiting or blocking it at the edge (Cloudflare has been known to block AI bots aggressively by default).
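
As a rough sketch, assuming an nginx-style access log at /var/log/nginx/access.log (adjust the path and the bot list for your stack):

# count hits per AI crawler in the current access log
grep -oE "GPTBot|ClaudeBot|PerplexityBot|OAI-SearchBot|Bingbot|CCBot" /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn

Zero hits for a bot you expect to see is a cue to check robots.txt, the firewall, and CDN bot settings, in that order.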

Does Cloudflare block AI crawlers?

Cloudflare rolled out a “block AI bots” toggle as a default-on feature for some account types. If you're behind Cloudflare and your AI visibility seems lower than expected, check the Bot Fight Mode and AI Scrapers settings in your dashboard. You may be blocking at the edge even though your robots.txt allows.

Should I allow CCBot?

For most businesses, yes. Common Crawl feeds Hugging Face's training datasets, the RefinedWeb corpus used by Falcon, and countless open-source models. Blocking CCBot removes you from a huge share of the open-source AI ecosystem. The only strong argument for blocking is if your content is copyrighted and you want to prevent training use broadly.

Can I request re-crawl after updating robots.txt?

Google and Bing offer re-crawl requests via Search Console and Webmaster Tools. For OpenAI, Anthropic, and Perplexity, there's currently no submission tool — their crawlers rediscover changes on their own schedule (usually within days to weeks).

Bottom line

robots.txt is the first thing AI crawlers fetch when they visit your site. Get this wrong and nothing else you do for AI visibility matters, because the bots never reach your content. Get it right — a clean, inclusive file that allows every major AI crawler and references your sitemap — and you've cleared the foundational technical bar. The rest of the GEO stack (schema, E-E-A-T, entity signals, content patterns) builds on top.

See how AI talks about your business

Run a free AI Visibility Audit. We check 100+ questions AI gets about your industry — and tell you if you are in the answers.

Run Free Audit