Create custom robots.txt files for your website. Control how search engines crawl your site with easy-to-use presets and custom rules.
Pattern quick reference:
- * matches all user-agents
- / matches the entire site
- /folder/ matches a specific folder
- *.pdf matches all PDF files
- $ matches the end of a URL

Upload the finished robots.txt to your website's root directory so it is reachable at yourdomain.com/robots.txt.

The robots.txt file is a simple text file placed in your website's root directory that communicates with web crawlers and bots. Following the Robots Exclusion Protocol (REP), first introduced in 1994, it tells search engines which pages they can and cannot access. While it's a powerful tool for managing crawler behavior, it's important to understand both its capabilities and limitations.
When a search engine bot visits your site, the first thing it does is look for yourdomain.com/robots.txt. Based on the instructions it finds, the bot decides which pages to crawl or skip. This happens before any actual crawling takes place, making robots.txt the gatekeeper of your website's crawlability.
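You can verify this behavior against your own rules before deploying them. The snippet below is a minimal sketch using Python's standard-library urllib.robotparser; the domain and paths are placeholders, not values from this guide.

```python
# Minimal sketch: ask the same question a crawler asks before fetching a page.
# Uses only Python's standard library; the domain and paths are placeholders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://yourdomain.com/robots.txt")
parser.read()  # download and parse the live robots.txt

print(parser.can_fetch("Googlebot", "https://yourdomain.com/blog/post"))  # allowed?
print(parser.can_fetch("Googlebot", "https://yourdomain.com/admin/"))     # blocked?
```

If can_fetch returns False for a URL you expect to rank, your rules are blocking it and should be adjusted before you publish the file.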
A well-configured robots.txt lets you:
- Specify which bots can crawl which parts of your site
- Direct crawlers to important pages and skip low-value content
- Keep admin areas, staging, and private sections away from crawlers
- Point crawlers to your XML sitemap for better indexing
The robots.txt file uses a simple syntax with specific directives. Understanding each directive is essential for proper configuration:
**User-agent**: Specifies which crawler the following rules apply to.
# Apply to all crawlers
User-agent: *
# Apply only to Googlebot
User-agent: Googlebot
# Apply only to Bing
User-agent: Bingbot
Common user agents: Googlebot, Bingbot, Slurp (Yahoo), DuckDuckBot, Baiduspider, Yandex, Facebot, Twitterbot
**Disallow**: Tells crawlers NOT to access specific paths.
# Block a specific page
Disallow: /private-page.html
# Block an entire directory
Disallow: /admin/
# Block all pages with query strings
Disallow: /*?
# Block everything (entire site)
Disallow: /
Note: An empty Disallow: means nothing is blocked.
**Allow**: Permits access to specific paths, overriding Disallow rules. Useful for exceptions.
# Block /private/ but allow one page
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
# Block all PDFs except one
Disallow: /*.pdf$
Allow: /docs/whitepaper.pdf
Note: Allow is supported by Google and Bing but not by all crawlers.
**Sitemap**: Points crawlers to your XML sitemap(s) for better discovery of pages.
# Single sitemap
Sitemap: https://example.com/sitemap.xml
# Multiple sitemaps
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-products.xml
Best practice: Always use absolute URLs with https://
| Directive | Purpose | Example |
|---|---|---|
| Crawl-delay | Sets seconds between requests (Bing, Yandex; ignored by Google) | Crawl-delay: 10 |
| * (wildcard) | Matches any sequence of characters in paths | Disallow: /category/*/page |
| $ (end match) | Pattern must match the end of the URL | Disallow: /*.pdf$ |
| # (comment) | Adds notes or explanations (ignored by bots) | # This blocks admin section |
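
# Example: a typical WordPress robots.txt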
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /trackback/
Disallow: /feed/
Disallow: /?s=
Disallow: /*?*
Sitemap: https://example.com/sitemap_index.xml
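
# Example: a typical e-commerce store robots.txt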
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /my-account
Disallow: /wishlist
Disallow: /orders/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /
Sitemap: https://example.com/sitemap.xml
Never use robots.txt as a security measure! The file is publicly accessible at yoursite.com/robots.txt. Blocking a path actually reveals that something exists there. For private content, use proper authentication or keep it off the server entirely.
| Method | What It Does | Best For |
|---|---|---|
| Robots.txt | Blocks crawling (bot never visits page) | Saving crawl budget, blocking entire sections |
| Meta Robots Tag | Blocks indexing (page crawled but not indexed) | Removing specific pages from search results |
| X-Robots-Tag Header | Same as meta robots but via HTTP header | PDFs, images, and non-HTML files |
| Canonical Tag | Indicates preferred URL version | Duplicate content, URL parameters |
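To make the distinction concrete, here is a minimal sketch (standard-library Python, with a placeholder port and file body) of serving a PDF with an X-Robots-Tag: noindex response header: the URL can still be crawled, but crawlers that honor the header keep it out of the index. In practice you would set this header in your web server or CMS configuration rather than in application code; the sketch only shows where the header lives.

```python
# Minimal sketch: send an X-Robots-Tag header with a non-HTML response
# so the file stays out of the search index even though it can be crawled.
# Standard library only; the port and PDF body are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexPdfHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        # HTTP-header equivalent of <meta name="robots" content="noindex">
        self.send_header("X-Robots-Tag", "noindex")
        self.end_headers()
        self.wfile.write(b"%PDF-1.4\n% placeholder body")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), NoIndexPdfHandler).serve_forever()
```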
Common mistakes to avoid:
- Disallow: / blocks everything (your entire site)!
- /admin and /admin/ behave differently: Disallow: /admin matches any path that starts with /admin (including /administrator and /admin-login), while Disallow: /admin/ matches only URLs inside the /admin/ directory.

With AI systems like ChatGPT and Claude, many website owners want to control whether AI companies can use their content for training:
| Company | User Agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Crawls for ChatGPT training data |
| OpenAI | ChatGPT-User | Real-time browsing for ChatGPT users |
| Google | Google-Extended | Crawls for Gemini training |
| Anthropic | anthropic-ai | Crawls for Claude training data |
| Common Crawl | CCBot | Open dataset used by many AI companies |
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
Use our full suite of tools to analyze and optimize your website: