Create Robots.txt File - Search Engine Crawler Configuration
Build a robots.txt file to control how search engines crawl and index your website. Block specific pages, set crawl delays, and add sitemap references.
What is a robots.txt File?
The robots.txt file is a plain text file placed in your website's root directory that tells search engine crawlers (like Googlebot) which pages or sections of your site they should or shouldn't crawl. It follows the Robots Exclusion Protocol, a standard that's been used since 1994 to manage crawler access to websites.
While robots.txt is essentially a request rather than an enforcement mechanism (malicious bots can ignore it), all major search engines respect these directives. It's an essential tool for managing crawl budget, preventing indexing of duplicate content, and protecting sensitive administrative areas.
Understanding robots.txt Directives
- User-agent: Specifies which crawler the following rules apply to. Use * for all crawlers.
- Disallow: Prevents the specified crawler from accessing the given path.
- Allow: Explicitly permits access to a path within a disallowed directory.
- Crawl-delay: Sets the number of seconds between crawler requests (not supported by Google).
- Sitemap: Points crawlers to your XML sitemap for comprehensive indexing.
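The directives above combine into a simple plain-text file. A minimal sketch (the paths and sitemap URL are placeholders, not recommendations):

```
# Rules for all crawlers
User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
Crawl-delay: 10

# Sitemap reference (can appear anywhere in the file)
Sitemap: https://example.com/sitemap.xml
```

Note that a blank line separates rule groups: each group starts with one or more User-agent lines followed by its Disallow/Allow rules, while Sitemap is a standalone directive that applies to the whole file.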
Common Pages to Block
- /admin/: Administrative panels and dashboards.
- /wp-admin/: WordPress admin area.
- /api/: API endpoints not meant for search indexing.
- /search/: Search results pages (duplicate content).
- /cart/, /checkout/: E-commerce transactional pages.
- /private/: User-specific or private content areas.
- /*.pdf: PDF files (if you don't want them indexed).
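Blocking the paths listed above for all crawlers might look like this (a sketch; adjust the paths to your site's actual URL structure):

```
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /private/
# $ anchors the end of the URL; supported by Google and Bing,
# but not part of the original Robots Exclusion Protocol
Disallow: /*.pdf$
```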
Search Engine Bot User Agents
- Googlebot: Google's main web crawler.
- Googlebot-Image: Google's image search crawler.
- Bingbot: Microsoft Bing's crawler.
- DuckDuckBot: DuckDuckGo's crawler.
- Yandex: Russian search engine crawler.
- Baiduspider: Chinese search engine Baidu's crawler.
- AhrefsBot/SemrushBot: SEO tool crawlers (often blocked).
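You can give each crawler its own rule group by naming it in a User-agent line. A common pattern is to leave search engines unrestricted while blocking SEO tool crawlers (an example setup, not a universal recommendation):

```
# Search engines: no restrictions (empty Disallow allows everything)
User-agent: Googlebot
User-agent: Bingbot
Disallow:

# SEO tool crawlers: block the entire site
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /
```

A crawler follows the most specific group that matches its name; it falls back to the `User-agent: *` group only if no named group matches.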
robots.txt vs. Noindex Meta Tags
These two tools serve different purposes. The robots.txt file prevents crawling, while the noindex meta tag prevents indexing. If you block a page in robots.txt, Google never fetches it and so never sees the noindex tag; the page can still appear in search results, typically as a bare URL with little or no description. For pages you want completely excluded from search, use noindex and don't block them in robots.txt.
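The noindex directive goes in the page itself rather than in robots.txt, either as a meta tag in the HTML head or as an HTTP response header:

```
<!-- In the page's <head>: exclude from all search engines' indexes -->
<meta name="robots" content="noindex">

<!-- Equivalent HTTP response header, useful for non-HTML files like PDFs:
     X-Robots-Tag: noindex -->
```

Either form works only if crawlers are allowed to fetch the page, which is exactly why the page must not also be disallowed in robots.txt.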
Best Practices for robots.txt
- Always test your robots.txt with the robots.txt report in Google Search Console before relying on it.
- Place the file in your root directory: example.com/robots.txt.
- Include your sitemap URL for faster indexing.
- Don't block CSS or JavaScript files needed for rendering.
- Be careful with wildcards (*); a broad pattern can block far more than intended.
- Remember: robots.txt doesn't hide content. The file itself is publicly readable, so never list secret or sensitive URLs in it.
Example robots.txt Configurations
For a typical WordPress site, you might block /wp-admin/ and /wp-includes/ while allowing /wp-admin/admin-ajax.php (needed for AJAX functionality). For e-commerce sites, blocking /cart/, /checkout/, and /account/ prevents unnecessary crawl budget waste on pages that don't need indexing.
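The WordPress setup described above could be sketched like this (the sitemap URL is a placeholder):

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
# Re-allow the AJAX endpoint inside the blocked directory,
# since themes and plugins call it from the front end
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml
```

The e-commerce variant follows the same pattern, substituting Disallow lines for /cart/, /checkout/, and /account/.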