1. What Is a Robots.txt File?
A robots.txt file is a plain text file placed at the root directory of a website that gives instructions to search engine crawlers — also called bots or spiders — about which pages or sections of a site they are allowed or not allowed to visit.
The file follows the Robots Exclusion Protocol (REP), a convention dating to 1994 (and standardised as RFC 9309 in 2022) that is now honoured by virtually every major search engine, including Google, Bing, and Yahoo.
A basic robots.txt file looks like this:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
This example tells all search engine crawlers:
- Do not crawl anything under `/admin/`
- Do not crawl anything under `/private/`
- Everything else on the site is fair game
- The sitemap is located at the specified URL
The file must always be named exactly robots.txt (lowercase) and must live at the very root of your website — for example, https://www.example.com/robots.txt.
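If you want to sanity-check rules like these programmatically, Python's standard library ships a robots.txt parser. A minimal sketch using the example rules above (the example.com URLs are illustrative):

```python
from urllib import robotparser

# The example rules from above, fed straight to Python's built-in parser
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /private/",
    "Allow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://www.example.com/admin/settings"))   # False: blocked
print(rp.can_fetch("*", "https://www.example.com/products/shoes"))   # True: allowed
```

One caveat: Python's parser applies rules in file order (first match wins), while Googlebot uses the most specific match; for a simple file like this the result is the same.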
2. Why Robots.txt Matters for SEO
Understanding the purpose of robots.txt helps you use it correctly — and avoid the serious mistakes that can accidentally tank your search visibility.
What Robots.txt Is For
- Managing crawl budget: Search engines have a finite amount of time and resources they allocate to crawling your site (your “crawl budget”). By blocking irrelevant pages — thin content, pagination, search result pages, admin interfaces — you concentrate Googlebot’s attention on pages that actually matter for rankings.
- Protecting private areas: Preventing crawlers from accessing login pages, internal dashboards, staging environments, or duplicate content sections.
- Controlling which media files appear in search: Blocking images, PDFs, or video files that you do not want appearing in image or file search results.
- Managing faceted navigation: E-commerce and directory sites with thousands of URL parameter combinations (filter pages, sorting variations) use robots.txt to prevent duplicate content from consuming crawl budget.
What Robots.txt Is NOT For
This is where many website owners go wrong. Robots.txt is frequently misused for purposes it was never designed to serve:
- Robots.txt does NOT guarantee a page won’t be indexed. Google can still index a blocked URL if other websites link to it — it simply cannot read the page’s content. A disallowed page can still appear in Google’s index as a URL without a description.
- Robots.txt is NOT a privacy or security tool. The file is publicly accessible to anyone — including bad actors. Never list sensitive directory paths in robots.txt expecting them to stay private.
- Robots.txt does NOT remove pages from Google’s index. If you need a page removed from search results, use the `noindex` meta tag or the URL Removal Tool in Google Search Console.
3. How Google Discovers and Uses Robots.txt
Understanding how Google actually processes robots.txt prevents confusion about what happens after you create or update your file.
Automatic Discovery
Google does not require you to tell it where your robots.txt file is. Googlebot automatically checks for a robots.txt file at the root of every domain it crawls. When visiting https://example.com/any-page, it first fetches https://example.com/robots.txt to check the rules before crawling any further.
Caching Cycle
Google caches your robots.txt file and generally refreshes that cache every 24 hours during normal crawling. This means changes you make today may not be reflected in Google’s behaviour for up to one full day, unless you request a faster recrawl through Search Console (covered in Step 6).
What Happens When There Is No Robots.txt
If Google cannot find a robots.txt file at your domain root — receiving a 404 Not Found response — it interprets this as permission to crawl the entire site without restrictions. This is perfectly acceptable behaviour; you do not need a robots.txt file if you have no pages to block.
If your server returns a 5xx server error when Google tries to fetch robots.txt, Google will temporarily treat this as a “temporarily blocked” signal and may pause crawling of the site until the file becomes accessible again.
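The fetch behaviour described above can be summarised in a few lines. This is a simplified sketch of the documented rules, not Googlebot's actual implementation, and the function name is our own:

```python
def robots_fetch_policy(status: int) -> str:
    """Map a robots.txt HTTP fetch status to crawl behaviour,
    per the rules described above (simplified sketch)."""
    if 200 <= status < 300:
        return "parse-and-obey"    # file found: apply its rules
    if status == 404:
        return "crawl-everything"  # no robots.txt means no restrictions
    if 500 <= status < 600:
        return "pause-crawling"    # server error: site treated as temporarily blocked
    return "retry-later"           # other outcomes are outside this sketch

print(robots_fetch_policy(404))  # crawl-everything
print(robots_fetch_policy(503))  # pause-crawling
```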
4. Important: Do You Actually Need to “Submit” Robots.txt?
This is the question most guides fail to answer clearly upfront: you do not submit robots.txt to Google the same way you submit a sitemap.
Here is the key distinction:
| | Sitemap | Robots.txt |
|---|---|---|
| Requires manual submission to Google | ✅ Yes | ❌ No |
| Google finds it automatically | ⚠️ Yes, but submission speeds up discovery | ✅ Yes, always |
| Submission method | Search Console → Sitemaps | Not applicable |
| Cache refresh available | ❌ Not directly | ✅ Via robots.txt report in Search Console |
Once you have uploaded and tested your robots.txt file, Google’s crawlers will automatically find it and start using it. You don’t have to do anything else.
What you can do in Google Search Console is:
- Monitor whether Google has successfully found and parsed your robots.txt file
- View errors or warnings in the file as Google sees it
- Request a faster recrawl when you’ve made important changes and don’t want to wait up to 24 hours for Google’s cache to update automatically
The steps below cover the full process — creating the file correctly, uploading it, verifying it in Search Console, and using the robots.txt report to monitor and refresh it.
5. Step 1 — Create Your Robots.txt File
Using a Text Editor (Recommended for Manual Sites)
The most reliable way to create a robots.txt file is with a plain text editor:
- Windows: Notepad (not Word or WordPad — these add hidden formatting characters)
- macOS: TextEdit (make sure you switch to plain text mode first: Format → Make Plain Text)
- Linux/Server: nano, vim, or any terminal text editor
Step-by-step:
- Open your text editor
- Type your robots.txt rules (see Step 2 for syntax)
- Go to File → Save As
- Name the file exactly `robots.txt` (lowercase, no other extension)
- Set encoding to UTF-8 if prompted (critical: do not use ANSI or UTF-16)
- Save the file
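If you generate the file from a script rather than a text editor, the same requirements apply: plain text, UTF-8, one directive per line. A minimal sketch (the rules shown are illustrative):

```python
from pathlib import Path

rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Sitemap: https://www.example.com/sitemap.xml",
]

# Write as plain UTF-8 text; "\n" line endings are safe for all crawlers
Path("robots.txt").write_text("\n".join(rules) + "\n", encoding="utf-8")
```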
Using an Online Robots.txt Generator
If you prefer a guided approach, several free tools generate a robots.txt file based on your inputs:
- Google’s own guidance at developers.google.com
- Yoast SEO Robots.txt Generator (for WordPress users)
- SEOptimer Robots.txt Generator
- Free Robots.txt Generator (robotstxt.net)
These tools present a form where you select which bots to allow or block and which directories to restrict, then output the finished file for you to download.
Rules for Naming and Location
Your robots.txt file must follow these rules exactly, or Google will not recognise it:
- The file must be named exactly `robots.txt`: no capital letters, no `.html` extension, no variations
- It must be saved as a plain text file with UTF-8 encoding
- It must be placed at the root directory of your site, meaning `https://www.yourdomain.com/robots.txt` must return the file
- It cannot be in a subdirectory like `https://www.yourdomain.com/files/robots.txt`
- Each subdomain requires its own robots.txt file: a file at `example.com` does not apply to `shop.example.com`
6. Step 2 — Write the Correct Robots.txt Rules
Robots.txt syntax is simple but precise. Mistakes in formatting can have serious consequences for crawling and indexing. Here is a complete breakdown of valid rules and how to use them.
Core Directives
User-agent: — Specifies which crawler the rules apply to.
User-agent: * # Applies to all crawlers
User-agent: Googlebot # Applies only to Google's main crawler
User-agent: Bingbot # Applies only to Bing's crawler
Disallow: — Blocks a crawler from accessing a specific path.
Disallow: /admin/ # Block the entire /admin/ directory
Disallow: /private.html # Block a specific page
Disallow: / # Block the entire site (use with extreme caution!)
Disallow: # Allow everything (empty value = allow all)
Allow: — Explicitly permits access to a path, used to override a broader Disallow rule.
Allow: /admin/public/ # Allow this subdirectory even if /admin/ is disallowed
Allow: / # Allow all (this is the default — usually not needed)
Sitemap: — Tells crawlers where to find your XML sitemap.
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-news.xml
Complete Robots.txt Examples
Example 1: Allow all crawlers to access the entire site (most common for new sites)
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
Example 2: Block admin, staging, and search result pages
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /staging/
Disallow: /search?
Disallow: /?s=
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.example.com/sitemap.xml
Example 3: Block all crawlers from the entire site (useful for staging/development)
User-agent: *
Disallow: /
Example 4: Block all crawlers except Googlebot
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
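You can verify that a group-based file like Example 4 behaves as intended with Python's standard-library parser (the bot names and URL are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Allow: /",
    "",
    "User-agent: *",
    "Disallow: /",
])

# Googlebot matches its own group; every other bot falls through to the * group
print(rp.can_fetch("Googlebot", "https://www.example.com/page"))  # True
print(rp.can_fetch("Bingbot", "https://www.example.com/page"))    # False
```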
Example 5: Typical e-commerce site blocking thin/duplicate pages
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Allow: /
Sitemap: https://www.example.com/sitemap.xml
Syntax Rules to Always Follow
- Each directive must be on its own line; never combine multiple rules on one line
- Paths are case-sensitive: `Disallow: /Admin/` is different from `Disallow: /admin/`
- Use `#` to add comments; everything after `#` on a line is ignored by crawlers
- Leave a blank line between rule groups for different user agents
- The `*` wildcard matches any sequence of characters in a path
- A `$` at the end of a pattern matches only URLs that end there (e.g., `Disallow: /page.html$` blocks `/page.html` but not `/page.html?x=1`)
- A crawler obeys only the most specific user-agent group that matches it; within a group, the most specific (longest-path) rule wins, regardless of order
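The `*` and `$` behaviour can be expressed as a tiny matcher. This is a simplified sketch of the pattern matching only (the function name is our own, and it ignores the longest-match precedence real crawlers also apply):

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.
    '*' matches any run of characters; a trailing '$' anchors the
    match to the end of the URL. Simplified sketch."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"   # turn the literal '$' back into an end anchor
    return re.match(regex, path) is not None

print(robots_pattern_matches("/*?sort=", "/shoes?sort=price"))   # True
print(robots_pattern_matches("/page.html$", "/page.html"))       # True
print(robots_pattern_matches("/page.html$", "/page.html?x=1"))   # False
```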
7. Step 3 — Upload Robots.txt to Your Website’s Root Directory
Once your robots.txt file is created and written correctly, it needs to be uploaded to the root directory of your website. The exact method depends on your hosting environment.
Via FTP or SFTP (Traditional Hosting)
- Open your FTP client (FileZilla, Cyberduck, or WinSCP are popular free options)
- Connect to your server using your hosting credentials
- Navigate to the root directory of your website (typically `public_html`, `www`, or `htdocs`)
- Upload the `robots.txt` file to this root directory
- If a robots.txt file already exists, confirm the overwrite
Verify the path: After uploading, the file must be accessible at https://www.yourdomain.com/robots.txt — not in any subfolder.
Via cPanel File Manager (Shared Hosting)
- Log in to your hosting account’s cPanel
- Click File Manager
- Navigate to `public_html` (your site’s root)
- Click Upload and select your `robots.txt` file
- If one already exists, overwrite it
Via SSH / Command Line
If you have SSH access to your server:
# Using curl to download an existing robots.txt (if updating)
curl https://example.com/robots.txt -o robots.txt
# Edit the file locally, then upload using SCP
scp robots.txt user@yourserver.com:/var/www/html/robots.txt
Via CMS/Platform Dashboard (See Section 13–14)
WordPress, Wix, Shopify, and other hosted platforms have their own methods for managing robots.txt — covered in detail in Sections 13 and 14.
8. Step 4 — Verify Robots.txt Is Publicly Accessible
Before checking Search Console, confirm that your robots.txt file is publicly accessible and returning the correct content.
Method 1: Browser Check (Simplest)
- Open a private / incognito browser window (to avoid cached content)
- Type your domain followed by `/robots.txt` in the address bar: `https://www.yourdomain.com/robots.txt`
- You should see the plain text content of your robots.txt file
- If you see a 404 error, the file is not in the correct location or is not named correctly
- If you see a blank page or HTML, the file may have been saved in the wrong format
Method 2: Using curl
From a terminal or command prompt:
curl -I https://www.yourdomain.com/robots.txt
Look for HTTP/2 200 or HTTP/1.1 200 OK in the response. A 404 means the file is missing; a 500 means a server error.
9. Step 5 — Check Robots.txt in Google Search Console
Google Search Console includes a dedicated robots.txt report that shows you how Google sees your robots.txt file — whether it has been successfully fetched, any errors or warnings present, and the full history of crawl requests.
How to Access the Robots.txt Report
- Go to search.google.com/search-console and log in
- Select your website property from the dropdown
- In the left sidebar, click Settings (gear icon at the bottom)
- Scroll down to find the robots.txt section
- Click to expand the robots.txt report
Note: The robots.txt report is available only for properties at the domain level. That means either a Domain property (such as example.com or m.example.com), or a URL-prefix property without a path, such as https://example.com/, but not https://example.com/path/.
If you have a URL-prefix property with a path (for example, https://example.com/blog/), you will not see the robots.txt report. In this case, set up a Domain property or root URL-prefix property to access it.
What the Robots.txt Report Shows
The robots.txt report shows which robots.txt files Google found for the top 20 hosts on your site, the last time they were crawled, and any warnings or errors encountered.
For each robots.txt file, the report displays:
File path: The full URL where Google checked for the robots.txt file.
Fetch status: The result of Google’s most recent attempt to retrieve your robots.txt. Possible values include:
- ✅ Fetched: Google successfully retrieved and parsed your robots.txt file. No critical issues.
- ⚠️ Fetched with warnings: Google found the file but encountered non-critical issues — for example, unrecognised directives or syntax that may not behave as intended.
- ❌ Not Fetched — Not found (404): Google could not find a robots.txt file at this URL. If you have not intentionally removed the file, check that it is correctly uploaded to your root directory.
- ❌ Not Fetched — Other reason: Another error occurred during the fetch — typically a server error (5xx), DNS failure, or connection timeout.
Last crawl date: When Google last fetched your robots.txt file.
Version history: Click a file in the report’s file list, then click Versions to see the fetch requests for that robots.txt file over the last 30 days; click a version to view the file contents at that point. A fetch appears in the history only when the retrieved file or the fetch result differs from the previous fetch request.
10. Step 6 — Request a Recrawl (When You’ve Updated the File)
When you make changes to your robots.txt file — especially urgent changes like unblocking important pages that were accidentally blocked — you can ask Google to re-fetch the file faster than its normal 24-hour update cycle.
When to Request a Recrawl
You generally don’t need to request a recrawl, because Google re-fetches robots.txt files often on its own. Request one when you fix an error or make a critical change, typically when you have changed your rules to unblock important URLs and want Google to pick up the new rules quickly (this does not guarantee an immediate recrawl of the unblocked URLs themselves).
Other good reasons to request a recrawl:
- You discovered your entire site was accidentally blocked and you’ve just fixed it
- You added a new subdomain with its own robots.txt and want Google to acknowledge it quickly
- You’ve made critical blocking changes before a product launch or campaign
How to Request a Recrawl — Step by Step
- Log in to Google Search Console at search.google.com/search-console
- Select your website property
- Click Settings in the left navigation (gear icon)
- Scroll to the robots.txt section
- Find the robots.txt file you want to refresh in the file list
- Click the three-dot menu (⋮) icon next to the file
- Click “Request a recrawl”
- Confirm the request
Google will then prioritise re-fetching that robots.txt file sooner than the standard automated cycle. You will see the updated “last crawled” timestamp in the report once the recrawl is complete.
Important: Requesting a recrawl updates Google’s cached copy of your robots.txt rules, but it does not immediately trigger a recrawl of all the URLs affected by those rules. Googlebot will apply the new rules the next time it crawls each individual page.
11. Understanding the Robots.txt Report in Search Console
The robots.txt report replaced the robots.txt Tester tool, which Google retired in late 2023. Understanding what the new report shows helps you interpret the data correctly.
Report Sections Explained
Files list: Shows all robots.txt files Google has found across the top 20 hosts in your property. For most single-domain sites, this is just one entry — https://www.yourdomain.com/robots.txt.
For sites with multiple subdomains (such as www.example.com, blog.example.com, shop.example.com), each subdomain’s robots.txt file appears as a separate entry. Each must be configured independently.
Fetch status details:
| Status | What It Means | What to Do |
|---|---|---|
| Fetched | File found and parsed successfully | No action needed |
| Fetched with warnings | File found but has syntax issues | Review warnings and fix the file |
| Not Fetched (404) | File not found at this URL | Upload the file to the correct root directory |
| Not Fetched (other) | Server error, DNS issue, or connection failure | Check your server health and availability |
Version history: A chronological log of every time Google fetched your robots.txt file and found it different from the previous version. This helps you confirm that Google has picked up your recent changes. If you updated your file but do not see a new version entry, use the Request a Recrawl option.
Robots.txt Information in the Page Indexing Report
In addition to the dedicated robots.txt report, Search Console surfaces robots.txt information in the Page Indexing report (under Indexing → Pages). Pages that are blocked by robots.txt will appear in the “Why pages aren’t indexed” section under the label “Blocked by robots.txt.” This gives you a URL-level view of which specific pages are affected.
12. How to Update an Existing Robots.txt File
If you already have a robots.txt file and need to modify it, follow this process:
Step 1: Download Your Current Robots.txt
You can retrieve your current robots.txt in several ways:
- Visit `https://yourdomain.com/robots.txt` in a browser, select all, and copy the content
- Use curl: `curl https://yourdomain.com/robots.txt -o robots.txt`
- Use the robots.txt report in Search Console to copy the file contents, then paste them into a file on your computer
Step 2: Edit the File
Open the downloaded file in a plain text editor. Make your changes, being careful with:
- Correct spacing (no trailing spaces after directives)
- Correct case (paths are case-sensitive)
- UTF-8 encoding when saving
Step 3: Re-upload
Upload the edited file back to the root directory of your site, overwriting the existing file.
Step 4: Verify and Request Recrawl
- Open a private browser window and check `yourdomain.com/robots.txt` to confirm the updated content
- Go to Search Console → Settings → robots.txt → Request a recrawl to fast-track Google’s cache update
13. How to Add Robots.txt on WordPress
WordPress handles robots.txt in two ways: via a plugin (recommended) or by uploading a physical file.
Method 1: Using Yoast SEO (Most Popular)
Yoast SEO, the most widely used WordPress SEO plugin, provides a built-in robots.txt editor:
- In your WordPress dashboard, go to SEO → Tools
- Click File editor
- If a robots.txt file already exists on the server, Yoast will display it here for editing
- Make your changes in the text area
- Click Save changes to robots.txt
Yoast saves the file directly to your site’s root — no FTP required.
Method 2: Using Rank Math SEO
Rank Math, another popular WordPress SEO plugin, also includes a robots.txt editor:
- Go to Rank Math → General Settings
- Click the Edit robots.txt button
- Edit the rules in the text area
- Click Save Changes
Method 3: Uploading a Physical File via FTP
- Create your robots.txt file in a text editor
- Connect to your server via FTP (using FileZilla or similar)
- Navigate to your WordPress root directory (where `wp-config.php` lives)
- Upload the `robots.txt` file
Note: A physical robots.txt file takes precedence over the virtual file WordPress generates by default. Once a physical file exists, plugin editors such as Yoast read and write that file directly.
WordPress Default Robots.txt
If no physical robots.txt file exists, WordPress serves a virtual default:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
This is the minimum recommended configuration — it blocks the admin panel while keeping the AJAX handler accessible (required for some front-end WordPress features).
14. How to Add Robots.txt on Wix, Shopify & Other Hosted Platforms
Fully hosted website builders manage the server environment for you, which means you typically cannot upload a physical file to the root directory. Instead, each platform provides its own mechanism.
Wix
Wix does not allow users to edit the robots.txt file directly on standard plans. The platform automatically generates a robots.txt file for your site. To influence crawling behaviour, use Wix’s SEO settings:
- Go to your Wix dashboard → SEO & Marketing → SEO Tools
- Use the Advanced SEO settings to hide specific pages from search engines
- For full robots.txt control, Wix Business Elite plan users can contact Wix support to request customisation
Shopify
Shopify allows robots.txt customisation through a `robots.txt.liquid` theme template:
- In your Shopify admin, go to Online Store → Themes
- Click Actions → Edit code on your active theme
- Find `robots.txt.liquid` in the Templates section (if it doesn’t exist, create it)
- Edit the Liquid template to add or override robots.txt rules
- Click Save
For simpler changes, Shopify has a default robots.txt that blocks internal pages, checkout, cart, and account pages automatically.
Squarespace
Squarespace does not provide direct robots.txt editing. The platform manages a default robots.txt configuration that allows all crawlers access to public pages. To hide a page from search:
- Go to Pages → Page settings for the specific page
- Toggle Hide this page from search results under SEO settings
Blogger (Google)
Blogger provides a custom robots.txt editor:
- Go to your Blogger dashboard → Settings
- Scroll to Crawlers and indexing
- Enable Custom robots.txt
- Enter your robots.txt content
- Click Save changes
15. Common Robots.txt Mistakes and How to Fix Them
Mistake 1: Blocking the Entire Site
# WRONG — This blocks everything, including all your important pages
User-agent: *
Disallow: /
This is one of the most catastrophic robots.txt errors — it prevents Google from crawling your entire website. It commonly happens when:
- A developer sets it up on a staging environment and forgets to remove it before going live
- A WordPress plugin or theme update overwrites the robots.txt with a default “block all” configuration
Fix: Change the Disallow to a specific path or remove it entirely. If you want Google to crawl everything, use:
User-agent: *
Allow: /
Or simply delete the file entirely — Google will crawl all pages if no robots.txt exists.
Mistake 2: Blocking CSS and JavaScript Files
Blocking stylesheets and scripts prevents Google from rendering your pages correctly, which can negatively impact how Google understands your site’s content and user experience.
# WRONG
User-agent: *
Disallow: /wp-content/
Disallow: /assets/
Fix: Only block specific resource directories if truly necessary. Do not block the directories that contain your site’s CSS, JavaScript, or fonts.
Mistake 3: Using Robots.txt Instead of Noindex
# THIS DOES NOT RELIABLY REMOVE PAGES FROM GOOGLE'S INDEX
User-agent: *
Disallow: /thank-you/
Disallow: /confirmation/
Blocking a page in robots.txt does not remove it from Google’s index: a page that was already indexed stays indexed, and a blocked page can still be indexed if other sites link to it.
Fix: Use the noindex meta tag inside the <head> section of the page you want excluded from search results:
<meta name="robots" content="noindex, follow">
Then allow crawlers to access the page in robots.txt (or leave it unblocked) so they can read the noindex instruction.
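To confirm a page actually carries the tag, you can check its HTML with Python's standard-library parser. A minimal sketch (the class name and sample HTML are our own):

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the content of any <meta name="robots"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.robots_content = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                self.robots_content = d.get("content")

html = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
finder = RobotsMetaFinder()
finder.feed(html)
print(finder.robots_content)  # noindex, follow
```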
Mistake 4: Wrong File Name or Location
Common mistakes:
- Naming the file `Robots.txt`, `robot.txt`, or `robots.txt.txt`
- Placing it in a subdirectory like `/files/robots.txt` instead of the root
Fix: The file must be exactly robots.txt (all lowercase) and located at the root of your domain.
Mistake 5: Syntax Errors
# WRONG — Missing colon
User-agent *
Disallow /admin/
# WRONG — Multiple directives on one line
User-agent: * Disallow: /admin/
# CORRECT
User-agent: *
Disallow: /admin/
Fix: Each directive must be on its own line. Always include the colon after the directive name. Use one of the third-party testing tools in Section 20 to catch syntax errors before uploading.
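Simple slips like a missing colon or a misspelled directive can be caught before upload with a few lines of validation. A minimal linter sketch (the directive list and messages are our own, and it won't catch every mistake, such as two directives crammed onto one line):

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text: str) -> list:
    """Flag lines that are missing a colon or use an unknown directive name."""
    problems = []
    for n, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: missing ':' after the directive name")
            continue
        name = line.split(":", 1)[0].strip().lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append(f"line {n}: unknown directive '{name}'")
    return problems

print(lint_robots("User-agent *\nDisallow: /admin/"))   # flags line 1 only
print(lint_robots("User-agent: *\nDisallow: /admin/"))  # []
```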
Mistake 6: Forgetting the Trailing Slash on Directories
# This blocks every URL that starts with /admin, including /administrator and /admin.html
Disallow: /admin
# This blocks only the /admin/ directory and everything under it
Disallow: /admin/
Fix: Robots.txt paths are prefix matches. Add a trailing slash when you mean a directory, so the rule does not also catch unrelated URLs that merely share the prefix.
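Because `Disallow` values are prefix matches, the trailing slash matters, and the difference is visible with Python's standard-library parser (the URLs are illustrative):

```python
from urllib import robotparser

no_slash = robotparser.RobotFileParser()
no_slash.parse(["User-agent: *", "Disallow: /admin"])

with_slash = robotparser.RobotFileParser()
with_slash.parse(["User-agent: *", "Disallow: /admin/"])

# Without the slash, the prefix match also catches /administrator
print(no_slash.can_fetch("*", "https://www.example.com/administrator"))    # False
print(with_slash.can_fetch("*", "https://www.example.com/administrator"))  # True
print(with_slash.can_fetch("*", "https://www.example.com/admin/settings")) # False
```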
16. How to Fix “Blocked by Robots.txt” Errors in Google Search Console
If Google Search Console shows pages with a “Blocked by robots.txt” status under Indexing → Pages, it means Google has found pages it wants to crawl but is being prevented by your robots.txt rules.
How to Investigate
- Go to Google Search Console → Indexing → Pages
- Look for pages listed under “Why pages aren’t indexed”
- Click “Blocked by robots.txt” to see the specific URLs affected
Step-by-Step Fix
Step 1: Identify which rule is blocking the page.
Visit https://yourdomain.com/robots.txt and look for Disallow rules that match the blocked URL pattern.
Step 2: Determine whether the block is intentional.
- If it IS intentional (e.g., you do not want admin pages or private content crawled), no action is needed — the report is informational, not necessarily an error.
- If it is NOT intentional, proceed to fix the robots.txt file.
Step 3: Edit your robots.txt to remove or modify the blocking rule.
For example, if /products/sale/ is accidentally blocked:
# Remove or modify this line:
Disallow: /products/
# Replace with specific blocks only:
Disallow: /products/internal/
Disallow: /products/drafts/
Step 4: Upload the corrected robots.txt file to your root directory.
Step 5: Request a recrawl in Search Console (Settings → robots.txt → Request a recrawl).
Step 6: Use the URL Inspection Tool in Search Console to request indexing for the specific pages that were unblocked.
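Before requesting the recrawl, you can confirm locally that the affected URLs are no longer blocked, using Python's standard-library parser with your corrected rules (the paths reuse the illustrative example above):

```python
from urllib import robotparser

# The corrected rules: only the internal and drafts subdirectories stay blocked
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /products/internal/",
    "Disallow: /products/drafts/",
])

print(rp.can_fetch("*", "https://www.example.com/products/sale/"))      # True
print(rp.can_fetch("*", "https://www.example.com/products/internal/"))  # False
```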
17. How to Fix “Indexed, Though Blocked by Robots.txt” Errors
This status is the inverse problem: Google has indexed a page even though your robots.txt says not to crawl it.
When you block a page with robots.txt, you tell search engines not to crawl it, but crawling and indexing are separate processes. If other signals, such as backlinks or the page’s apparent importance, tell Google the URL matters, Google can index the URL without ever crawling it, and it may still appear in search results.
Why This Happens
- External websites link to the blocked page — Google knows the URL exists, even if it cannot read the content
- The page was already indexed before the robots.txt block was added
- Conflicting signals between robots.txt and meta tags
The Correct Fix
If you do NOT want the page in Google’s index:
- Remove the `Disallow` rule from robots.txt (allow Googlebot to access the page)
- Add a `noindex` meta tag to the page’s HTML: `<meta name="robots" content="noindex, follow">`
- Request a recrawl of the page via the URL Inspection Tool
- Google will crawl the page, read the `noindex` tag, and remove it from the index
If you DO want the page in the index but don’t want it crawled (unusual edge case):
This is not a recommended configuration. If you want a page in Google’s index, allow crawling and use noindex selectively for pages you want excluded.
Key principle: if you don’t want a page indexed, add a `noindex` tag and remove the `disallow` directive from robots.txt. If you keep both, Google can never crawl the page to see the `noindex` tag, so the “Indexed, though blocked by robots.txt” report in Google Search Console will keep growing and the issue will never resolve.
18. Robots.txt vs. Noindex: Which Should You Use?
One of the most commonly confused distinctions in technical SEO is when to use robots.txt versus the noindex meta tag.
| | Robots.txt Disallow | Noindex Meta Tag |
|---|---|---|
| Prevents crawling | ✅ Yes | ❌ No (page is still crawled) |
| Prevents indexing | ❌ Not reliably | ✅ Yes |
| Blocks page from appearing in search results | ⚠️ Not guaranteed | ✅ Yes, when crawled |
| Google can still index the URL | ✅ Yes (if linked externally) | ❌ No (once crawled and processed) |
| Good for hiding page content from crawlers | ✅ Yes | ❌ No |
| Good for managing crawl budget | ✅ Yes | ❌ No |
| Good for removing pages from search results | ❌ No | ✅ Yes |
| Who honours it | Crawlers that respect the robots standard | Crawlers that read meta tags |
When to Use Robots.txt Disallow
- Pages you never want crawled at all: admin interfaces, staging areas, internal search result pages, faceted navigation URLs
- Thin or duplicate content that exists primarily for technical reasons and consumes crawl budget
- Large directories of files (like `/cdn-uploads/raw/`) that have no indexing value
When to Use Noindex
- Thank you pages, confirmation pages, or other pages you don’t want appearing in search results but that need to be technically accessible
- Duplicate pages (pagination beyond page 1, printer-friendly versions)
- Pages with valuable content for users that should not appear in Google’s index
The Rule to Remember
Use robots.txt to control crawling. Use noindex to control indexing. When in doubt, allow crawling and use noindex — it gives Google clearer instructions and avoids the “Indexed, though blocked by robots.txt” problem.
19. Robots.txt Best Practices for SEO
1. Always Include Your Sitemap URL
Sitemap: https://www.example.com/sitemap.xml
Adding your sitemap to robots.txt helps Google discover it even if you have not submitted it via Search Console. Many crawlers read the sitemap location from robots.txt automatically.
2. Test Before Publishing
Before uploading any robots.txt change to your live site, test it with a third-party tool (see Section 20) to verify the rules behave as intended. A single typo in a robots.txt file can accidentally block your entire site from Google.
3. Block Low-Value Pages That Consume Crawl Budget
For large sites (tens of thousands of pages), conserving crawl budget is important. Common candidates for blocking:
Disallow: /search/ # Internal search result pages
Disallow: /tag/ # WordPress tag archives (if not valuable)
Disallow: /*?sort= # Faceted navigation sort parameters
Disallow: /*?filter= # Filter URL parameters
Disallow: /print/ # Printer-friendly page versions
Disallow: /feed/ # RSS feed directories
4. Do Not Block Pages You Want Indexed
This sounds obvious but is one of the most common mistakes. Always cross-reference your robots.txt Disallow rules against your target pages to ensure important content is not accidentally blocked.
5. Review Robots.txt After Major Site Changes
Large structural changes — URL restructuring, new subdirectories, migration to a new platform — often require updating robots.txt. Add a robots.txt review to your migration and launch checklists.
6. Keep It Simple
A complex robots.txt with hundreds of rules is difficult to maintain and can produce unexpected interactions between rules. Keep the file as simple as your site architecture requires.
7. Separate Staging and Production Environments
Your staging/development environment should have:
User-agent: *
Disallow: /
Your production environment should never have this rule. Use environment variables or deployment pipeline checks to ensure the wrong robots.txt does not end up on your live site.
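One way to enforce this is a small check in your deployment pipeline that fails the build if the robots.txt being shipped to production blocks everything. Below is a minimal sketch; the function name and the simplified group parsing are illustrative assumptions, not a full robots.txt parser:

```python
def blocks_all_crawlers(robots_txt: str) -> bool:
    """Return True if the file contains Disallow: / in a User-agent: * group."""
    agents, reading_agents = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()       # strip comments and whitespace
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not reading_agents:
                agents = []                       # a new agent group begins
            agents.append(value)
            reading_agents = True
        elif field in ("allow", "disallow"):
            reading_agents = False
            if field == "disallow" and "*" in agents and value == "/":
                return True
    return False

# In a deploy script you might read the file and abort on failure, e.g.:
#   if os.environ.get("DEPLOY_ENV") == "production" and \
#      blocks_all_crawlers(open("robots.txt").read()):
#       sys.exit("robots.txt would block all crawlers in production")
print(blocks_all_crawlers("User-agent: *\nDisallow: /\n"))        # True
print(blocks_all_crawlers("User-agent: *\nDisallow: /admin/\n"))  # False
```

Wiring this into CI means a staging robots.txt can never silently reach the live site.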
20. Tools to Test Your Robots.txt File
Since Google retired the built-in robots.txt Tester from Search Console in 2023, testing now requires either Search Console's URL Inspection Tool or one of the following third-party alternatives:
Google Search Console — URL Inspection Tool
The URL Inspection Tool can simulate how Google sees a specific URL, including whether it is blocked by robots.txt:
- In Search Console, click the search bar at the top and enter the URL you want to inspect
- Click “Test live URL” to check the current state
- If the page is blocked by robots.txt, the tool will indicate this under “URL is not on Google” → “Blocked by robots.txt”
Google’s Open Source Robots.txt Library
For developers, Google maintains an open-source implementation of its robots.txt parser on GitHub:
https://github.com/google/robotstxt
This is the same library used in Google Search. Developers can use it to test robots.txt rules locally before deploying.
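If you want a quick local check without building Google's C++ library, Python's standard-library `urllib.robotparser` offers a rough approximation. Note it does not implement Google's wildcard extensions (`*` and `$` in paths), so results can differ for pattern-based rules:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; in practice, read your local robots.txt file instead.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a given crawler may fetch a given URL under these rules.
print(parser.can_fetch("Googlebot", "https://www.example.com/admin/login"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post"))    # True
```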
Third-Party Testing Tools
| Tool | URL | Features |
|---|---|---|
| Merkle Robots.txt Tester | technicalseo.com/tools/robots-txt-tester | Free; test any URL against custom rules |
| Ryte Robots.txt Checker | ryte.com | Free; validates syntax and tests URLs |
| Screaming Frog | screamingfrog.co.uk | Desktop crawler; tests during site audit |
| SEOptimer | seoptimer.com/robots-txt-tester | Free; simple interface |
| Bing Webmaster Tools | bing.com/webmasters | Bing still has a robots.txt tester (useful for validating syntax) |
Testing Workflow
- Create or edit your robots.txt file locally
- Run it through a third-party tester with specific URLs you want to verify are allowed or blocked
- Confirm the results match your intentions
- Upload to your server
- Verify via browser in private mode
- Check Search Console robots.txt report
- Request a recrawl if you’ve made significant changes

21. Frequently Asked Questions
Q: Does Google require me to submit my robots.txt file?
No. Google automatically discovers and reads your robots.txt file without any manual submission. What you can do in Search Console is monitor whether Google has found it successfully and request a faster cache refresh after making changes.
Q: How long does it take for Google to pick up changes to robots.txt?
Google caches your robots.txt file and refreshes that cached copy automatically, generally about once every 24 hours. If you need changes picked up sooner, use the Request a Recrawl option in the Search Console robots.txt report.
Q: Can I have more than one robots.txt file?
Each domain and subdomain can have only one robots.txt file, located at its root. example.com/robots.txt does not apply to shop.example.com — that subdomain needs its own shop.example.com/robots.txt.
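A quick way to see which robots.txt governs a given page is to reduce the URL to its scheme and host — which also makes the subdomain rule obvious. A minimal sketch using Python's standard library (`robots_url` is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Robots.txt location that governs a given URL: scheme + host + /robots.txt."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post"))       # https://example.com/robots.txt
print(robots_url("https://shop.example.com/cart?id=1"))  # https://shop.example.com/robots.txt
```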
Q: My site is blocked by robots.txt in Search Console — is this always a problem?
Not necessarily. If the blocked pages are ones you intentionally do not want crawled (admin panels, staging directories, internal search pages), the report is purely informational. It is only a problem if pages you want indexed are showing as blocked.
Q: Will blocking a page with robots.txt hurt its rankings?
Yes, if the blocked page was previously indexed and you want it to rank. Blocking a page prevents Google from reading its content, which means it cannot be evaluated for relevance. Over time, Google may drop blocked pages from its index entirely — or retain them as empty URL entries.
Q: Can I block specific Googlebot bots (e.g., Googlebot-Image)?
Yes. Google has multiple specialised crawlers with specific user agent names:
User-agent: Googlebot-Image
Disallow: / # Block all images from Google Image Search
User-agent: Googlebot-Video
Disallow: /videos/ # Block specific video directory
Q: Is the robots.txt Tester completely gone from Google Search Console?
Yes. Google has sunset the robots.txt tester. It has been replaced by the robots.txt report under Settings, and by the URL Inspection Tool for testing specific URLs. Third-party tools (listed in Section 20) are now the recommended way to test robots.txt syntax and URL matching.
Q: What happens if I delete my robots.txt file?
If Google returns a 404 when fetching robots.txt, it treats this as “no restrictions” and will crawl all pages on your site. This is not harmful for most sites — it simply means all pages are eligible to be crawled and indexed.
22. Final Summary Checklist
Use this checklist every time you create, update, or audit your robots.txt file:
Creating or Updating Robots.txt
- [ ] File is named exactly `robots.txt` (lowercase, no extension)
- [ ] File is saved in UTF-8 encoding
- [ ] File is located at the root: `https://yourdomain.com/robots.txt`
- [ ] Each directive is on its own line
- [ ] Paths in Disallow rules start with `/`
- [ ] Directories in Disallow rules end with `/`
- [ ] Sitemap URL is included at the bottom
- [ ] The file does NOT contain `Disallow: /` for all user agents (unless intentionally blocking all crawlers, e.g., staging)
Testing Before Publishing
- [ ] Tested with a third-party tool (Merkle, Ryte, or Bing Webmaster Tools)
- [ ] Verified all important URLs are not accidentally blocked
- [ ] Verified all intended blocked URLs are blocked
After Publishing
- [ ] Visited `https://yourdomain.com/robots.txt` in a private browser window — content displays correctly
- [ ] Checked Google Search Console → Settings → robots.txt — Fetch status shows “Fetched”
- [ ] Requested a recrawl in Search Console if changes are urgent
- [ ] Checked Indexing → Pages for any new “Blocked by robots.txt” errors
Ongoing Maintenance
- [ ] Review robots.txt after any major site restructure or migration
- [ ] Add robots.txt review to your launch checklist for new sites and environments
- [ ] Ensure staging environments have `Disallow: /` and production does not
- [ ] Monitor Search Console robots.txt report monthly for new errors or warnings
This article is based on official Google Search documentation last updated November 2025, supplemented by verified SEO expert sources. For the most current information on Google’s robots.txt handling, refer to developers.google.com/search/docs/crawling-indexing/robots/intro and the robots.txt report help page.

I’m Md Nasir Uddin, a digital marketing consultant with over 9 years of experience helping businesses grow through strategic, data-driven marketing. As the founder of Macroter, my goal is to provide businesses with innovative solutions that lead to measurable results. I’m passionate about staying ahead of industry trends and helping businesses thrive in the digital landscape. Let’s work together to take your marketing efforts to the next level.