SEO · May 19, 2026 · 5 min read

Robots.txt Setup Done Right: Rules, Examples, and Gotchas

Learn how to set up robots.txt correctly with real directives, working examples, and common mistakes that silently block your site from Google.

A misconfigured robots.txt file is one of the fastest ways to vanish from search results. It's a plain text file, but the rules are picky — one stray slash and you've blocked your entire site from Googlebot. This walkthrough covers exactly how to set up robots.txt correctly, with directives that actually work in production and the edge cases that trip up most teams.

What robots.txt actually does (and what it doesn't)

The robots.txt file lives at the root of your domain — https://example.com/robots.txt — and tells crawlers which paths they're allowed to fetch. It's a request, not an enforcement mechanism. Reputable bots (Googlebot, Bingbot, Applebot) respect it. Malicious scrapers ignore it entirely.

Critical distinction: robots.txt controls crawling, not indexing. If a blocked page has external links pointing to it, Google can still index the URL without ever fetching the content. To prevent indexing, use a noindex meta tag or an X-Robots-Tag HTTP header — and make sure the page isn't blocked in robots.txt, or Google can't read the noindex directive.
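
As a rough illustration of the header-based approach, the sketch below is a minimal WSGI app that adds X-Robots-Tag: noindex to responses for one path prefix. The /internal-report prefix and the response body are hypothetical; the point is only that the page stays crawlable while the header tells search engines not to index it.

```python
from wsgiref.simple_server import make_server

# Hypothetical path prefix that should stay out of the index.
NOINDEX_PREFIX = "/internal-report"

def app(environ, start_response):
    """Serve a placeholder page, adding a noindex header for selected paths."""
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if environ.get("PATH_INFO", "").startswith(NOINDEX_PREFIX):
        # The page must NOT be disallowed in robots.txt, or this header is never seen.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example</title>"]

if __name__ == "__main__":
    # Local test server only; port 8000 is an arbitrary choice.
    make_server("", 8000, app).serve_forever()
```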

The core syntax you need to know

Every rule sits inside a group that starts with a User-agent line. Within each group, you list Allow and Disallow directives.

The four directives that matter

User-agent: — which bot the rules apply to. Use * for all crawlers.
Disallow: — paths the bot should not crawl.
Allow: — exceptions to a Disallow rule (useful for nested paths).
Sitemap: — absolute URL to your XML sitemap. Can appear anywhere in the file.

Path matching rules

Paths are case-sensitive: /Admin/ and /admin/ are different.
* matches any sequence of characters.
$ anchors a match to the end of the URL.
A trailing slash matters: rules match by prefix, so Disallow: /private blocks /private, /private/page, and also /private-notes.html; Disallow: /private/ blocks only URLs under the directory and leaves /private itself crawlable.
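
The matching rules above can be sketched in a few lines of Python. This is a simplified model, not Googlebot's actual implementation: the function names are made up for illustration, and the sketch ignores percent-encoding normalisation.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    # A trailing $ pins the match to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then let every * stand for any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def path_matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern applies to the path (case-sensitive)."""
    return robots_pattern_to_regex(pattern).match(path) is not None

if __name__ == "__main__":
    for pattern, path in [
        ("/private", "/private/page"),
        ("/private/", "/private"),
        ("/admin/", "/Admin/"),
        ("/*.pdf$", "/docs/guide.pdf"),
        ("/*.pdf$", "/docs/guide.pdf?download=1"),
    ]:
        print(pattern, path, path_matches(pattern, path))
```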

A working robots.txt template

Here's a sensible starting point for a typical CMS-driven site:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search?
Disallow: /*?sessionid=
Disallow: /*.pdf$
Allow: /admin/public-assets/

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

What's happening here:

All bots are blocked from admin, cart, checkout, and internal search pages.
Any URL with a sessionid parameter is blocked to prevent duplicate-content crawls.
PDF files are excluded from crawling using the $ end-anchor.
An Allow rule carves out a public subdirectory inside /admin/.
AI training crawlers (GPTBot, CCBot) are blocked outright.
The sitemap is declared at the bottom so search engines find it without guesswork.

Common mistakes that quietly destroy rankings

1. Blocking CSS and JavaScript

Old advice said to block /wp-includes/ or /assets/js/. Don't. Googlebot renders pages like a browser and needs your CSS and JS to see the layout, judge mobile-friendliness, and read any content injected by scripts. Block them and Google evaluates a broken version of the page.

2. Using robots.txt to hide sensitive URLs

Anything listed in Disallow is public — your robots.txt is itself a public file. Listing /secret-launch-page/ there is an open invitation. Use authentication or noindex instead.

3. A single stray slash

Disallow: / blocks the entire site, while an empty Disallow: blocks nothing. This is the most common catastrophic typo, usually copied over from a staging server's robots.txt during a deploy.
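
One way to guard against this is a pre-deploy check that fails when the group for a given crawler contains a bare Disallow: /. The sketch below is a hypothetical helper, not part of any tool mentioned here, and it parses only as much of the format as the check needs.

```python
def blocks_entire_site(robots_txt: str, agent: str = "*") -> bool:
    """Return True if the group naming `agent` contains a bare 'Disallow: /'."""
    current_agents: list[str] = []
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and agent.lower() in current_agents:
                return True
    return False

if __name__ == "__main__":
    staging = "User-agent: *\nDisallow: /\n"
    print(blocks_entire_site(staging))
```

Wiring a check like this into a deploy pipeline means a staging file cannot reach production unnoticed.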

4. Conflicting Allow and Disallow rules

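When rules conflict, Google applies the most specific rule (the longest matching path), regardless of the order in which the rules appear. Allow: /blog/public/ beats Disallow: /blog/ for URLs under /blog/public/. If an Allow and a Disallow match with equal length, Google uses the less restrictive one, the Allow.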
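
The longest-match behaviour can be modelled roughly as follows. This is a sketch of the precedence logic only, with made-up function names; it measures specificity by pattern length and lets Allow win a tie.

```python
import re

def _matches(pattern: str, path: str) -> bool:
    """Prefix match with * as a wildcard and an optional trailing $ anchor."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(body + ("$" if anchored else ""), path) is not None

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide crawlability from ("allow" | "disallow", pattern) pairs of one group."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for kind, pattern in rules:
        if not pattern or not _matches(pattern, path):
            continue  # an empty pattern matches nothing
        # Longer pattern wins; on a tie the less restrictive rule (Allow) wins.
        if len(pattern) > best_len or (len(pattern) == best_len and kind == "allow"):
            best_len = len(pattern)
            allowed = kind == "allow"
    return allowed

if __name__ == "__main__":
    rules = [("disallow", "/blog/"), ("allow", "/blog/public/")]
    for path in ("/blog/public/post", "/blog/drafts/post", "/about"):
        print(path, is_allowed(rules, path))
```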

5. Forgetting the file is per-protocol and per-subdomain

https://example.com/robots.txt doesn't apply to https://shop.example.com/ or http://example.com/. Each host needs its own.
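
To see which robots.txt governs a given page, keep the scheme and host of the page URL and replace everything else with /robots.txt. A minimal sketch using the standard library (the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to page_url."""
    parts = urlsplit(page_url)
    # Scheme and host (including any port) are kept; path and query are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

if __name__ == "__main__":
    for url in (
        "https://example.com/blog/post",
        "https://shop.example.com/cart?item=1",
        "http://example.com/",
    ):
        print(url, "->", robots_txt_url(url))
```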

Testing before you deploy

Never push a robots.txt change straight to production. The cost of a mistake is days of lost crawl coverage. Run the file through a validator that simulates how Googlebot interprets each directive against your actual URLs.

The AXOX Hub Robots.txt Analyzer parses your file, flags syntax errors, identifies overly broad disallow patterns, and lets you test specific URLs against the ruleset to see whether they'd be blocked or allowed. It also surfaces the most common footguns — like missing sitemap declarations or blocked assets — before they hit production.

User-agent specificity and ordering

Crawlers pick the most specific user-agent group that matches their name, and obey only that group. Rules from the * group are not merged in.

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /no-google/

In this example, Googlebot will only obey the second group — it will happily crawl /private/. If you want Googlebot to follow both rules, you have to repeat them in its block.
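
A simplified model of that selection step is sketched below. It assumes groups are already parsed into a dictionary keyed by user-agent token, and it treats a token as matching when the crawler name starts with it (case-insensitively); real crawlers may match names differently, so treat this as an approximation.

```python
def select_group(groups: dict[str, list[str]], crawler: str) -> list[str]:
    """Return the rules of the single most specific group for this crawler."""
    name = crawler.lower()
    best_token = None
    for token in groups:
        if token == "*":
            continue  # the wildcard group is only a fallback
        if name.startswith(token.lower()):
            if best_token is None or len(token) > len(best_token):
                best_token = token  # the longer token is the more specific one
    if best_token is not None:
        return groups[best_token]  # no merging with the * group
    return groups.get("*", [])

if __name__ == "__main__":
    groups = {
        "*": ["Disallow: /private/"],
        "Googlebot": ["Disallow: /no-google/"],
    }
    print(select_group(groups, "Googlebot"))
    print(select_group(groups, "Bingbot"))
```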

Handling parameters, faceted navigation, and pagination

E-commerce and large content sites generate millions of parameterised URLs that waste crawl budget. Robots.txt is a blunt but effective tool for this.

Filter parameters: Disallow: /*?color= and Disallow: /*&color=
Sort orders: Disallow: /*?sort=
Internal search: Disallow: /search?
Tracking parameters: handle via canonical tags instead — blocking them prevents Google from consolidating link signals.
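
The reason each filter parameter needs two patterns is that it can appear first in the query string (after ?) or later (after &). The sketch below, with hypothetical helper names and a deliberately minimal wildcard matcher, shows a URL that only the second pattern catches.

```python
import re

def param_disallow_patterns(name: str) -> list[str]:
    """Build the two patterns that catch a query parameter in any position."""
    return [f"/*?{name}=", f"/*&{name}="]

def matches(pattern: str, url_path: str) -> bool:
    """Minimal wildcard match: * stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, url_path) is not None

def is_blocked(name: str, url_path: str) -> bool:
    """True if either pattern for the parameter matches the URL."""
    return any(matches(p, url_path) for p in param_disallow_patterns(name))

if __name__ == "__main__":
    url = "/shoes?size=9&color=red"
    print(matches("/*?color=", url))  # parameter is not first, so this one misses
    print(matches("/*&color=", url))
    print(is_blocked("color", url))
```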

Pagination URLs (?page=2) should usually not be blocked. Google needs to crawl them to discover deeper content. Give each paginated page a self-referencing rel="canonical" and let Google work out the relationship between the pages.

Validating against your live site

After deploying, confirm two things:

The file is served with a 200 OK status and Content-Type: text/plain. A 404 means crawlers assume everything is allowed; a 5xx response can pause crawling entirely.
Spot-check key URLs — your homepage, category pages, top blog posts, and any newly-launched landing pages — to confirm they're crawlable.
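
The first check can be scripted. The sketch below fetches the file with the standard library and maps the response to the outcomes described above; the function names and the wording of the messages are illustrative.

```python
import urllib.error
import urllib.request

def fetch_robots(url: str) -> tuple[int, str]:
    """Fetch robots.txt and return (status code, Content-Type header)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want to inspect.
        return err.code, err.headers.get("Content-Type", "")

def interpret(status: int, content_type: str) -> str:
    """Map a robots.txt response to what it means for crawling."""
    if status == 200:
        if content_type.lower().startswith("text/plain"):
            return "ok"
        return "served, but Content-Type is not text/plain"
    if status == 404:
        return "missing: crawlers assume everything is allowed"
    if 500 <= status <= 599:
        return "server error: crawling may be paused"
    return "unexpected status: check manually"

if __name__ == "__main__":
    # example.com is a placeholder; substitute your own host.
    print(interpret(*fetch_robots("https://example.com/robots.txt")))
```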

Run your final file through the Robots.txt Analyzer at AXOX Hub to validate syntax, test specific URLs, and catch the silent mistakes before Googlebot does.
