XML Sitemaps at Scale: Best Practices for Large Sites
Learn XML sitemap best practices for large sites: splitting, indexing, freshness signals, and validation tactics that keep millions of URLs crawlable.
Running a sitemap for a 50-page brochure site is trivial. Running one for an ecommerce catalog with 4 million SKUs, a news archive, or a marketplace with constant URL churn is a different problem entirely. At scale, sitemaps stop being a nice-to-have file and become the primary signal you give search engines about which URLs matter, when they changed, and how to prioritize crawl budget.
Below are the practices that actually move the needle when you're dealing with hundreds of thousands — or millions — of URLs.
Respect the hard limits, then plan well below them
Google and Bing both enforce:
- 50,000 URLs per sitemap file
- 50 MB uncompressed file size
- 50,000 sitemaps per sitemap index
That gives you a theoretical ceiling of 2.5 billion URLs across one index. In practice, don't push these limits. Aim for 10,000–20,000 URLs per file. Smaller files parse faster, fail more gracefully, and make it easier to diagnose which segment of your site has indexing problems.
Why smaller files win
- If one file returns a 500 or contains malformed XML, you lose 10,000 URLs from the crawl queue instead of 50,000.
- Search Console reports indexing stats per sitemap. Smaller, logically grouped files give you sharper diagnostics.
- Re-generating a 10k file on content change is cheaper than rebuilding a 50k file.
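As a rough illustration of the chunking described above, the sketch below splits a URL list into files of at most 10,000 entries and serializes each chunk as a urlset document. The names chunk_urls, build_urlset, and MAX_URLS_PER_FILE are hypothetical, and the 10,000 figure is simply the low end of the target range suggested above, not a protocol requirement.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed working target, well below the 50,000-URL hard limit.
MAX_URLS_PER_FILE = 10_000


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_urlset(urls):
    """Serialize one chunk of URLs as a sitemap <urlset> document (bytes)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # ElementTree escapes &, <, and > in text when serializing.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

In a real pipeline you would probably chunk within each logical segment of the site rather than across one flat list.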
Split sitemaps by content type, not just by number
Chunking purely by URL count (sitemap-1.xml, sitemap-2.xml…) wastes the diagnostic value sitemaps provide. Instead, split by logical groupings so you can isolate indexing issues by section.
A practical structure for a large ecommerce site
- /sitemaps/products-active.xml: in-stock product pages
- /sitemaps/products-archive.xml: discontinued but still indexable
- /sitemaps/categories.xml: taxonomy pages
- /sitemaps/brands.xml
- /sitemaps/blog.xml
- /sitemaps/static.xml: about, contact, policy pages
Then reference them all from a single sitemap-index.xml. When Search Console tells you only 62% of URLs in products-active.xml are indexed, you know exactly where to look.
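One possible way to generate that index file is sketched below. It assumes the segment files live under /sitemaps/ on the same host; build_sitemap_index and SEGMENTS are hypothetical names used only for illustration.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Segment file names taken from the example structure above.
SEGMENTS = [
    "products-active.xml",
    "products-archive.xml",
    "categories.xml",
    "brands.xml",
    "blog.xml",
    "static.xml",
]


def build_sitemap_index(base_url, segments):
    """Return a <sitemapindex> document (str) listing one <sitemap> per segment."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in segments:
        sitemap = ET.SubElement(index, "sitemap")
        # Assumption: every segment is served from <base_url>/sitemaps/.
        ET.SubElement(sitemap, "loc").text = f"{base_url}/sitemaps/{name}"
    return ET.tostring(index, encoding="unicode")


print(build_sitemap_index("https://example.com", SEGMENTS))
```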
Use lastmod accurately — or don't use it at all
Google's John Mueller has been explicit: if your <lastmod> values aren't trustworthy, Google ignores them. Worse, consistently inaccurate timestamps can make crawlers deprioritize your entire sitemap.
Get lastmod right
- Only update lastmod when meaningful content changes. A footer copyright update or a sidebar widget swap is not a content change.
- Use ISO 8601 format with timezone: 2024-11-14T09:30:00+00:00.
- Don't set lastmod to the current timestamp on every regeneration. This is the most common mistake: your CMS rebuilds the sitemap nightly and stamps everything as updated today.
- Persist the real last-edit timestamp in your database and read it during sitemap generation.
If you can't guarantee accuracy, omit lastmod entirely. Skip priority and changefreq too — Google has confirmed it ignores them.
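A minimal sketch of the "read the real timestamp, or omit the tag" rule might look like the following. format_lastmod is a hypothetical helper, and treating naive timestamps as UTC is an assumption about how your database stores them.

```python
from datetime import datetime, timezone


def format_lastmod(last_edit):
    """Return an ISO 8601 string with an explicit UTC offset, or None to omit <lastmod>."""
    if last_edit is None:
        # No trustworthy edit time: leave the tag out rather than invent one.
        return None
    if last_edit.tzinfo is None:
        # Assumption: naive timestamps from the database are stored in UTC.
        last_edit = last_edit.replace(tzinfo=timezone.utc)
    return last_edit.astimezone(timezone.utc).isoformat(timespec="seconds")


print(format_lastmod(datetime(2024, 11, 14, 9, 30)))  # 2024-11-14T09:30:00+00:00
```

The important design choice is that the function never falls back to the current time, which is what causes every URL to look freshly updated after a nightly rebuild.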
Only include URLs you actually want indexed
Your sitemap should represent your canonical, indexable URL set. Every URL in there should:
- Return a 200 status code
- Not be blocked by robots.txt
- Not have a noindex directive
- Be the canonical version (no parameter variants, no duplicates)
- Be reachable via internal links
A sitemap full of redirects, 404s, or noindexed URLs sends conflicting signals and burns crawl budget. Run regular audits with a tool like the AXOX Hub Sitemap Checker to catch broken URLs, redirect chains, and noindex pages that snuck in.
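If you already have crawl data for your own URLs, the inclusion checklist above can be expressed as a simple filter. The sketch below assumes a hypothetical PageRecord shape populated by your crawler or CMS; the field names are illustrative and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    # Hypothetical record; populate it from your own crawl or CMS data.
    url: str
    status: int
    canonical: str
    noindex: bool
    blocked_by_robots: bool
    inbound_internal_links: int


def exclusion_reasons(page):
    """Return the reasons a URL should be left out of the sitemap (empty list = include)."""
    reasons = []
    if page.status != 200:
        reasons.append(f"status {page.status}")
    if page.blocked_by_robots:
        reasons.append("blocked by robots.txt")
    if page.noindex:
        reasons.append("noindex")
    if page.canonical != page.url:
        reasons.append("not canonical")
    if page.inbound_internal_links == 0:
        reasons.append("no internal links")
    return reasons
```

Logging the reasons instead of silently dropping URLs makes it easier to see which templates or sections keep producing non-indexable pages.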
Compress and serve sitemaps efficiently
Gzip your sitemap files
Serve sitemaps as .xml.gz. A 40 MB sitemap typically compresses to 3–5 MB. The 50 MB uncompressed limit still applies, but transfer is faster and cheaper.
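A small sketch of that step using Python's standard gzip module is shown below. It checks the size before compressing, since the limit applies to the uncompressed document; gzip_sitemap and its max_bytes parameter are hypothetical names.

```python
import gzip

# The 50 MB limit (52,428,800 bytes) applies to the uncompressed document.
MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024


def gzip_sitemap(xml_bytes, max_bytes=MAX_UNCOMPRESSED_BYTES):
    """Compress a sitemap, refusing documents that exceed the uncompressed limit."""
    if len(xml_bytes) > max_bytes:
        raise ValueError(
            f"sitemap is {len(xml_bytes)} bytes uncompressed; limit is {max_bytes}"
        )
    return gzip.compress(xml_bytes)
```

Decompressing the result with gzip.decompress and comparing it to the input is a cheap integrity check to run before publishing the file.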
Set correct headers
- Content-Type: application/xml
- Content-Encoding: gzip (for .gz files)
- Avoid cache headers that prevent crawlers from seeing fresh versions
Host sitemaps on the same domain
Cross-domain sitemaps require verified ownership in Search Console. Keep them on the same host as the URLs they list.
Reference sitemaps in robots.txt and submit them
Both matter. robots.txt is for discovery by any crawler; Search Console submission gets you reporting.
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap-index.xml
Only list the index file in robots.txt — not every individual sitemap.
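To confirm that a robots.txt file exposes the index the way you expect, one option is Python's standard urllib.robotparser, whose site_maps() method is available from Python 3.8. The sketch below parses the example file inline rather than fetching it; the ROBOTS_TXT constant is just that example.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap-index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# The sitemap index itself must not be disallowed for crawlers.
index_is_fetchable = parser.can_fetch("*", "https://example.com/sitemap-index.xml")
print(index_is_fetchable)
```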
Handle dynamic and high-churn content
News and time-sensitive content
For news sites, maintain a separate News sitemap containing only articles published in the last 48 hours. Keep it to at most 1,000 URLs, which is the cap Google documents for a single News sitemap.
User-generated content and frequent updates
If you have URLs that change constantly (forums, listings, marketplaces):
- Use a dedicated sitemap for the most recently created/updated URLs
- Regenerate that single file on a short interval (every 15–60 minutes)
- Keep the bulk archive in separate sitemaps regenerated nightly
This pattern minimises compute cost while giving search engines a fresh, accurate signal for the URLs that matter most right now.
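The split between a frequently rebuilt "recent" file and a nightly archive can be sketched as a partition on the last-modified time. The 24-hour window below is an assumption chosen for illustration, and partition_by_recency is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def partition_by_recency(pages, now, window=timedelta(hours=24)):
    """Split (url, last_modified) pairs into recent and archive URL lists.

    Assumption: last_modified values are timezone-aware datetimes.
    """
    cutoff = now - window
    recent, archive = [], []
    for url, modified in pages:
        (recent if modified >= cutoff else archive).append(url)
    return recent, archive


now = datetime(2024, 11, 14, 12, 0, tzinfo=timezone.utc)
pages = [
    ("https://example.com/listing/1", now - timedelta(hours=1)),
    ("https://example.com/listing/2", now - timedelta(days=3)),
]
recent, archive = partition_by_recency(pages, now)
print(recent, archive)
```

Only the recent list needs to be rewritten on the short interval; the archive list feeds the nightly job.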
Validate before deploying
Malformed XML breaks everything downstream. Before pushing changes:
- Validate XML syntax — a single unescaped ampersand will invalidate the file
- Check all URLs resolve correctly — sample 100+ URLs and verify status codes
- Confirm encoding: UTF-8, with proper escaping for &, <, >, ', "
- Verify the index file points to URLs that actually exist
- Test gzip integrity if you're compressing
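Some of these checks can be automated with the standard library alone. The sketch below covers only well-formedness, the root element, URL count, and absolute URLs; it does not fetch anything or verify status codes. validate_sitemap and escape_loc are hypothetical helper names.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def escape_loc(url):
    """Escape the five XML special characters before writing a URL into <loc>."""
    return escape(url, {"'": "&apos;", '"': "&quot;"})


def validate_sitemap(xml_bytes, max_urls=50_000):
    """Return a list of problems found in a sitemap document (empty list = passed)."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        # An unescaped ampersand, for example, ends up here.
        return [f"malformed XML: {exc}"]
    problems = []
    if root.tag not in (SITEMAP_NS + "urlset", SITEMAP_NS + "sitemapindex"):
        problems.append(f"unexpected root element: {root.tag}")
    locs = [(el.text or "").strip() for el in root.iter(SITEMAP_NS + "loc")]
    if not locs:
        problems.append("no <loc> entries found")
    if len(locs) > max_urls:
        problems.append(f"{len(locs)} entries exceeds limit of {max_urls}")
    for loc in locs:
        if not loc.startswith(("http://", "https://")):
            problems.append(f"non-absolute URL: {loc!r}")
    return problems
```

Running this in the deployment pipeline and refusing to publish when the list is non-empty keeps a broken file from replacing a working one.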
Monitor indexing rate, not just submission count
Submitting 2 million URLs and getting 400,000 indexed isn't a sitemap problem in isolation — but the sitemap is where you'll spot the pattern. Track per-sitemap indexation ratios in Search Console weekly. If products-archive.xml sits at 12% indexed while products-active.xml is at 89%, that's a content quality signal, not a sitemap bug.
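Tracking the ratio per sitemap is simple arithmetic once you have submitted and indexed counts for each file. The sketch below assumes you have copied those counts from Search Console into a dictionary; indexation_report and the 50% floor are hypothetical choices for illustration.

```python
def indexation_report(stats, floor=0.5):
    """Map each sitemap name to (indexed ratio, needs_review flag).

    stats: {sitemap_name: (submitted, indexed)}; floor is an assumed review threshold.
    """
    report = {}
    for name, (submitted, indexed) in stats.items():
        ratio = indexed / submitted if submitted else 0.0
        report[name] = (round(ratio, 2), ratio < floor)
    return report


# Illustrative counts only, chosen to match the percentages mentioned above.
stats = {
    "products-archive.xml": (10_000, 1_200),
    "products-active.xml": (10_000, 8_900),
}
print(indexation_report(stats))
```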
Red flags to watch for
- Sudden increases in "Discovered - currently not indexed"
- Spikes in "Crawled - currently not indexed"
- Sitemap fetch errors or 404s on the sitemap URL itself
- Long delays between lastmod update and recrawl
Audit your current setup
If you haven't reviewed your sitemap structure in the last six months, you almost certainly have stale URLs, redirect chains, or noindexed pages cluttering the file. Run your sitemap index through the free AXOX Hub Sitemap Checker to surface broken URLs, status code issues, and structural problems before they cost you crawl budget.