SEO · May 21, 2026 · 5 min read

XML Sitemaps at Scale: Best Practices for Large Sites

Learn XML sitemap best practices for large sites: splitting, indexing, freshness signals, and validation tactics that keep millions of URLs crawlable.

Running a sitemap for a 50-page brochure site is trivial. Running one for an ecommerce catalog with 4 million SKUs, a news archive, or a marketplace with constant URL churn is a different problem entirely. At scale, sitemaps stop being a nice-to-have file and become the primary signal you give search engines about which URLs matter, when they changed, and how to prioritize crawl budget.

Below are the practices that actually move the needle when you're dealing with hundreds of thousands — or millions — of URLs.

Respect the hard limits, then plan well below them

Google and Bing both enforce:

  • 50,000 URLs per sitemap file
  • 50 MB uncompressed file size
  • 50,000 sitemaps per sitemap index

That gives you a theoretical ceiling of 2.5 billion URLs across one index. In practice, don't push these limits. Aim for 10,000–20,000 URLs per file. Smaller files parse faster, fail more gracefully, and make it easier to diagnose which segment of your site has indexing problems.

Why smaller files win

  • If one file returns a 500 or contains malformed XML, you lose 10,000 URLs from the crawl queue instead of 50,000.
  • Search Console reports indexing stats per sitemap. Smaller, logically grouped files give you sharper diagnostics.
  • Re-generating a 10k file on content change is cheaper than rebuilding a 50k file.
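
As a rough illustration of planning below the limit, here is a minimal Python sketch that splits a URL list into files of a chosen size. The 10,000 default is simply the low end of the range suggested above, and the function name is made up for this example.

```python
from typing import List

MAX_URLS_PER_FILE = 50_000     # hard limit per sitemap file
TARGET_URLS_PER_FILE = 10_000  # assumed working target, well below the limit


def chunk_urls(urls: List[str], size: int = TARGET_URLS_PER_FILE) -> List[List[str]]:
    """Split `urls` into consecutive lists of at most `size` entries."""
    if not 0 < size <= MAX_URLS_PER_FILE:
        raise ValueError(f"size must be between 1 and {MAX_URLS_PER_FILE}")
    return [urls[start:start + size] for start in range(0, len(urls), size)]


if __name__ == "__main__":
    sample = [f"https://example.com/p/{i}" for i in range(25_000)]
    print([len(chunk) for chunk in chunk_urls(sample)])  # [10000, 10000, 5000]
```

In a real pipeline you would chunk within each logical group rather than across the whole site, so a file never mixes sections.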

Split sitemaps by content type, not just by number

Chunking purely by URL count (sitemap-1.xml, sitemap-2.xml…) wastes the diagnostic value sitemaps provide. Instead, split by logical groupings so you can isolate indexing issues by section.

A practical structure for a large ecommerce site

  • /sitemaps/products-active.xml — in-stock product pages
  • /sitemaps/products-archive.xml — discontinued but still indexable
  • /sitemaps/categories.xml — taxonomy pages
  • /sitemaps/brands.xml
  • /sitemaps/blog.xml
  • /sitemaps/static.xml — about, contact, policy pages

Then reference them all from a single sitemap-index.xml. When Search Console tells you only 62% of URLs in products-active.xml are indexed, you know exactly where to look.
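
A sitemap index is just a small XML file listing the child sitemaps. The sketch below builds one with Python's standard library; the function name and the example.com URLs are placeholders, and a production generator would typically also write the result to disk and add per-file lastmod values if they are reliable.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls: Iterable[str]) -> bytes:
    """Return a UTF-8 sitemap index document listing each child sitemap."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        # ElementTree escapes &, < and > in text for us.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    groups = ["products-active", "products-archive", "categories",
              "brands", "blog", "static"]
    index = build_sitemap_index(
        f"https://example.com/sitemaps/{name}.xml" for name in groups
    )
    print(index.decode("utf-8"))
```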

Use lastmod accurately — or don't use it at all

Google's John Mueller has been explicit: if your <lastmod> values aren't trustworthy, Google ignores them. Worse, consistently inaccurate timestamps can make crawlers deprioritize your entire sitemap.

Get lastmod right

  1. Only update lastmod when meaningful content changes. A footer copyright update or a sidebar widget swap is not a content change.
  2. Use W3C Datetime format (a profile of ISO 8601) with an explicit timezone offset, for example 2024-11-14T09:30:00+00:00.
  3. Don't set lastmod to the current timestamp on every regeneration. This is the most common mistake — your CMS rebuilds the sitemap nightly and stamps everything as updated today.
  4. Persist the real last-edit timestamp in your database and read it during sitemap generation.

If you can't guarantee accuracy, omit lastmod entirely. Skip priority and changefreq too — Google has confirmed it ignores them.
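
Here is one way the "persist the real timestamp, or omit lastmod" rule might look in code. This is a sketch only: it assumes your database hands back a timezone-aware datetime (or None when the edit time is unknown), and the helper names are invented for the example.

```python
from datetime import datetime, timezone
from typing import Optional
import xml.etree.ElementTree as ET


def format_lastmod(edited_at: datetime) -> str:
    """Format an aware datetime in UTC with seconds precision and an offset."""
    if edited_at.tzinfo is None:
        raise ValueError("edited_at must be timezone-aware")
    return edited_at.astimezone(timezone.utc).isoformat(timespec="seconds")


def url_entry(loc: str, edited_at: Optional[datetime]) -> ET.Element:
    """Build a <url> element; leave <lastmod> out when no trusted value exists."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    if edited_at is not None:
        ET.SubElement(url, "lastmod").text = format_lastmod(edited_at)
    return url


if __name__ == "__main__":
    stored = datetime(2024, 11, 14, 9, 30, tzinfo=timezone.utc)
    print(ET.tostring(url_entry("https://example.com/p/1", stored), encoding="unicode"))
    print(ET.tostring(url_entry("https://example.com/p/2", None), encoding="unicode"))
```

Note that the generator never calls datetime.now(): the only source of lastmod is the stored edit time.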

Only include URLs you actually want indexed

Your sitemap should represent your canonical, indexable URL set. Every URL in there should:

  • Return a 200 status code
  • Not be blocked by robots.txt
  • Not have a noindex directive
  • Be the canonical version (no parameter variants, no duplicates)
  • Be reachable via internal links

A sitemap full of redirects, 404s, or noindexed URLs sends conflicting signals and burns crawl budget. Run regular audits with a tool like the AXOX Hub Sitemap Checker to catch broken URLs, redirect chains, and noindex pages that snuck in.
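
The checklist above translates fairly directly into a filter you can run before generation. The sketch below assumes you already have crawl data for each page; the PageRecord fields are hypothetical and would map onto whatever your crawler or CMS exports. The robots.txt check uses Python's built-in urllib.robotparser.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


@dataclass
class PageRecord:
    """One crawled page. Field names are hypothetical, not from a real tool."""
    url: str
    status: int      # HTTP status code
    canonical: str   # canonical URL declared by the page
    noindex: bool    # True if a noindex directive is present
    inlinks: int     # number of internal links pointing at the page


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def eligible_urls(pages: Iterable[PageRecord], robots: RobotFileParser) -> List[str]:
    """Keep only URLs that pass every check in the list above."""
    return [
        p.url for p in pages
        if p.status == 200
        and robots.can_fetch("*", p.url)
        and not p.noindex
        and p.canonical == p.url
        and p.inlinks > 0
    ]
```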

Compress and serve sitemaps efficiently

Gzip your sitemap files

Serve sitemaps as .xml.gz. A 40 MB sitemap typically compresses to 3–5 MB. The 50 MB uncompressed limit still applies, but transfer is faster and cheaper.
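
Because the limit applies to the uncompressed size, it makes sense to check it before compressing. A minimal sketch, with a made-up function name and a `limit` parameter added only so the check is easy to test:

```python
import gzip

MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024  # 50 MB, measured before compression


def compress_sitemap(xml_bytes: bytes, limit: int = MAX_UNCOMPRESSED_BYTES) -> bytes:
    """Gzip a sitemap, refusing input that exceeds the uncompressed limit."""
    if len(xml_bytes) > limit:
        raise ValueError(
            f"sitemap is {len(xml_bytes)} bytes uncompressed; limit is {limit}"
        )
    return gzip.compress(xml_bytes)


if __name__ == "__main__":
    xml = b"<urlset>" + b"<url><loc>https://example.com/p</loc></url>" * 1000 + b"</urlset>"
    packed = compress_sitemap(xml)
    print(len(xml), "bytes ->", len(packed), "bytes")
```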

Set correct headers

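  • Content-Type: application/xml for plain .xml files
  • For pre-compressed .xml.gz files, either Content-Type: application/gzip with no Content-Encoding, or Content-Type: application/xml with Content-Encoding: gzip. Make sure the server doesn't gzip the file a second time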
  • Avoid cache headers that prevent crawlers from seeing fresh versions

Host sitemaps on the same domain

Cross-domain sitemaps require verified ownership in Search Console. Keep them on the same host as the URLs they list.

Reference sitemaps in robots.txt and submit them

Both matter. robots.txt is for discovery by any crawler; Search Console submission gets you reporting.

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-index.xml

Only list the index file in robots.txt — not every individual sitemap.
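
A quick deploy-time check can confirm that robots.txt declares the index and nothing else. The sketch below uses RobotFileParser.site_maps() from Python's standard library (available since Python 3.8); the wrapper function is illustrative only.

```python
from typing import List
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-index.xml
"""


def declared_sitemaps(robots_txt: str) -> List[str]:
    """Return the Sitemap: URLs declared in robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(declared_sitemaps(ROBOTS_TXT))
```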

Handle dynamic and high-churn content

News and time-sensitive content

For news sites, maintain a separate News sitemap containing only articles published in the last 48 hours. Keep it under 1,000 URLs.
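
Selecting what goes into the News sitemap is a window-and-cap operation. Here is a small sketch of it, assuming each article is available as a (url, published_at) pair with timezone-aware timestamps; the names are invented for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

NEWS_WINDOW = timedelta(hours=48)
NEWS_MAX_URLS = 1_000


def select_news_urls(
    articles: Iterable[Tuple[str, datetime]], now: datetime
) -> List[str]:
    """Return URLs published within the window, newest first, capped at the max."""
    cutoff = now - NEWS_WINDOW
    recent = [(published, url) for url, published in articles
              if cutoff <= published <= now]
    recent.sort(reverse=True)  # newest first
    return [url for _, url in recent[:NEWS_MAX_URLS]]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        ("https://example.com/news/fresh", now - timedelta(hours=3)),
        ("https://example.com/news/stale", now - timedelta(days=5)),
    ]
    print(select_news_urls(sample, now))
```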

User-generated content and frequent updates

If you have URLs that change constantly (forums, listings, marketplaces):

  • Use a dedicated sitemap for the most recently created/updated URLs
  • Regenerate that single file on a short interval (every 15–60 minutes)
  • Keep the bulk archive in separate sitemaps regenerated nightly

This pattern minimizes compute cost while giving search engines a fresh, accurate signal for the URLs that matter most right now.
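
One simple way to implement the two-speed schedule is to give each sitemap file its own rebuild interval and let a frequent job decide which files are due. The file names and intervals below are assumptions chosen for illustration, not a recommendation for any particular site.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical schedule: one hot file rebuilt often, archives rebuilt nightly.
INTERVALS: Dict[str, timedelta] = {
    "recent.xml": timedelta(minutes=15),
    "archive-1.xml": timedelta(hours=24),
    "archive-2.xml": timedelta(hours=24),
}


def due_for_rebuild(
    last_built: Dict[str, datetime],
    now: datetime,
    intervals: Dict[str, timedelta] = INTERVALS,
) -> List[str]:
    """Return sitemap names never built, or built longer ago than their interval."""
    return sorted(
        name for name, interval in intervals.items()
        if name not in last_built or now - last_built[name] >= interval
    )
```

A scheduler running every few minutes can call due_for_rebuild and regenerate only the files it returns, so the archive files are not rebuilt on every run.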

Validate before deploying

Malformed XML breaks everything downstream. Before pushing changes:

  1. Validate XML syntax — a single unescaped ampersand will invalidate the file
  2. Check all URLs resolve correctly — sample 100+ URLs and verify status codes
  3. Confirm encoding — UTF-8, with proper escaping for &, <, >, ', "
  4. Verify the index file points to URLs that actually exist
  5. Test gzip integrity if you're compressing
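
Several of these checks can be automated in a few lines. The following is a rough pre-deploy validator, assuming the file is a regular urlset sitemap held in memory as bytes; it covers gzip integrity, XML well-formedness, the root element, and the URL count, but it does not fetch any URLs or validate against the full schema.

```python
import gzip
import xml.etree.ElementTree as ET
import zlib
from typing import List

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000


def validate_sitemap(data: bytes) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        try:
            data = gzip.decompress(data)
        except (OSError, EOFError, zlib.error) as exc:
            return [f"gzip: {exc}"]
    try:
        root = ET.fromstring(data)  # fails on e.g. an unescaped ampersand
    except ET.ParseError as exc:
        return [f"xml: {exc}"]

    problems: List[str] = []
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append(f"unexpected root element: {root.tag}")
    locs = [el.text or "" for el in root.iter(f"{{{SITEMAP_NS}}}loc")]
    if len(locs) > MAX_URLS:
        problems.append(f"too many URLs: {len(locs)}")
    for loc in locs:
        if not loc.strip().startswith(("http://", "https://")):
            problems.append(f"not an absolute URL: {loc!r}")
    return problems
```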

Monitor indexing rate, not just submission count

Submitting 2 million URLs and getting 400,000 indexed isn't a sitemap problem in isolation — but the sitemap is where you'll spot the pattern. Track per-sitemap indexation ratios in Search Console weekly. If products-archive.xml sits at 12% indexed while products-active.xml is at 89%, that's a content quality signal, not a sitemap bug.
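
If you export submitted and indexed counts per sitemap each week, the ratio tracking is a few lines of code. In this sketch the input shape and the 50% alert threshold are assumptions for illustration; pick a threshold that fits your own baseline.

```python
from typing import Dict, List, Tuple


def indexation_ratios(stats: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map sitemap name -> indexed / submitted (0.0 when nothing was submitted)."""
    return {
        name: (indexed / submitted if submitted else 0.0)
        for name, (submitted, indexed) in stats.items()
    }


def below_threshold(
    stats: Dict[str, Tuple[int, int]], threshold: float = 0.5
) -> List[str]:
    """Return sitemap names whose indexation ratio falls under the threshold."""
    ratios = indexation_ratios(stats)
    return sorted(name for name, ratio in ratios.items() if ratio < threshold)


if __name__ == "__main__":
    weekly = {
        "products-active.xml": (1_000, 890),   # (submitted, indexed)
        "products-archive.xml": (1_000, 120),
    }
    print(below_threshold(weekly))  # ['products-archive.xml']
```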

Red flags to watch for

  • Sudden increases in "Discovered - currently not indexed"
  • Spikes in "Crawled - currently not indexed"
  • Sitemap fetch errors or 404s on the sitemap URL itself
  • Long delays between lastmod update and recrawl

Audit your current setup

If you haven't reviewed your sitemap structure in the last six months, you likely have stale URLs, redirect chains, or noindexed pages cluttering the file. Run your sitemap index through the free AXOX Hub Sitemap Checker to surface broken URLs, status code issues, and structural problems before they cost you crawl budget.
