Next.js Sitemap, Robots and Crawlability Debugging Checklist

In Brief
Debug crawlability by reading the signals together: generated routes, sitemap.xml, robots.txt, status codes, redirects, canonicals, and noindex rules. A sitemap is only an output, not proof that the app can render the route or that crawlers are being invited to keep it in the index.
Crawlability problems in Next.js rarely announce themselves politely.
The site deploys. Pages load. The sitemap exists. The robots file exists. Everyone moves on.
Then Search Console starts showing missing URLs, excluded pages, duplicate canonicals, "Discovered - currently not indexed", or routes that were never meant to be public. The technical problem may be one line in robots.txt, a stale sitemap generator, an environment-specific host, a missing dynamic route, or a canonical helper that quietly normalises the wrong URL.
The fix starts with one principle: discovery surfaces must be generated from the same truth as the site itself. If routes, sitemap entries, canonical URLs, and robots rules are maintained separately, drift is inevitable.
Confirm Which Pages Should Exist
Before debugging the sitemap, establish the intended route set.
For a Next.js site, that may come from:
- page files
- App Router segments
- generated static params
- CMS entries
- product or category data
- redirect maps
- service registries
- valid‑route JSON files
- legacy URL inventories
The sitemap should not be treated as the source of truth. It is an output. If the sitemap includes a page the app cannot render, or omits a page the app depends on, the problem is upstream.
For content‑heavy sites, I like having a generated route list that can be compared with the sitemap, navigation, redirects, and important internal links. The older article on generating `urllist.txt` from a sitemap is a small example of why plain route inventories are useful when debugging.
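One way to make that comparison concrete is a small diff between the route inventory and the sitemap. The sketch below is illustrative only: `toPath` and `diffRoutes` are hypothetical helper names, and it assumes the site treats `/about` and `/about/` as the same route.

```typescript
// Result of comparing a generated route inventory with sitemap URLs.
export interface RouteDiff {
  missingFromSitemap: string[];
  unknownInSitemap: string[];
}

// Reduce an absolute URL or a path to a comparable path.
// Assumption: trailing-slash variants are the same route.
export function toPath(urlOrPath: string): string {
  // The base is only used when the input is a relative path.
  const { pathname } = new URL(urlOrPath, "https://placeholder.invalid");
  const trimmed = pathname.replace(/\/+$/, "");
  return trimmed === "" ? "/" : trimmed;
}

export function diffRoutes(routes: string[], sitemapUrls: string[]): RouteDiff {
  const routeSet = new Set(routes.map(toPath));
  const sitemapSet = new Set(sitemapUrls.map(toPath));
  return {
    // Routes the app generates that the sitemap never submits.
    missingFromSitemap: Array.from(routeSet)
      .filter((path) => !sitemapSet.has(path))
      .sort(),
    // Sitemap entries with no matching generated route.
    unknownInSitemap: Array.from(sitemapSet)
      .filter((path) => !routeSet.has(path))
      .sort(),
  };
}
```

Either list being non-empty is a prompt to look upstream at the route data, not to patch the sitemap by hand.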
Check Sitemap Inclusion Against Canonical URLs
Every URL in the sitemap should be canonical, indexable, and useful.
Check for:
- non‑production hosts
- preview or branch URLs
- HTTP URLs on an HTTPS site
- URLs that redirect
- URLs that return 404
- URLs with noindex
- filtered or parameterised URLs without a strategy
- duplicate trailing slash variants
- category or pagination URLs that should not be submitted
- old routes that should now redirect
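Several of these checks need nothing more than the list of sitemap URLs. The following is a rough sketch, assuming a single canonical origin; `auditSitemapUrls` is a hypothetical helper, and status codes, redirects, and noindex still need live requests that it does not make.

```typescript
export interface SitemapIssue {
  url: string;
  problem: string;
}

// Static checks on sitemap URLs: protocol, host, query strings, and
// duplicate trailing-slash variants. No network requests are made.
export function auditSitemapUrls(urls: string[], canonicalOrigin: string): SitemapIssue[] {
  const issues: SitemapIssue[] = [];
  const expectedHost = new URL(canonicalOrigin).host;
  // Maps a slash-insensitive path to the first pathname seen for it.
  const seenPaths = new Map<string, string>();

  for (const raw of urls) {
    const parsed = new URL(raw);
    if (parsed.protocol !== "https:") {
      issues.push({ url: raw, problem: "not https" });
    }
    if (parsed.host !== expectedHost) {
      issues.push({ url: raw, problem: "unexpected host" });
    }
    if (parsed.search !== "") {
      issues.push({ url: raw, problem: "has query string" });
    }
    const key = parsed.pathname.replace(/\/+$/, "") || "/";
    const previous = seenPaths.get(key);
    if (previous !== undefined && previous !== parsed.pathname) {
      issues.push({ url: raw, problem: `trailing slash variant of ${previous}` });
    } else {
      seenPaths.set(key, parsed.pathname);
    }
  }
  return issues;
}
```

A query string is not always wrong, so that check is best read as "confirm there is a strategy" rather than as a failure.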
Google's sitemap documentation is clear that sitemaps help search engines discover URLs. They do not override poor canonical decisions, robots rules, or weak page quality.
For Next.js, sitemap generation often changes between Pages Router and App Router projects. App Router supports metadata files such as sitemap.ts and robots.ts, documented in the Next.js pages for sitemap.xml and robots.txt. Those tools are useful, but they still depend on correct data.
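A minimal sketch of that dependency, assuming a shared route inventory feeds the sitemap. The `PublicRoute` shape and `buildSitemapEntries` are hypothetical, and `SitemapEntry` is a local stand-in for the type Next.js provides, declared here so the example runs without the framework. In an App Router project a function like this would typically be called from the default export of the sitemap file.

```typescript
// Local stand-in for the entry shape an App Router sitemap file returns.
export interface SitemapEntry {
  url: string;
  lastModified?: Date;
}

// Hypothetical record in the shared route inventory.
export interface PublicRoute {
  path: string;
  indexable: boolean;
  updatedAt?: string; // ISO 8601 timestamp, if the data source has one
}

// Build absolute sitemap entries from the same route data the app renders from.
export function buildSitemapEntries(routes: PublicRoute[], siteUrl: string): SitemapEntry[] {
  return routes
    .filter((route) => route.indexable)
    .map((route) => {
      const entry: SitemapEntry = { url: new URL(route.path, siteUrl).toString() };
      if (route.updatedAt) {
        entry.lastModified = new Date(route.updatedAt);
      }
      return entry;
    });
}
```

Passing `siteUrl` in explicitly keeps the host decision in one place, which makes a preview or staging host in the output easier to trace.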
Check robots.txt for Environment Drift
Robots files often break because staging and production need different behaviour.
Common failures include:
- production accidentally disallowed
- staging accidentally allowed
- sitemap URL pointing at the wrong host
- old disallow rules blocking new routes
- rules copied from a previous platform
- wildcard rules blocking assets needed for rendering
- robots rules relied on for pages that should use noindex
Remember the distinction: robots.txt controls crawling, not indexing. If a URL is blocked from crawling but linked elsewhere, search engines may still know about it and may show it with limited information. A blocked crawler also never fetches the page, so it cannot see a noindex directive on it. If a page must not appear in search, use the correct noindex mechanism on a crawlable response.
Google's robots.txt guide is worth checking before using robots rules as a blunt exclusion tool.
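One way to keep the two environments deliberate is to derive the rules from a single explicit flag. This is a sketch under assumptions: `buildRobots` and `renderRobotsTxt` are hypothetical helpers, `RobotsConfig` is a simplified local shape, and how `isProduction` is determined (an environment variable, for instance) is left to the project.

```typescript
export interface RobotsRule {
  userAgent: string;
  allow?: string[];
  disallow?: string[];
}

export interface RobotsConfig {
  rules: RobotsRule[];
  sitemap?: string;
}

// Production allows crawling and advertises the sitemap on the production host.
// Every other environment disallows crawling and advertises nothing.
export function buildRobots(isProduction: boolean, siteUrl: string): RobotsConfig {
  if (!isProduction) {
    return { rules: [{ userAgent: "*", disallow: ["/"] }] };
  }
  return {
    rules: [{ userAgent: "*", allow: ["/"] }],
    sitemap: new URL("/sitemap.xml", siteUrl).toString(),
  };
}

// Render the config as robots.txt text so it can be diffed in a test.
export function renderRobotsTxt(config: RobotsConfig): string {
  const lines: string[] = [];
  for (const rule of config.rules) {
    lines.push(`User-agent: ${rule.userAgent}`);
    for (const path of rule.allow ?? []) lines.push(`Allow: ${path}`);
    for (const path of rule.disallow ?? []) lines.push(`Disallow: ${path}`);
    lines.push("");
  }
  if (config.sitemap) lines.push(`Sitemap: ${config.sitemap}`);
  return lines.join("\n");
}
```

The non-production branch only discourages crawling. It does not guarantee that a staging URL stays out of the index, so it is not a substitute for noindex or access control.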
Validate Dynamic Route Generation
Most serious Next.js sitemap problems involve dynamic routes.
The template exists, but the data source changes. A CMS entry is unpublished. A slug has been normalised differently. A route is excluded from static generation. A category page exists in navigation but never appears in the sitemap. A product page is generated, but only if it appears in a particular API response during build.
Check:
- how dynamic slugs are fetched
- whether draft or unpublished records are filtered correctly
- whether locale, market, or brand dimensions are included
- whether empty categories produce URLs
- whether deleted CMS entries are removed
- whether route generation fails silently
- whether pagination URLs are complete
- whether the sitemap and app use the same slug normalisation
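A few of these checks can be enforced where the routes are generated. The sketch below assumes a hypothetical `CmsEntry` shape with a `status` field and a locale-prefixed URL scheme. The point is that the app and the sitemap both call the same `normaliseSlug`, and that an empty or colliding result fails loudly instead of silently.

```typescript
// Hypothetical CMS record; real field names will differ per CMS.
export interface CmsEntry {
  slug: string;
  status: "published" | "draft" | "archived";
  locale: string;
}

// Single slug normaliser shared by route generation and sitemap generation.
export function normaliseSlug(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build public paths for one content type, e.g. basePath "/blog".
export function publicPaths(entries: CmsEntry[], basePath: string): string[] {
  const paths = entries
    .filter((entry) => entry.status === "published")
    .map((entry) => `/${entry.locale}${basePath}/${normaliseSlug(entry.slug)}`);

  // An empty result usually means a failed fetch, not an empty site.
  if (paths.length === 0) {
    throw new Error(`No published entries for ${basePath}`);
  }
  // Two different CMS slugs can normalise to the same path.
  const unique = Array.from(new Set(paths));
  if (unique.length !== paths.length) {
    throw new Error(`Slug collision after normalisation under ${basePath}`);
  }
  return unique.sort();
}
```

Whether an empty content type should really stop the build is a judgment call. Some sites would rather log a warning and continue.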
This matters for migration work too. If a site has just moved from Gatsby, WordPress, Shopify, or a React SPA, the old URL estate needs to be compared with the new generated route set. The article "Traffic dropped after a replatform" covers the wider recovery process.
Check Canonical Helpers and Redirects Together
A page can appear crawlable while still sending conflicting signals.
For each affected URL, compare:
- requested URL
- final URL after redirects
- canonical URL
- Open Graph URL
- sitemap URL
- internal links pointing to it
- alternate language URLs
These should form a coherent story. If the sitemap submits /services/example/, internal links point at /services/example, redirects add a slash, and the canonical uses a preview host, search engines have to resolve avoidable noise.
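That scenario can be expressed as a small comparison. The sketch below assumes the signals have already been collected for one URL; `UrlSignals` and `findSignalConflicts` are hypothetical names. A canonical that differs from the final URL is flagged for review, not treated as automatically wrong, since it may point at a deliberate representative.

```typescript
// Signals gathered for one URL by whatever crawler or script is in use.
export interface UrlSignals {
  requested: string; // URL as linked internally or requested directly
  finalUrl: string; // URL after following redirects
  canonical: string | null; // rel=canonical value, if present
  sitemapUrl: string | null; // matching sitemap entry, if present
}

export function findSignalConflicts(signals: UrlSignals): string[] {
  const conflicts: string[] = [];
  if (signals.requested !== signals.finalUrl) {
    conflicts.push(`redirects to ${signals.finalUrl}`);
  }
  if (signals.canonical === null) {
    conflicts.push("no canonical");
  } else if (signals.canonical !== signals.finalUrl) {
    conflicts.push(`canonical points to ${signals.canonical}`);
  }
  if (signals.sitemapUrl !== null && signals.sitemapUrl !== signals.finalUrl) {
    conflicts.push(`sitemap submits ${signals.sitemapUrl}`);
  }
  return conflicts;
}

// The scenario from the text: links omit the slash, the canonical uses a preview host.
const conflicts = findSignalConflicts({
  requested: "https://www.example.com/services/example",
  finalUrl: "https://www.example.com/services/example/",
  canonical: "https://preview.example.dev/services/example/",
  sitemapUrl: "https://www.example.com/services/example/",
});
// conflicts contains two entries: the redirect and the preview-host canonical.
```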
Next.js makes redirects straightforward in many cases, but centralising redirects does not guarantee quality. Redirects need intent mapping. A retired URL should point to the best replacement, not just a convenient parent page.
Inspect Rendered Pages, Not Just Files
Do not stop after opening /sitemap.xml and /robots.txt.
Open representative rendered pages and check:
- title
- meta description
- canonical
- robots meta
- h1
- primary content
- internal links
- structured data
- pagination links
- status code
- response headers
This is especially important for pages that depend on CMS data or revalidation. A sitemap can contain a URL that looked valid at build time, but the rendered page may now return thin content, an error state, stale content, or a canonical to something else.
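For a quick spot check, a few of these signals can be pulled out of the rendered HTML. The sketch below is deliberately crude: `extractHeadSignals` is a hypothetical helper, it uses regular expressions where a real HTML parser would be more robust, and it assumes quoted attribute values. It is meant for fetched HTML from representative pages, not as a general-purpose parser.

```typescript
export interface HeadSignals {
  title: string | null;
  canonical: string | null;
  robots: string | null;
  h1Count: number;
}

// Read one quoted attribute value out of a single tag string.
function readAttribute(tag: string | undefined, name: string): string | null {
  if (tag === undefined) return null;
  const match = new RegExp(`\\b${name}=["']([^"']*)["']`, "i").exec(tag);
  return match?.[1] ?? null;
}

export function extractHeadSignals(html: string): HeadSignals {
  const titleMatch = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  const canonicalTag = /<link\b[^>]*rel=["']canonical["'][^>]*>/i.exec(html);
  const robotsTag = /<meta\b[^>]*name=["']robots["'][^>]*>/i.exec(html);
  return {
    title: titleMatch?.[1]?.trim() ?? null,
    canonical: readAttribute(canonicalTag?.[0], "href"),
    robots: readAttribute(robotsTag?.[0], "content"),
    // Counts opening <h1> tags only; zero or several is worth a look.
    h1Count: (html.match(/<h1\b/gi) ?? []).length,
  };
}
```

Status codes and response headers are not visible in the HTML, so they need to be captured from the response itself.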
If CMS publishing or revalidation is involved, the related problem is covered in the article "Next.js App Router cache tags and revalidation".
Use a Repeatable Crawlability Checklist
For every important template, check:
- Is the URL intended to be public?
- Does it return 200?
- Does it redirect?
- Is it in the sitemap?
- Is it blocked by robots.txt?
- Does it have noindex?
- Does the canonical point to itself or the correct representative?
- Is it linked internally?
- Does rendered HTML contain useful content?
- Does structured data match visible content?
- Is the URL present in generated route data?
- Is it present or absent in Search Console for the reason expected?
That list is deliberately plain. Most crawlability fixes are not clever. They are the result of comparing outputs that should agree and finding where they drifted.
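Part of the checklist can be written down as data and evaluated per URL. This is an illustrative sketch, not a complete audit: the `CrawlCheck` fields and `evaluateCheck` are hypothetical, the values are assumed to come from a crawl or manual inspection, and content quality, structured data, and Search Console state are left out.

```typescript
// One row of checklist answers for a single URL.
export interface CrawlCheck {
  url: string;
  intendedPublic: boolean;
  status: number;
  inSitemap: boolean;
  blockedByRobots: boolean;
  noindex: boolean;
  linkedInternally: boolean;
}

// Return the ways this URL's signals disagree with its intent.
export function evaluateCheck(check: CrawlCheck): string[] {
  const findings: string[] = [];
  if (check.intendedPublic) {
    if (check.status !== 200) findings.push(`returns ${check.status}`);
    if (!check.inSitemap) findings.push("missing from sitemap");
    if (check.blockedByRobots) findings.push("blocked by robots.txt");
    if (check.noindex) findings.push("has noindex");
    if (!check.linkedInternally) findings.push("no internal links");
  } else {
    if (check.inSitemap) findings.push("in sitemap but not intended to be public");
    // A crawler that is blocked never fetches the page, so it cannot read noindex.
    if (check.blockedByRobots && check.noindex) {
      findings.push("noindex is hidden behind a robots.txt block");
    }
    if (check.status === 200 && !check.blockedByRobots && !check.noindex) {
      findings.push("crawlable and indexable but not intended to be public");
    }
  }
  return findings;
}
```

Running this across every important template gives a table of disagreements to work through, which is usually more useful than a single pass or fail.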
Wrapping Up
In a Next.js site, crawlability depends on several generated and rendered surfaces agreeing with each other.
The sitemap, robots file, route generation, redirects, canonicals, internal links, and rendered page output all have to describe the same site. If they are maintained as separate bits of plumbing, they will eventually disagree.
The best fix is not a bigger sitemap. It is a route and discovery system that can be checked, regenerated, and trusted.
Key Takeaways
- Treat the sitemap as an output, not the source of truth.
- Compare generated routes, sitemap URLs, canonicals, redirects, and internal links.
- Keep production and staging robots rules separate and deliberate.
- Pay special attention to dynamic routes from CMSes and e‑commerce data.
- Inspect rendered page output before deciding the sitemap is correct.
- Use crawl evidence to find drift between the route set and discovery surfaces.