Automatically Generate Text Sitemaps in Gatsby

Hero image for Automatically Generate Text Sitemaps in Gatsby. Image by Vishnu Mohanan.

A few years back, I wrote an article about generating a urllist.txt file from sitemap.xml. This used PHP to deconstruct the XML file and return a simple list of URLs on the fly. It worked well for me at the time because my hosting situation meant I could still use PHP alongside statically-generated sites.

Now that I'm trying to do away with that by moving to a more decentralised and static host, I'm running into the same issue I had back then: gatsby-plugin-sitemap only generates a sitemap in the XML format (as sitemap.xml), and I still want to include text-formatted sitemaps like urllist.txt.


What are Text Formatted Sitemaps?

A text-based sitemap is a simple, plain text file that lists all the URLs of a website. Each URL is presented on a new line, creating a clear and concise map of the site's structure. These sitemaps are typically named sitemap.txt or urllist.txt; you can see mine here: sitemap.txt and urllist.txt. The eagle-eyed amongst you will recognise that they are identical, because they are: it's just two different file naming conventions from two different eras of the web.
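
To make the format concrete, here is a minimal sketch of how a list of page paths becomes the body of a text sitemap. The helper name toTextSitemap and the example.com URL are placeholders I've made up for illustration; they aren't part of Gatsby or any plugin.

```javascript
// Hypothetical helper: build the body of a text sitemap,
// with one absolute URL per line.
function toTextSitemap(siteUrl, paths) {
  return paths.map((pagePath) => `${siteUrl}${pagePath}`).join('\n');
}

// Placeholder values, purely for illustration.
const body = toTextSitemap('https://example.com', ['/', '/blog/', '/about/']);
console.log(body);
```

The output is nothing more than three lines, one URL on each, which is all a text sitemap is.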

Back in the early days, text-formatted sitemaps were used to submit your website content specifically for the Yahoo search engine. However, when Yahoo also announced support for XML sitemaps, the use of the text-formatted versions dwindled. Nevertheless, they still play a role in SEO: search engines will still accept them alongside their XML cousins, and due to their simplicity, some argue that search engine crawlers find it easier to discover and index your pages, which could improve your site's visibility and SEO performance.


Generating sitemap.txt and urllist.txt

Stepping away from the idea of using a backend technology to generate these sitemaps, I've instead been focusing on what Gatsby can offer.

After all, during the build process, in gatsby-node.js we have access to:

  • Node.js, which means we have access to fs to create and manipulate files;
  • GraphQL, which means we can run queries;
  • allSitePage which returns a list of all page paths within your site.

So it should be simple, right?

Implementing onPostBuild

It took me a little bit of trial and error, but it turns out the ideal place to generate our text-formatted sitemaps is during onPostBuild. This runs after the build process is complete, which means we can be sure all the pages have been created before generating the sitemaps.

Get the Data We Need

There are two pieces of data we need:

  • A list of all paths, which comes from allSitePage;
  • The site URL, which comes from siteMetadata, assuming you've set that up in your Gatsby config.
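
If you haven't set up siteMetadata yet, a minimal sketch of the relevant part of a gatsby-config.js file might look like the following. The URL is a placeholder, and a real config will almost certainly contain more than this (plugins and so on).

```javascript
// gatsby-config.js (sketch): only the part this article relies on.
module.exports = {
  siteMetadata: {
    // Placeholder: replace with your site's canonical URL.
    siteUrl: 'https://example.com',
  },
};
```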

My query looks like this:

{
  allSitePage {
    nodes {
      path
    }
  }
  site {
    siteMetadata {
      siteUrl
    }
  }
}
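
The data that comes back mirrors the shape of the query. As a rough sketch, using a mocked result with placeholder values rather than a real Gatsby build, pulling out the two pieces we need looks something like this:

```javascript
// Mocked result in the shape the query returns (all values are placeholders).
const result = {
  data: {
    allSitePage: { nodes: [{ path: '/' }, { path: '/blog/' }] },
    site: { siteMetadata: { siteUrl: 'https://example.com' } },
  },
};

// Pull out the two pieces of data we need.
const { nodes } = result.data.allSitePage;
const { siteUrl } = result.data.site.siteMetadata;
const paths = nodes.map((node) => node.path);

console.log(siteUrl, paths);
```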

Filter Out Error Pages

This step is optional, depending on whether you've set up error pages. I have a 404 page, which is served when a user attempts to access a link that doesn't exist, and I don't want it included in the sitemaps. We can filter it out very simply:

nodes.filter(node => !node.path.includes('404'))

Obviously, your application may have other URLs that you don't want to include in your sitemap too, in which case you can simply extend the filter to remove them.
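
As a sketch of what extending the filter could look like, the snippet below checks each path against a list of fragments to exclude. The entries in excludedFragments (other than '404') and the sample nodes are hypothetical, so swap in whatever suits your own site.

```javascript
// Hypothetical exclusion list: adjust the entries to suit your own site.
const excludedFragments = ['404', '/drafts/', '/thank-you/'];

// Keep a node only if its path contains none of the excluded fragments.
const isIncluded = (node) =>
  !excludedFragments.some((fragment) => node.path.includes(fragment));

// Placeholder nodes in the shape returned by allSitePage.
const nodes = [
  { path: '/' },
  { path: '/404/' },
  { path: '/404.html' },
  { path: '/drafts/new-post/' },
  { path: '/blog/' },
];

const filtered = nodes.filter(isIncluded);
console.log(filtered.map((node) => node.path)); // [ '/', '/blog/' ]
```

Bear in mind that includes matches anywhere in the path, so a fragment like '404' would also exclude a legitimate page whose path happens to contain those characters.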

Use Node.js fs to create the file

With Gatsby, anything in the public folder at build-time will be placed in the root of the live site. We don't want to use the static folder at this point (the contents of which get copied over into public) because the copy has already happened, and for many, the static folder is version-controlled whilst the public folder should not be.

So, we can use writeFileSync to output our new files. Something like this:

const filePath = path.join(__dirname, 'public', 'sitemap.txt');
fs.writeFileSync(filePath, sitemapContent);
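
Since both files share the same content, one option is to loop over the file names. The sketch below is a standalone demo rather than Gatsby code: it writes into a temporary directory (my assumption, so it can be run anywhere), whereas in gatsby-node you would point it at the public folder instead.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// For this demo we write into a temporary directory; in gatsby-node you
// would use the site's public folder instead.
const outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'sitemap-demo-'));

// Placeholder content: one URL per line.
const sitemapContent = 'https://example.com/\nhttps://example.com/blog/';

// Write the same content under both file names.
for (const fileName of ['sitemap.txt', 'urllist.txt']) {
  fs.writeFileSync(path.join(outputDir, fileName), sitemapContent, 'utf8');
}

console.log(fs.readdirSync(outputDir).sort());
```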

The Full Solution

Piecing it all together, what you have is a block of code that sits at the bottom of your gatsby-node.ts file, and looks like this:

// These two requires belong at the top of the file, if they aren't there already.
const fs = require('fs');
const path = require('path');

exports.onPostBuild = async ({ graphql, reporter }) => {
  try {
    // Fetch every page path, plus the site URL from siteMetadata.
    const result = await graphql(`
      {
        allSitePage {
          nodes {
            path
          }
        }
        site {
          siteMetadata {
            siteUrl
          }
        }
      }
    `);

    if (result.errors) {
      reporter.panic('Error in the GraphQL query for sitemap: ', result.errors);
      return;
    }

    // Build one absolute URL per line.
    const sitemapContent = result.data.allSitePage.nodes
      .filter((node) => !node.path.includes('404')) // Filtering out paths that contain '404'
      .map((node) => `${result.data.site.siteMetadata.siteUrl}${node.path}`)
      .join('\n');

    const sitemapPath = path.join(__dirname, 'public', 'sitemap.txt');
    const urllistPath = path.join(__dirname, 'public', 'urllist.txt');

    // Write to sitemap.txt
    fs.writeFileSync(sitemapPath, sitemapContent);
    reporter.info('Successfully created sitemap.txt.');

    // Write to urllist.txt
    fs.writeFileSync(urllistPath, sitemapContent);
    reporter.info('Successfully created urllist.txt.');
  } catch (error) {
    reporter.panic('Failed to create sitemap and urllist files: ', error);
  }
};

What you will find at the end of your build is two new txt files in the root of your site (urllist.txt and sitemap.txt), both containing a complete list of every URL on your site.
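
One thing to watch: the code above joins siteUrl and each path directly, so it assumes siteUrl has no trailing slash. If yours does, you'd end up with a double slash in every URL. A small helper along these lines would guard against that; joinUrl is a name I've made up for this sketch, not something Gatsby provides.

```javascript
// Hypothetical helper: join siteUrl and a page path without doubling the slash.
function joinUrl(siteUrl, pagePath) {
  const base = siteUrl.replace(/\/+$/, ''); // drop any trailing slashes
  const suffix = pagePath.startsWith('/') ? pagePath : `/${pagePath}`;
  return `${base}${suffix}`;
}

console.log(joinUrl('https://example.com/', '/blog/')); // https://example.com/blog/
console.log(joinUrl('https://example.com', '/blog/')); // https://example.com/blog/
```

You could then swap the template string in the map step for a call to joinUrl.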

And that's it!


Categories:

  1. Development
  2. Gatsby
  3. Guides
  4. Search Engine Optimisation
  5. Sitemaps