Block Bad Bots Using .htaccess

Hero image for 'Block Bad Bots Using .htaccess.' Image by Eleventh Wave.

It is astonishing to think that 2012 was the year in which traffic generated by automated bots and spiders on the internet outgrew human traffic. Since then, bots and spiders have only grown in prevalence and sophistication.

Whilst many bots are 'good' bots that do things like look through your site to calculate your position in search engine indexes or help you find the cheapest deal on your car insurance, many others are far less benign and more hostile: probing your website for security flaws or systems they can exploit.

If you have ever had a WordPress website hacked (and really, who hasn't?), it would inevitably have been a bot, rather than an actual human mortal, doing the hacking.

If you spot bots in your server logs that are behaving oddly, perhaps trying to access different variations of admin or wp URLs on your site in the hope of finding a login, or simply bogging your website down with irrelevant traffic, there are steps you can take to banish them.

The important thing to bear in mind here is that these solutions rely on the UserAgent string. A really nefarious bot developer will likely change this often, so you may lose the battle with an insistent android, but for the most part these steps will help.

Photograph of a green toy robot by Phillip Glickman on Unsplash.

robots.txt

The first step is inside robots.txt. This is a file in the root of your domain which politely tells bots whether you would really rather they gave your website a skip. It's a little like those 'no soliciting' stickers your grandparents have on their front door, and about as useful.

Nevertheless, here is an example of how to block a bot called 'CuteStat' in your robots.txt file:

User-agent: CuteStat
Disallow: /

This simply says "If you are a bot called CuteStat, you are not allowed anywhere beneath the root of this domain". This is actually a genuine example: CuteStat are incredibly annoying, but at least they do pay attention to this disallow...

You can see my robots.txt file here if you are interested. As I mentioned though, these rules are really only useful against 'good' robots that actually pay attention to the robots.txt standard. Like those religious doorstep visitors whom you simply cannot stop from interrupting your tea time, bad robots, and bad robot developers, will simply ignore it.
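If you want to double-check what your robots.txt rules actually permit before deploying them, Python's standard library can parse the file locally. This is just a sketch to verify the CuteStat example above; the rules string mirrors the snippet from earlier.

```python
# Sketch: verifying robots.txt rules locally with Python's stdlib robotparser.
from urllib.robotparser import RobotFileParser

# The same rules as the example above, fed in as lines rather than fetched.
rules = """
User-agent: CuteStat
Disallow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

# CuteStat is disallowed everywhere beneath the root...
print(parser.can_fetch("CuteStat", "/any/page"))   # False
# ...while other, well-behaved crawlers are unaffected.
print(parser.can_fetch("Googlebot", "/any/page"))  # True
```

Of course, this only tells you what a compliant crawler would do; as noted, a bad bot never asks in the first place.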

Photograph of a blue plastic robot by Rock'n Roll Monkey on Unsplash.

.htaccess

Your second option is a little more technical, and will only work if you're on an Apache server with access to .htaccess. Here, you can query a visitor's UserAgent string and determine whether or not to allow them access to the site:

SetEnvIfNoCase User-Agent .*ahrefsbot.* bad_bot
SetEnvIfNoCase User-Agent .*dotbot.* bad_bot
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>

Here, we set a variable called bad_bot whenever the UserAgent contains one of the specified strings, and then allow everybody to access the site unless that variable is set.

I've left a couple of bot examples in the code block above, but you could create a list as long as necessary simply by replicating the first line and changing the bot name, for example:

SetEnvIfNoCase User-Agent .*Go-http-client.* bad_bot
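Before adding a pattern to .htaccess, it can be worth sanity-checking which UserAgent strings it would actually flag. The following Python sketch mimics the case-insensitive regex matching that SetEnvIfNoCase performs; the pattern list and function name are illustrative, not part of Apache itself.

```python
# Sketch: a stand-in for the .htaccess logic above, to test which
# User-Agent strings the patterns would mark as bad_bot.
import re

# Illustrative blocklist mirroring the SetEnvIfNoCase lines above.
BAD_BOT_PATTERNS = [r".*ahrefsbot.*", r".*dotbot.*", r".*Go-http-client.*"]

def is_bad_bot(user_agent: str) -> bool:
    """Mimic SetEnvIfNoCase: case-insensitive regex match on the UA string."""
    return any(re.search(p, user_agent, re.IGNORECASE)
               for p in BAD_BOT_PATTERNS)

print(is_bad_bot("Mozilla/5.0 (compatible; AhrefsBot/7.0)"))  # True
print(is_bad_bot("Mozilla/5.0 (Windows NT 10.0) Firefox"))    # False
```

Running a handful of real UserAgent strings from your logs through a check like this helps catch overly broad patterns before they block legitimate visitors.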

One quick word of warning: the more directives you have in your .htaccess file, the more CPU and memory your server needs to serve your website. Consequently, this can in theory slow your website down and increase its TTFB (time to first byte).

That said, at this scale the difference is imperceptible; just don't put hundreds and hundreds in there!


Categories:

  1. Copyright
  2. Development
  3. htaccess
  4. Search Engine Optimisation