Blocking bad bots on your Drupal site

Submitted by George Moses on Sat, 2010-11-13 09:31

Sometimes you find your Drupal website being hit by rogue robots that index your website while ignoring the robots.txt guidelines. A standard Drupal installation ships with a predefined robots.txt file which defines which paths robots should skip and the pace at which you want your website to be indexed. Some robots are just site copiers, page grabbers or home projects which will hit your website hard by indexing all pages in a way that resembles a (D)DoS attack. With a sufficiently large Drupal website or community you can easily find your website becoming unresponsive for some time.

Blocking bots by User Agent string

Fortunately, there is an easy way to block those robots. Blocking bad robots can be done by Apache before Drupal is even invoked, by examining the User Agent string that accompanies each HTTP request. You can use the following configuration at the Apache virtual host level or in an .htaccess file:

<IfModule mod_setenvif.c>
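  # mark requests whose User Agent string matches a known bad robot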
  BrowserMatch "Mozilla/4\.78 \(TuringOS; Turing Machine; 0\.0\)" isnastyagent
  BrowserMatch "Turing" isnastyagent
  BrowserMatch "Wget" isnastyagent
  BrowserMatch "SiteSucker" isnastyagent
  BrowserMatch "Mozilla/4\.5 \(compatible; HTTrack 3\.0x; Windows 98\)" isnastyagent
  BrowserMatch "HTTrack" isnastyagent
  BrowserMatch "Indy Library" isnastyagent
  BrowserMatch "grub-client" isnastyagent
  BrowserMatch "Vagabondo" isnastyagent
  BrowserMatch "NaverBot" isnastyagent
  BrowserMatch "MSIECrawler" isnastyagent
  BrowserMatch "Stumbler" isnastyagent
  BrowserMatch "Tcl http client" isnastyagent
  BrowserMatch "Zao" isnastyagent
  BrowserMatch "curl" isnastyagent
  BrowserMatch "Ocelli" isnastyagent
  BrowserMatch "WebReaper" isnastyagent
  BrowserMatch "WebCopier" isnastyagent
  BrowserMatch "GallileoDWS" isnastyagent
  BrowserMatch "QweeryBot" isnastyagent
  BrowserMatch "Tasapspider" isnastyagent
  BrowserMatch "DepSpid" isnastyagent
  BrowserMatch "ShopWiki" isnastyagent
  BrowserMatch "P.Arthur" isnastyagent
  BrowserMatch "Nutch" isnastyagent
  BrowserMatch "TwengaBot" isnastyagent
  SetEnvIf Remote_Addr ^10\.11\.12\.13$ leecher
</IfModule>

<Directory /srv/www/vhosts/mo6.nl/httpdocs>
  AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css text/x-js
  Order Allow,Deny
  Allow from all
  Deny from env=isnastyagent env=leecher
</Directory>

User Agent strings can easily be changed by robots or may even be empty, so there is also a need to block rogue robots by IP address or IP range. The example above lets you block IP addresses as well. Drupal can also block IP addresses, but it does so inefficiently: the request still has to bootstrap Drupal before it is rejected, while Apache refuses it before any PHP runs.

Furthermore, I have added a filter to compress static text (HTML, RSS/XML feeds and CSS files), which speeds up your site by reducing the amount of data sent from the webserver to the client. This requires the mod_deflate module to be enabled in Apache. Drupal 6 and Drupal 7 offer an option to compress output, but Apache handles this more efficiently and also covers CSS and JS files. Note that if you use Apache compression you will have to disable Drupal's compression options in admin/settings/performance.
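If you want the configuration to stay valid on servers where mod_deflate is not loaded, the filter can be wrapped in an <IfModule> test. The MIME types below are only a suggestion and include application/javascript for aggregated JS files; adjust the list to whatever your site actually serves:

<IfModule mod_deflate.c>
  # compress common text formats before they leave the webserver
  AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css text/x-js application/javascript
</IfModule>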

Please adjust the above configuration to your own needs before deploying it to your Drupal site.

Blocking bots by IP address

You can also block access to your Drupal site from specific IP addresses by listing them in the "Deny from" clause in the example above. I prefer to define the addresses with SetEnvIf so I can centralize the "bad robots and IP addresses" configuration and share it among multiple Drupal installations.
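As a sketch of that approach, the shared rules can live in a single file which every virtual host includes. The file name and the addresses below are only examples; replace them with your own list:

# /etc/apache2/conf.d/bad-bots.conf (example location)
<IfModule mod_setenvif.c>
  # a single offending address
  SetEnvIf Remote_Addr ^10\.11\.12\.13$ leecher
  # an entire 10.11.14.0/24 range
  SetEnvIf Remote_Addr ^10\.11\.14\. leecher
</IfModule>

Each virtual host configuration then only needs an Include of this file plus the "Deny from env=..." lines shown earlier:

Include /etc/apache2/conf.d/bad-bots.conf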