Kill All Serial Scrapists

Yesterday, I saw this site was returning HTTP 503 (Service Temporarily Unavailable) and I couldn’t access my WordPress blog. Whut?

I logged into my shared hosting shell, looked at the server logs, and saw I was being hammered by some scraper bot. So of course Apache’s mod_security was throttling my site and returning 503’s to everything.

So, what to do? Edit the .htaccess to tell the bot to fuck off.

# scraper bots
#"Scrapy/2.11.2 (+https://scrapy.org)"
RewriteCond %{HTTP_USER_AGENT} ^Scrapy [NC]
RewriteRule . - [F,L]

Essentially, if the user agent starts with “Scrapy” ([NC] = no case sensitivity), tell it that it’s Forbidden (HTTP 403) and stop processing rules.
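If (when) more bots turn up, the same pattern extends with [OR]-chained conditions. A sketch — the extra user agent strings here are made-up placeholders; substitute whatever actually shows up in your logs:

# scraper bots -- hypothetical examples, use what your own logs show
RewriteCond %{HTTP_USER_AGENT} ^Scrapy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SomeOtherBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} YetAnotherScraper [NC]
RewriteRule . - [F,L]

Note the [OR] flag on every condition except the last; without it, Apache AND’s the conditions together and nothing would ever match.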

That’s the first bot I’ve had to block with mod_rewrite, but I get the sinking feeling it won’t be the last. At least it’s honest enough to announce a unique user agent string instead of saying it’s Mozilla or some shit.

Nobody obeys robots.txt anymore. Nobody.
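For what it’s worth, Scrapy does have a ROBOTSTXT_OBEY setting, and a well-behaved crawler would back off from something like this in robots.txt — but it’s purely advisory, which is exactly the problem:

User-agent: Scrapy
Disallow: /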

I fucking hate AI. Assholes Incorporated. But AI didn’t do this; humans did this to the Web. We did it. Humans. We did it to each other. We did it to ourselves. Humans.

Published by Shawn

He's just this guy, you know?