Yesterday, I saw this site was erroring out with HTTP 503 (Service Temporarily Unavailable) and I couldn’t access my WordPress blog. Whut?
I logged into my shared hosting shell, looked at the server logs, and saw I was being hammered by some scraper bot. So of course Apache’s mod_security was throttling my site and returning 503s to everything.
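Spotting the culprit in the logs is a one-liner, by the way. A sketch, assuming the usual combined log format (the user agent is the sixth "-delimited field) and a log path that will differ per host:

# count requests per user agent, busiest first
awk -F'"' '{print $6}' ~/logs/access_log | sort | uniq -c | sort -rn | head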
So, what to do? Edit the .htaccess to tell the bot to fuck off.
# scraper bots
# the user agent string I saw in the logs:
#"Scrapy/2.11.2 (+https://scrapy.org)"
# any UA starting with "Scrapy" gets a 403, no exceptions
RewriteCond %{HTTP_USER_AGENT} ^Scrapy [NC]
RewriteRule . - [F,L]
Essentially, if the user agent starts with “Scrapy” (the [NC] flag makes the match case-insensitive), tell it it’s Forbidden (HTTP 403, the [F] flag) and end the request there (the [L] flag). (see a full example)
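For reference, here’s roughly what a fuller block looks like once you’re blocking more than one bot: chain the conditions with [OR] and keep a single RewriteRule at the end. A sketch, not gospel: “ExampleBot” and “AnotherScraper” are made-up placeholders, and I’m assuming RewriteEngine On is already set by WordPress’s own rules (Apache comments have to start the line, so no trailing comments).

# scraper bots: one rule, many offenders
# RewriteEngine On is assumed to be set already by the WordPress block
# chain conditions with [OR]; the last one has no [OR]
RewriteCond %{HTTP_USER_AGENT} ^Scrapy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExampleBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AnotherScraper [NC]
RewriteRule . - [F,L]

Anchoring on ^ keeps the match cheap and avoids catching legit browsers that merely mention the string somewhere in their UA.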
That’s the first bot I’ve had to block with mod_rewrite, but I get the sinking feeling it won’t be the last. At least it’s honest enough to announce a unique user agent string instead of saying it’s Mozilla or some shit.
Nobody obeys robots.txt anymore. Nobody.
I fucking hate AI. Assholes Incorporated. But AI didn’t do this; humans did this to the Web. We did it. Humans. We did it to each other. We did it to ourselves. Humans.