abusive crawler

Posted by joy


This is a stupid spam harvester hitting my site right now. For those of you Webmasters who read this blog, you can be a bit proactive and block the 208.66.195.0/28 range from your site right now. This abusive crawler, originating from a West Coast cogentco colo, has been hitting my site at a rate of one new request every four seconds for the past couple of minutes.
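
If you want to check whether a given visitor falls inside that range before adding a ban, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `is_blocked` is just an illustration, not something from my server setup, and how you actually enforce the ban (firewall, web server config, etc.) is up to you.

```python
import ipaddress

# The range named in this post. A /28 covers 16 addresses.
BLOCKED = ipaddress.ip_network("208.66.195.0/28")

def is_blocked(host: str) -> bool:
    """Return True if the host address falls inside the blocked range."""
    return ipaddress.ip_address(host) in BLOCKED

print(BLOCKED.num_addresses)        # 16
print(BLOCKED[0], BLOCKED[-1])      # 208.66.195.0 208.66.195.15
print(is_blocked("208.66.195.7"))   # True
print(is_blocked("208.66.195.16"))  # False
```

So the /28 covers 208.66.195.0 through 208.66.195.15 and nothing beyond that.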

Not only is it a bad crawler for its abusive crawling activities, but the user agent is spoofing Internet Explorer. Not surprisingly, Spamhaus has more here and here.

Host: 208.66.195.7
/2005/06/07/i-owe/
Http Code: 200 Date: Aug 27 11:33:07 Http Version: HTTP/1.1 Size in Bytes: 14175
Referer: -
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
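
For anyone who wants to spot this kind of pattern in their own logs, here is a rough sketch that counts requests per host per minute. It assumes you have already pulled (host, time) pairs out of your log format; the function names and the threshold of 15 per minute (one request every four seconds) are my own illustration. Only the first timestamp in the sample comes from the entry above; the rest are made up to mimic the four-second spacing.

```python
from collections import Counter
from datetime import datetime, timedelta

def requests_per_minute(entries):
    """Count requests per (host, 'HH:MM') bucket.

    entries: iterable of (host, 'HH:MM:SS') pairs.
    """
    counts = Counter()
    for host, stamp in entries:
        minute = datetime.strptime(stamp, "%H:%M:%S").strftime("%H:%M")
        counts[(host, minute)] += 1
    return counts

def hosts_over_limit(entries, limit):
    """Return the sorted hosts with at least `limit` requests in any one minute."""
    counts = requests_per_minute(entries)
    return sorted({host for (host, _), n in counts.items() if n >= limit})

# Hypothetical sample: 30 requests spaced four seconds apart.
# Only the starting time (11:33:07) is taken from the log entry above.
start = datetime.strptime("11:33:07", "%H:%M:%S")
sample = [
    ("208.66.195.7", (start + timedelta(seconds=4 * i)).strftime("%H:%M:%S"))
    for i in range(30)
]

print(hosts_over_limit(sample, limit=15))
```

Counting by calendar minute is crude (a burst can straddle two minutes), but it is enough to make a steady every-four-seconds crawler stand out.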

[tags]spam, spamming, crawlers, not cool [/tags]


5 Responses to “abusive crawler”

  1. Don McArthur Says:

    Thanks for sharing. In exchange, here’s my current /etc/hosts.deny sshd collection:

    http://www.mcarthurweb.com/archive.php?item=224

  2. Thad Says:

    If I may ask, what’s so abusive about once every 4 seconds? In all seriousness, that sounds like a request rate that a Commodore 64 ought to be able to keep up with.

    I mean, sure, spam harvesting = teh suck, but I’m referring only to the crawler’s behavior itself.

  3. joy Says:

    Thad, because too many requests per minute will take down the site. This is especially true of database-driven sites like this one.

    I don’t need a non-legit source of traffic to add to the load.

  4. Thad Says:

    OK. The only web content I’ve ever done has been static, so I don’t have any feel for how much load a database-driven site can create. I’ll believe it when you say an incessant request every 4 seconds could contribute to hosing a database-driven site.

    I mean, after all, there were web servers for 8-bit embedded processors back in ‘99, devices that are fast only when compared to geological measures. I looked into this for my company at that time. (In that application, the content would have been dynamically generated, though not from a database.) Given what an eternity 4 seconds is for even a lowly 1GHz processor these days….

    So, note to self: re-write spam-bot so it pauses for 5 seconds…. beta test on cleverhack to see if it passes muster.

    ;-)

  5. joy Says:

    Thad,

    Well, think about it this way: every server has finite resources. So, if you’re running a Web server and a database and a mail MTA, the server processes add up.

    In addition, I’m on a shared server, so I’m sure my hosting provider would rather have me ban a bad bot than have it not only stress my site, but potentially stress the rest of the server. Read here for what can happen in a case like that.
