Identify your (Web) spider

Posted by joy


Today Slashdot had a front-page article about how to create a Web spider on Linux. Aside from the fact that the subject matter just totally excites my inner nerd, I wanted to make a point, especially for those who would be writing a spider, bot, or crawler for fun and profit.

Here's a true story from not so long ago, when I was a Webmaster. One very busy morning, a crawler was hitting my site and annoying the heck out of me because it was a little too aggressive. I really wanted to ban it, but I saw a URL in its User-Agent string, so I tracked down the source. The bot's homepage at the time was little more than a parked page, and the site it was crawling for wasn't live yet. At that point I had a choice: I could just ban the bot and be done with it, or allow it to run and hope that the not-yet-live site would someday provide some benefit.
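Tracking it down, by the way, was nothing fancy: the User-Agent of every visitor is sitting right there in the access log. Here's a quick Python sketch of that kind of triage, assuming a standard combined-format Apache log; the log path and the top-ten cutoff are just illustrative:

    # Rough User-Agent triage: count hits per agent in a combined-format
    # access log. The log path is illustrative; adjust for your setup.
    import re
    from collections import Counter

    LOG_PATH = "/var/log/apache2/access.log"

    # In the combined log format, the User-Agent is the last quoted field.
    ua_at_end = re.compile(r'"([^"]*)"\s*$')

    counts = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            match = ua_at_end.search(line)
            if match:
                counts[match.group(1)] += 1

    # The noisiest agents float to the top; a URL in the string tells
    # you who is crawling and where to complain.
    for agent, hits in counts.most_common(10):
        print(f"{hits:7d}  {agent}")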

As it turns out, I held my nose and let the bot run. A few weeks later it did slow down and became friendlier, so I didn't mind it as much. The other part of this story is that the site in question went live in April 2006, and it did show the crawled content.

In other words: if your bot is legit, identify it, or risk being banned from the very sites you want to crawl. The shopwiki page wasn't much better than a parked page, but at least it gave me, as a Webmaster, some information to go on.
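And if you're the one writing the bot, announcing yourself costs exactly one HTTP header. Here's a minimal Python sketch using only the standard library; the bot name, the contact URL, and the pause between requests are placeholders, not a recommendation:

    # Identify your bot in every request it makes. The name and the
    # "+URL" (a page describing the bot and how to reach you) are
    # placeholders; use your own.
    import time
    import urllib.request

    USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot.html)"

    def fetch(url):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    page = fetch("http://example.com/")
    time.sleep(2)  # and be gentle: pause between requests

That "+URL" convention inside the User-Agent string is exactly what gave me something to go on; without it, the ban would have been a no-brainer.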

[tags]Spider, Bot, Crawler, User Agent, Webmaster, Web Admin[/tags]

