Archive for the 'Tech' Category

Entireweb Speedy Spider

Monday, November 20th, 2006

Speedy Spider is the crawler for the Sweden-based search engine Entireweb. I have not seen any referrers from Entireweb, but the Speedy Spider User Agent includes a URL to the informative Speedy Spider FAQ. In addition, Speedy Spider is quite polite for a bot, only crawling one or two pages per visit.

Host: 62.13.25.220
/robots.txt
Http Code: 200 Date: Nov 19 06:57:54 Http Version: HTTP/1.0 Size in Bytes: 6702
Referer: -
Agent: Speedy Spider (Entireweb; Beta/1.0; http://www.entireweb.com/about/search_tech/speedyspider/)
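For anyone who wants to sift through entries like the one above in bulk, here is a minimal Python sketch that turns one pasted entry into a dictionary. The function name `parse_entry` and the dictionary keys are my own inventions for illustration, and the layout it expects is simply the one the entries on this page happen to use, not any standard log format.

```python
import re

# Pattern for the "Http Code: ... Size in Bytes: ..." line as it appears in the
# entries on this page. Size may be "-" (for example on a 304 response).
STATUS_LINE = re.compile(
    r"Http Code:\s*(?P<code>\d{3})\s+"
    r"Date:\s*(?P<date>.+?)\s+"
    r"Http Version:\s*(?P<version>HTTP/\d\.\d)\s+"
    r"Size in\s*Bytes:\s*(?P<size>\d+|-)"
)


def parse_entry(text):
    """Turn one pasted log entry into a dict (hypothetical helper)."""
    entry = {}
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        match = STATUS_LINE.match(line)
        if match:
            entry["code"] = int(match.group("code"))
            entry["date"] = match.group("date")
            entry["http_version"] = match.group("version")
            size = match.group("size")
            entry["size"] = None if size == "-" else int(size)
        elif line.startswith("Host:"):
            # Keep only the address; some entries carry a parenthetical note.
            entry["host"] = line[len("Host:"):].split()[0]
        elif line.startswith("Referer:"):
            entry["referer"] = line[len("Referer:"):].strip()
        elif line.startswith("Agent:"):
            entry["agent"] = line[len("Agent:"):].strip()
        elif line.startswith("/"):
            # The requested path sits on its own line in these entries.
            entry["path"] = line.split()[0]
    return entry


SAMPLE = """
Host: 62.13.25.220
/robots.txt
Http Code: 200 Date: Nov 19 06:57:54 Http Version: HTTP/1.0 Size in Bytes: 6702
Referer: -
Agent: Speedy Spider (Entireweb; Beta/1.0; http://www.entireweb.com/about/search_tech/speedyspider/)
"""

entry = parse_entry(SAMPLE)
print(entry["host"], entry["path"], entry["code"], entry["size"])
```

A User Agent that wraps across several lines would need to be joined before parsing; this sketch only reads the first line of it.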

[tags]Search Engine, bot, crawler, spider, Entireweb, Speedy Spider [/tags]

BuzzLogic

Sunday, November 19th, 2006

Oh, wait, just when you thought we were done here with research services for the Google impaired, there is yet another one. BuzzLogic has been sending its crawler to this blog for the past few weeks and, by happenstance, currently has a private beta for companies.

What is different about BuzzLogic’s crawler though is that it’s revealing a referrer which really, honestly should not be seen in the Web logs. Also, their crawler does not have any identifying information in the User Agent field. Here’s an example.

The questionable referrer, which I am seeing via Sitemeter looks like this:

file:///data/thumbnailer/work/home-2006-11-17-17:21:16.438/2006-11-19-07:37:13.838-in.html

If I had to guess, BuzzLogic compiles the collected data into a static HTML file. I’ve seen the name of that static HTML file change day by day, with a different time/date stamp each time it hits my Web server.
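To spot this sort of leak automatically, a rough Python sketch follows. It flags any referrer that uses the file: scheme and pulls out the time/date stamps embedded in the path. The helper names are hypothetical, and the stamp pattern is only inferred from the single referrer shown above, so treat it as a guess at the format.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

# Stamp layout inferred from the one referrer above: YYYY-MM-DD-HH:MM:SS.mmm
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}")


def is_local_file_referrer(referrer):
    """True when the referrer points at a file on the visitor's own disk."""
    return urlparse(referrer).scheme == "file"


def embedded_timestamps(referrer):
    """Return every stamp found in the referrer as a datetime object."""
    return [
        datetime.strptime(stamp, "%Y-%m-%d-%H:%M:%S.%f")
        for stamp in STAMP.findall(referrer)
    ]


REFERRER = (
    "file:///data/thumbnailer/work/home-2006-11-17-17:21:16.438/"
    "2006-11-19-07:37:13.838-in.html"
)

if is_local_file_referrer(REFERRER):
    for stamp in embedded_timestamps(REFERRER):
        print(stamp.isoformat())
```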

This is what I see via my Web logs.
Host: 64.34.246.44 (I was only able to connect this to BuzzLogic through a traceroute of the IP address. The BuzzLogic Web server is hosted on what seems to be a completely different hosting provider.)
/wp-content/plugins/sociable/images/reddit.png (This crawler is hitting my image files for some reason.)
Http Code: 200 Date: Nov 19 10:37:14 Http Version: HTTP/1.1 Size in Bytes: 5943
Referer: -
Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.4 (like Gecko)

[tags]bot, crawler, scraper, buzzlogic, brand monitoring services, search engine challenged PR firms [/tags]

Webclipping

Saturday, November 18th, 2006

Yet another “monitoring the Web just for you, your brand and your PR department which can’t use Google” service: a bot from Webclipping was spied hitting my RSS feeds recently. Clicking around the Webclipping site (which doesn’t look all that hot in Firefox 2.0), the service seems to be similar to other monitoring outfits including brandimensions. (A side note, brandimensions, which I’ve written about before, charmingly has a Flash-only-I-don’t-really-want-to-be-found-by-search-engines homepage. As you can see, I’m not exactly a fan of a service which compiles my content and doesn’t allow me to see the context.)

Host: 38.144.36.19
/blog/blog.rdf
Http Code: 302 Date: Nov 18 18:08:51 Http Version: HTTP/1.1 Size in Bytes: 224
Referer: -
Agent: Mozilla/4.0 (Webclipping.com)

[tags]bot, crawler, scraper, webclipping, brand monitoring services, search engine challenged PR firms [/tags]

Hoopla

Saturday, November 18th, 2006

Hoopla, a service still in private beta, purports to be “the next big portal that renders other online news and blog services obsolete.” There’s also an accompanying blog, currently with only one entry.

I found Hoopla via my usual discovery method, my Web site logs where the crawler was hitting my RSS feeds. It appears they need to be crawling the Web and blogosphere for a bit in order to collect content for their portal. I can’t tell if the folks behind Hoopla are American and/or German though. It looks like the anonymous WHOIS registration is for an American company, and the language on the Hoopla parked page is definitely colloquial American English, but the crawler is from a German IP.

Host: 82.165.243.217
/blog/blog.rdf
Http Code: 302 Date: Nov 17 12:37:08 Http Version: HTTP/1.1 Size in Bytes: 224
Referer: -
Agent: http://www.hoopla.com/; tracker@hoopla.com - Hoopla.com honors robots.txt; Hoopla.com Tracker; Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050402 Firefox/1.0.2

[tags]hoopla, RSS, beta, crawler, portal, portal page, Web 2.0 [/tags]

Three RSS applications - FeedSweep, Fatcast, Wefeelfine

Thursday, November 16th, 2006

FeedSweep provides a way to display syndicated RSS content on your site. So, for example, you could show the cleverhack feed if you really, really wanted to. One gripe, I couldn’t find an Add Feed to FeedSweep button.

<script src="http://www.feedsweep.com/products/feedsweep/producer.aspx?feeds=http%3A%2F%2Fcleverhack%2Ecom%2Ffeed%2F"></script>

Fatcast is an online RSS reader similar to Bloglines. Again, I could not find an Add Feed to Fatcast button. However, the service does allow one to share their list of feeds if they wish, so of course I made one exclusively with cleverhack feeds.

My Fatcast feed list.

Last, but certainly not without emotion is WeFeelFine. It appears to be a research project which searches the blogosphere for words or phrases on how the blogosphere feels. The applet which displays the emotional information looks quite cool (click on the We Feel Fine link on the home page) and allows one to search via demographics - age, sex, location. (Warning, the applet seems to take a bit of memory in Firefox.) Anyone up for reading about how some emo twentysomethings from Seattle feel?

And how did I find WeFeelFine? Through their crawler, which didn’t have any identifying info in the User Agent, but a lookup on the IP provided the domain name.

Host: 128.177.11.193
/2006/11/10/rutgers-9-0/
Http Code: 200 Date: Nov 10 00:05:56 Http Version: HTTP/1.1 Size in Bytes: 16477
Referer: -
Agent: Mozilla/4.0
00000000000000000000000000000000000000
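The IP lookup mentioned above can be sketched in Python with a reverse DNS query. The standard library call `socket.gethostbyaddr` does the real work; the wrapper names, the stub resolver, and the example hostname below are all made up for illustration, and the stub is there only so the example runs without touching the network. Swap it out (or drop the `resolver` argument) to query a real address.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the reverse DNS hostname for ip, or None when none is found."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        return None
    return hostname


def registered_domain_guess(hostname):
    """Naive guess: keep the last two labels. Wrong for suffixes like .co.uk."""
    labels = hostname.rstrip(".").split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else hostname


def stub_resolver(ip):
    # Stand-in for socket.gethostbyaddr; the hostname is invented.
    return ("crawler.research.example.org", [], [ip])


host = reverse_lookup("192.0.2.10", resolver=stub_resolver)
print(host, "->", registered_domain_guess(host))
```

Not every address has a reverse DNS record, which is why the BuzzLogic entry further up needed a traceroute instead.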

[tags] RSS, RSS feeds, RSS Syndication, RSS Readers, RSS Research, FeedSweep, Fatcast, WeFeelFine [/tags]

Identify your (Web) spider

Wednesday, November 15th, 2006

Today Slashdot had a front page article about how to create a Web spider on Linux. Aside from the fact that the subject matter just totally excites my inner nerd, I wanted to make a point especially for those who would be writing a spider, bot or crawler for fun and profit.

I have this true story about how, not so long ago, I was a Webmaster. One very busy morning, I had a crawler that was hitting my site and it was annoying the heck out of me as it was a little too aggressive. I really wanted to ban it, but I saw a URL in the User Agent, and so I tracked down the source. The homepage for the bot at the time looked like this and the site it was crawling for wasn’t live yet. At that point, I had a choice - I could just ban the bot and be done with it or allow the bot to run and hope that the not yet live site would someday provide some benefit.

As it turns out, I held my nose and allowed the bot to run. In fact, a few weeks later, it did slow down and was friendlier - so I didn’t mind it as much. The other part to this story is that the site in question went live in April 2006 - and it did show the crawled content.

In other words, if your bot is legit, identify it or face the chance that you could be banned from the very sites you want to crawl. While the shopwiki example isn’t the best example of a parked page, at least I had some information to go on as a Webmaster.
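For the spider writers, here is a minimal Python sketch of what identifying your bot could look like: a User Agent string that carries a name, an info URL and a contact address, plus a robots.txt check before fetching. The bot name, URLs, email address and robots.txt rules are all placeholders I made up, and the User Agent layout is just one common convention, not a requirement. The robots.txt parsing uses the standard library's `urllib.robotparser`.

```python
from urllib import robotparser

# Placeholder identity for an imaginary bot.
BOT_NAME = "ExampleBot"
BOT_VERSION = "0.1"
INFO_URL = "http://www.example.com/bot.html"
CONTACT = "bot-admin@example.com"

# A Webmaster reading the logs can see who is crawling and where to complain.
USER_AGENT = f"{BOT_NAME}/{BOT_VERSION} (+{INFO_URL}; {CONTACT})"

# Made-up robots.txt, parsed from a string so the example needs no network.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
print(USER_AGENT)
print(rules.can_fetch(BOT_NAME, "http://www.example.org/feed/"))
print(rules.can_fetch(BOT_NAME, "http://www.example.org/private/page"))
```

A real crawler would send `USER_AGENT` in the User-Agent request header and fetch each site's own robots.txt rather than a hard-coded one.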

[tags]Spider, Bot, Crawler, User Agent, Webmaster, Web Admin [/tags]

Anti-Spam Update - KnujOn and Boxbe

Wednesday, November 15th, 2006

Some up and coming anti-spam services I’ve heard about recently…

On the technical side, KnujOn offers a method to help identify the folks sending fraudulent email. As an end user, all you do is register your email address with the service so the address can be whitelisted and then send your email to KnujOn. The father/son team behind KnujOn collects the data and invites Web hosts, Credit Card investigators and law enforcement to use the collected data during investigations. As a bonus, the service sends participants weekly progress reports as to how many fraudulent sites have been taken down.

More information about KnujOn can be found at Castle Cops and there’s even a Thunderbird extension for the service.

Aside from reporting spam, for inbox protection why not take a look at Boxbe? Boxbe is all about giving you a forwarding email address that you can share with others without the hassle of receiving spam. In order to reach your pre-existing email address inbox, advertisers have to pay you a price you specify. As a value proposition, Boxbe protects your inbox and pays you for your attention.

The service is not without drawbacks, however. For example, when you set up a profile on Boxbe, you’re asked to divulge interests and other profile data, which Boxbe anonymously shares with advertisers. In addition, there could be problems with senders. If the sender doesn’t want to work with the Boxbe system (either by refusing to complete the sender test or refusing to pay to send to you), the email in question would land in your quarantine.

[tags] Email, Email Deliverability, Spam, Email, KnujOn, Boxbe [/tags]

Jesus 2.0

Tuesday, November 14th, 2006

MyCCM bills itself as social networking for Christians and it bears the hallmarks of a true Web 2.0 space - RSS feeds and RSS search capabilities, blogs, podcasts, personal profiles and the ability to join a community.

I had a referrer from the site and I had to go and click around. The site design looks fine. But I have one question about the site: it seems that you can see a great part of it without needing to log in. I was able to click around to each section of the site - myRSS, Blogs, Tags, Search, Groups, Community - and see the section pages in addition to searching for profiles. To me, it appears that this site allows way more unfettered logged-out access than a MySpace, Facebook or LinkedIn, and some of those profiles looked young, even though the registration process doesn’t allow birthdays later than 1993.

[tags]Web 2.0, MyCCM, online communities [/tags]

I am Tablet Worthy

Monday, November 13th, 2006

Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Tablet PC 1.7; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)

[tags]User Agent, Hardware, Tablet PC [/tags]

ZapTXT

Sunday, November 12th, 2006

I still have hope for the killer munged RSS app. What do I mean by munged? I mean RSS that has been somehow used or manipulated to do something else…

With this idea in mind, I think that ZapTXT holds some interesting possibilities. ZapTXT is a service which will monitor RSS feeds for you and send keyword search alerts to your email, IM or phone. While there’s both Google Alerts and Yahoo Alerts, along with some other services that do RSS to Email or RSS to IM or RSS to Phone, I’m hoping that ZapTXT can capitalize on its promise and continue to innovate.
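To make the keyword alert idea concrete, here is a small Python sketch of the matching step: parse an RSS 2.0 feed and report the items that mention any watched keyword. This is only my guess at the general technique, not how ZapTXT actually works. The sample feed, the function name `keyword_alerts` and the alert fields are invented, and the delivery to email, IM or phone is left out entirely.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample so the example runs without fetching anything.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>New podcast by phone service</title>
      <link>http://www.example.com/1</link>
      <description>Dial a number to hear a podcast.</description>
    </item>
    <item>
      <title>Crawler spotted in the logs</title>
      <link>http://www.example.com/2</link>
      <description>Another bot hit the RSS feed today.</description>
    </item>
  </channel>
</rss>
"""


def keyword_alerts(feed_xml, keywords):
    """Return one alert per feed item whose title or description has a keyword."""
    root = ET.fromstring(feed_xml)
    wanted = [keyword.lower() for keyword in keywords]
    alerts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        text = (title + " " + description).lower()
        # Plain substring match, so "bot" would also match "robot".
        hits = [keyword for keyword in wanted if keyword in text]
        if hits:
            alerts.append({"title": title, "link": link, "keywords": hits})
    return alerts


for alert in keyword_alerts(SAMPLE_FEED, ["podcast", "crawler"]):
    print(alert["keywords"], alert["title"], alert["link"])
```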

As for look and feel, the ZapTXT site is extremely well done. I especially like the hidden div sign up process on the front page. How clever is that? Going through the site, the look and feel remains (for an example, take a look at the widgets page). Markup is XHTML transitional with some very clever uses of CSS. The only feedback I have for the site, really, is the fact that their images need alt tags and maybe, just maybe I would put a little more text in the page footer.

Good job ZapTXT folks, and after I post this I should be able to see how fast your service works.

Host: 209.126.131.181
/feed/
Http Code: 200 Date: Nov 12 21:53:05 Http Version: HTTP/1.1 Size in Bytes: 30833
Referer: -
Agent: ZapTXT bot; http://zaptxt.com; support@zaptxt-inc.com

[tags] RSS, RSS search, RSS Alerts, RSS to Email, RSS to Phone, RSS to IM, ZapTXT, Web 2.0 [/tags]

Podcast via cellphone - Podlinez and Fonpods

Sunday, November 12th, 2006

TechCrunch covered two new podcast via cellphone services - Podlinez and Fonpods and out of sheer curiosity, I just had to investigate how such services would be executed. What’s interesting is the difference in the technology running each site.

Podlinez, the service which provides a phone number for each individual podcast (the cleverhack podcast number is +1 (818) 688-2726), is built on a LAMP infrastructure and the HTML markup is XHTML strict. As you can see, the site design is very simple (there’s a front page and podcast directory pages) and as the service grows, I would hope the design would become a little more sophisticated. There is no associated blog, FAQ or any other extraneous pages. There are no description meta tags and no descriptive title tag either. As you can see, the User Agent is perl based - and a URL to the Podlinez site would be useful.

Host: 72.232.61.250
/category/podcasts/feed/
Http Code: 200 Date: Nov 12 11:52:18 Http Version: HTTP/1.1 Size in Bytes: 34556
Referer: -
Agent: CheezIt/0.1 libwww-perl/5.805

On the other hand, there’s Fonpods, which I think I’ve actually seen a few months ago. To use Fonpods, you create an account on the Fonpods site, choose the podcasts you want to listen to and then dial into the central (712) 432-3030 number (For reference, here are the cleverhack podcast codes).

As for technology, Fonpods uses the ASP .Net platform, and while they are using XHTML Transitional markup, their markup is commented. As a plus, the site looks pretty good (with the exceptions of the javascript button on the front page and the misaligned search button) in Firefox - it looks like most of the pages have text within images - which is fine for layout but you lose something in terms of SEO. If I were them, I would delete the extra meta tags in their page mark up…and add a meta description tag. Since Fonpods is dependent on one main phone number, I’d make sure that phone number was everywhere - for example in the page title tags, on the page banner alt tags, in page copy and on the page footer.

Host: 67.134.137.164
/category/podcasts/feed/
Http Code: 200 Date: Nov 12 12:12:04 Http Version: HTTP/1.1 Size in Bytes: 34556
Referer: -
Agent: Fonpods

[tags] Podcast, Podcasting, Podcasts on cellphones, Podlinez, Fonpods, Web 2.0 [/tags]

ThePort Network

Saturday, November 11th, 2006

ThePort Network, a company based in Atlanta, Georgia offers a turn-key hosted Web 2.0 platform including a Web-based Newsreader, Desktop Newsreader, Blog Publishing, Social Networking Tools and RSS Desktop Alerts.

They say they have a few corporate clients, including a certain Philadelphia football team. At this writing, I can’t say anything about their Web 2.0 tools, since a demo isn’t readily available on their site - which isn’t very Web 2.0, is it?

The most ironic thing about ThePort though, is how I found the company. Their feed aggregator user agent appeared in my logs, so of course I had to take a look to see what they were about. I went to their corporate site front page, only to discover that they have no copy on their very important front page and instead are using Flash to mock up a tag cloud. Aside from the usability issues, that’s some terrible SEO. For a start, get some copy on the front page. (And instead of Flash, if they need to mock up a tag cloud they could have used an image and used image maps with some ALT tags for the links.)

I sure as heck hope they’re not trying to pull traffic from search engines.

Host: 63.111.12.32
/feed/
Http Code: 304 Date: Nov 11 11:25:48 Http Version: HTTP/1.1 Size in Bytes: -
Referer: -
Agent: ThePort Web/1.0; subscribers 1

[tags] ThePort, ThePort Network, Web 2.0, Web 2.0 tools, Corporate Web 2.0 tools, turn-key, SEO [/tags]