February 23, 2005

Polluting the Web-Structure Information Commons...

David Sifry's Technorati is living in difficult times:

Sifry's Alerts: For the past few days, we've experienced a bit of a slowdown in the timeliness of our data. To give you an idea, our normal median time between being pinged by a blog and having the data available in our index is under 7 minutes. Recently it's been running around several hours.

Unfortunately, a good deal of this is attributable to the increase of spam that's coming at us. The growing number of link farms creates a much greater load on our spiders. Even worse, when spam makes it into our databases, we need to pause our spiders while we take explicit steps to purge the spam. This is a time-consuming and complicated process. Also, some of our ancillary systems, like correctly updated link counts, have taken a hit as we work through these issues. I'm sorry if your blog counts haven't been updating recently, we're working on it diligently.

For an economist, this is absolutely fascinating. There is an underlying resource here: the decision of a human being that such-and-such a webpage was worth linking to is a valuable and useful piece of information. There are businesses (Google and Technorati) that grow up to harvest, repackage, and make money off this information. And then there are the people--comment spammers, link spammers, trackback spammers, link-farm creators, et cetera--who see an economic edge from setting up internet robots to pollute the underlying web-structure information stream.

The standard economist's way of dealing with all problems is to advise (i) setting up a system of property rights so that (ii) someone controls each resource of value and (iii) that someone has the incentives to properly husband the resource and ensure it finds its way to its most valuable use. But when the valuable commodity is the indicator of human attention that is the underlying structure of the web, it is not at all clear how this is to be accomplished.

I really wish I knew more about the internal procedures and thoughts of Google, Ask, MSN Search, and so forth--because they have the strongest incentive of all to clean pollution out of the data flow from the web-structure information commons. Yet they are not doing as good a job as I would expect, even with all the incentive in the world.

