August 28, 2002
How Do You Find the Relevant Weblog?

How would you go about finding a relevant weblog? This is a hard question, and so people are starting to think about building tools to answer it. Yet another example of the possibility that in the information age, the librarians (broadly construed) will rule the world because they will be the only people who can find anything.


The BlogMD Initiative: ...At present, numerous applications are available in the weblog world which provide interesting and useful methods of tracking weblogs and help users perform a vital sifting function to find the blogs that interest them most. Some tools track when a weblog was last updated (weblogs.com); some track the most popular Internet links currently being pointed at by weblogs (Blogdex); and more recently, the Blogosphere Ecosystem at The Truth Laid Bear has tracked the links passing between weblogs (as does the similar, but more powerful, Myelin Ecosystem). All of these applications are, at their core, doing the same thing. One way or another, they are gathering information about weblogs --- metadata --- storing it, analyzing it, and presenting their results on a web page. The guiding principle behind the BlogMD initiative is that by creating standards in the weblog metadata "problem space", we can enable greater collaboration and interaction between existing applications, as well as paving the way for future, currently unforeseen metadata applications by reducing or eliminating much of the redundant, "reinventing the wheel" work currently involved in creating a new weblog metadata application...

A Proposal: The Blog Metadata Initiative (BlogMD)

Part I: Introduction and Project Description

Background:

At present, numerous applications are available in the weblog world which provide interesting and useful methods of tracking weblogs and help users perform a vital sifting function to find the blogs that interest them most. Some tools track when a weblog was last updated (weblogs.com); some track the most popular Internet links currently being pointed at by weblogs (Blogdex); and more recently, the Blogosphere Ecosystem at The Truth Laid Bear has tracked the links passing between weblogs (as does the similar, but more powerful, Myelin Ecosystem). All of these applications are, at their core, doing the same thing. One way or another, they are gathering information about weblogs --- metadata --- storing it, analyzing it, and presenting their results on a web page.

The guiding principle behind the BlogMD initiative is that by creating standards in the weblog metadata "problem space", we can enable greater collaboration and interaction between existing applications, as well as paving the way for future, currently unforeseen metadata applications by reducing or eliminating much of the redundant, "reinventing the wheel" work currently involved in creating a new weblog metadata application.

Objectives:

- To define standard data structures and APIs for the storage and exchange of weblog metadata, and in so doing, enable the creation of a distributed network of metadata servers. These servers can then provide metadata-enabled services on their own hosted pages, or can support services provided by client applications elsewhere.

- To define a standard BlogMD ping API, capable of allowing a weblog to transmit metadata such as last updated time, link information, and blog entry excerpts; and to work with CMS developers to encourage the incorporation of ping functionality into CMS platforms.

- To work with existing metadata application providers to ensure that BlogMD data structures and APIs are defined in such a way that they support the needs of existing applications, and to encourage the uptake and incorporation of the standard APIs into existing metadata applications.

- To create an initial open-licensed metadata server implementation to be used as a ‘reference platform’ and starting framework for future server implementations.

Expected Benefits

But more specifically, what benefits will the BlogMD initiative provide?

For weblog readers & authors:

SIMPLIFY AND ENHANCE THE ABILITY OF READERS TO FIND WEBLOGS BY GENRE, LOCATION, LANGUAGE, OR OTHER CATEGORY-TYPE CHARACTERISTICS: While there do currently exist metadata applications which attempt to categorize weblogs by subject, or identify them by location, these applications generally cover only a self-selected subset of weblogs, and suffer from the problem that these characteristics must be entered by the weblog author manually. The BlogMD APIs will provide the ability for CMS developers to allow weblog authors to describe their blogs using a standard slate of characteristics, and then publish – automatically – those traits to as many metadata applications as they wish without any additional effort. Authors win: they no longer have to enter such information multiple times on multiple web forms. Readers win: they gain powerful tools for searching a wider range of weblogs for the characteristics that interest them. And authors win again: more readers who ‘self-select’ for the author’s areas of interest show up on their front page.

FACILITATE THE CREATION OF MICRO-ECOSYSTEMS OF WEBLOGS: By creating a freely distributable standard reference platform explicitly designed to make setting up a basic metadata application simple, the BlogMD initiative will encourage the creation of metadata applications to serve sub-communities, or “micro-ecosystems” of weblogs that previously would not have had the resources or skills to develop their own applications. The recently formed Christian blog site, blogs4God, provides an excellent example of the kind of sub-community that BlogMD will enable in this fashion. At the same time, the data exchange architecture of the BlogMD APIs will also allow metadata applications focusing on specific sub-communities to easily exchange data with other applications: thereby allowing specialization without forcing isolation.

For metadata application developers:

ENABLE THE RETRIEVAL OF METADATA FROM MULTIPLE SOURCES VIA A SINGLE COMMON API: Today, there is no common method available for retrieving metadata from existing applications; some apps make their information available in various custom formats, others provide no access at all except via their web pages. A developer seeking to implement a client application which accesses metadata information from disparate data sources is forced to develop custom interfaces for each server it needs to pull information from. The BlogMD initiative will drastically reduce the need for custom interfaces by providing a single, common API set to meet the majority of metadata application needs.

ENABLE METADATA APPLICATIONS TO RECEIVE DATA DIRECTLY FROM WEBLOGS, RATHER THAN BEING FORCED TO ‘TRAWL’ PAGES: The most efficient method for a metadata app to obtain its information is by receiving it directly from weblogs themselves via a ping. However, no individual application developer is likely to be able to motivate CMS developers to create a custom ping functionality to support their single application; and therefore, most such applications are forced to incorporate routines to ‘trawl’ web pages – or are simply never built at all. The BlogMD initiative, by creating a standard ping API supported by multiple CMS developers, will remove this “barrier to entry” from the weblog metadata application space.

ENABLE ENTRY-BASED METADATA APPLICATIONS: At present, few metadata applications exist that actually provide information on individual entries within a weblog, rather than simply general characteristics of the entire weblog itself. By including the capability to provide metadata about specific weblog entries in the BlogMD ping API (such as entry title, category, and excerpt), an application domain that was previously barely addressed will be flung wide open: metadata applications that present information on weblog posts will become straightforward to develop.

Initiative Process & Governance

The guiding principle of the BlogMD initiative will be openness: wherever possible input will be solicited from any and all interested parties. The project is at its heart meant to be a collaborative effort, drawing upon the talents of the entire weblogging community. To serve this end, a publicly accessible forum has been established which will serve as the primary communications mechanism for the initiative.

At some point, however, final decisions must be made if standards are to be established. Therefore, a governing board is being established that will have final decision-making powers over all BlogMD efforts. The weblog community, of course, may choose to ignore any and all decisions of our little committee -- which should keep the board well motivated to actually make decisions that make sense and serve the community interest, rather than playing politics. We can, of course, only hope.

That said, responsibilities of board members will include:

- Leveraging their own experience & skills to contribute to the discussion and debate on the work of the initiative
- Voting on all final standards decisions (data model; API specifications)
- Voting on strategic decisions to determine the direction of the BlogMD initiative
- Publicly representing the BlogMD initiative, and evangelizing BlogMD standards.

At present, the following individuals have joined the board:

N.Z. Bear of the Blogosphere Ecosystem and The Truth Laid Bear
Phillip Pearson of the Myelin Ecosystem
Dean Peters of blogs4God.com and healyourchurchwebsite.com

In addition, invitations have been sent to several additional individuals who have made significant contributions to the weblog world to join the board; we are awaiting their replies.

N.Z. Bear will act as overall coordinator for the initiative’s efforts. His intention is to be a point of contact and coordination, but to defer all significant decision-making to the board (which will in turn defer all decision-making to the community wherever possible). (N.Z. Bear intends to adopt the title Lead Cat Herder, which will likely prove a far more accurate description than ‘Team Lead’ or any other such grandiose appellation).

Suggestions for additional board members who would bring appropriate experience to the table are welcome. Once the initial board is determined and the initiative is publicly announced, additions to the board will be approved by a majority vote of the existing board members.

Next Steps

1) Finalize BlogMD founding board membership
2) Identify core team to drive the discussion and development of the BlogMD data model specification
3) Draft & publish for public comment a completed draft version of the BlogMD data model specification (expanding/revising my initial proposal below)
4) Identify core team to drive the discussion and development of the BlogMD API set
5) Draft & publish for public comment a completed draft version of the BlogMD API set
6) Identify a core team to begin development on the reference server application

Part II: Preliminary Design

Conceptual Data Model

Significant discussion will be required to hammer out a data model for metadata which all interested parties will be able to agree upon. (And in fact, such a perfect outcome is unlikely; compromise is inevitable). However, the following basic structure is proposed as a starting point for discussion:

Core Weblog Attributes:
-Name
-URL
-Categories (A standard taxonomy of categories would be extremely useful --- yet another contentious discussion, to be sure. Intent is that a weblog should be able to be listed under multiple categories)
-Language
-Physical Location
-Weblog CMS being used
-Owner name
-Owner e-mail
-Owner gender
-Owner age
-Date/Time weblog was last updated
-Date/Time weblog last pinged the server (provided metadata actively through one of the APIs)
-Date/Time weblog was last scanned by the server (see servers below)

Many of these fields should be optional; only the core fields such as Name, URL, and update tracking should be mandatory, allowing weblog authors to decide for themselves what information to share about themselves and their weblog.
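
As a starting point for that discussion, the core attributes might be sketched as a record type along the following lines. This is only an illustration, not part of the proposal: the class name, the field names, and the choice to make only name, URL, and last-updated time mandatory are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class WeblogRecord:
    """Hypothetical record for the core weblog attributes listed above."""

    # Mandatory core fields (assumed: name, URL, and last-updated time).
    name: str
    url: str
    last_updated: datetime
    # Optional descriptive fields; the author decides what to share.
    categories: List[str] = field(default_factory=list)
    language: Optional[str] = None
    location: Optional[str] = None
    cms: Optional[str] = None
    owner_name: Optional[str] = None
    owner_email: Optional[str] = None
    owner_gender: Optional[str] = None
    owner_age: Optional[int] = None
    # Timestamps maintained by the server rather than the author.
    last_pinged: Optional[datetime] = None
    last_scanned: Optional[datetime] = None

    def shared_fields(self) -> dict:
        """Return only the attributes that have actually been supplied."""
        return {k: v for k, v in vars(self).items() if v not in (None, [])}
```

A weblog that shares nothing beyond the mandatory fields would then expose only its name, URL, and last-updated time through `shared_fields()`.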

In addition to the core weblog attributes, a standard structure for storing links between weblogs is required:

Link Attributes:
- Weblog where the link is found
- URL of the destination link
- Date/Time link was last updated
- Date/Time link data was last received in a ping to the server (provided metadata actively through one of the APIs)
- Date/Time link data was last updated via a scan by the server (see servers below)

And finally, a data structure for storing metadata about the entries within a weblog:

Entry Attributes:
- Weblog where the post is found
- Entry Title
- Entry categories
- Entry excerpt
- TrackBack URL of another entry this entry references (from the TrackBack specification being promoted by Movable Type)
- Date/Time entry was last updated
- Date/Time entry data was last received in a ping to the server (provided metadata actively through one of the APIs)
- Date/Time entry data was last updated via a scan by the server (see servers below)
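
The link and entry structures could be sketched in the same spirit. Again, this is a rough illustration under assumptions: the class names, field names, and the `freshest` helper are hypothetical and do not come from any agreed specification.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class LinkRecord:
    """Hypothetical record for one link found on a weblog."""
    source_weblog_url: str           # weblog where the link is found
    destination_url: str             # URL the link points to
    last_updated: Optional[datetime] = None
    last_pinged: Optional[datetime] = None    # received via a ping
    last_scanned: Optional[datetime] = None   # found by a server scan


@dataclass
class EntryRecord:
    """Hypothetical record for one entry (post) within a weblog."""
    weblog_url: str                  # weblog where the post is found
    title: str
    categories: List[str] = field(default_factory=list)
    excerpt: str = ""
    trackback_url: Optional[str] = None
    last_updated: Optional[datetime] = None
    last_pinged: Optional[datetime] = None
    last_scanned: Optional[datetime] = None


def freshest(record) -> Optional[datetime]:
    """Most recent time the server heard about a record, by ping or scan."""
    seen = [t for t in (record.last_pinged, record.last_scanned) if t is not None]
    return max(seen) if seen else None
```

Keeping separate ping and scan timestamps lets a server tell actively reported data apart from data it had to collect itself.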


BlogMD Servers: Conceptual Design

A BlogMD server is defined as any application which will support one or more of the to-be-defined BlogMD APIs which leverage the common metadata model. Metadata server owners should be allowed the freedom to implement the complete API set or only those APIs which they deem valuable to their application.

The BlogMD API set should provide APIs to address the following types of requests (this list is representative, but like the data model, is intended as a starting point for discussion):

- Return basic metadata for a single weblog in a query by name or URL
- Return basic metadata for one or many weblogs in a query by other metadata attributes such as date/time last updated or category
- Return metadata for one or many links based on query parameters
- Return metadata for one or many posts based on query parameters
- Accept inbound metadata for a weblog, its posts, and/or its links (active ping). The model here is obviously www.weblogs.com; we will be following the trail blazed by the Radio Userland team in this area.
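
To make the active ping concrete, here is a minimal sketch of how a ping request might be encoded and decoded as XML-RPC using Python's standard library. No BlogMD API has been defined yet, so the method name `blogmd.ping` and the payload keys are purely hypothetical; the sketch only builds and parses the message and does not contact any server.

```python
import xmlrpc.client

# Hypothetical method name; the real API set is still to be defined.
PING_METHOD = "blogmd.ping"


def build_ping(name, url, last_updated, entries=()):
    """Encode a ping carrying weblog and entry metadata as an XML-RPC call."""
    payload = {
        "name": name,
        "url": url,
        "lastUpdated": last_updated,   # ISO 8601 string, by assumption
        "entries": list(entries),      # each entry is a dict of strings
    }
    return xmlrpc.client.dumps((payload,), methodname=PING_METHOD)


def read_ping(xml_text):
    """Decode a ping on the server side and return its payload dict."""
    params, method = xmlrpc.client.loads(xml_text)
    if method != PING_METHOD:
        raise ValueError("unexpected method: %r" % method)
    return params[0]


if __name__ == "__main__":
    request = build_ping(
        "Example Blog",
        "http://example.com/",
        "2002-08-28T09:34:00",
        [{"title": "Hello", "excerpt": "First post"}],
    )
    print(read_ping(request)["entries"][0]["title"])
```

A CMS would send the encoded request to a server endpoint; the server side would decode it with something like `read_ping` and store the result.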

In addition, a separate set of “bulk transfer” APIs may be desirable to enable efficient large-scale replication of data between BlogMD servers.

A BlogMD server can populate its database of weblog metadata in any way it chooses. One method, obviously, is to simply accept inbound pings from weblogs. Certainly in the near-term, however -- and possibly forever -- it is obvious that "crawlers" will be required to retrieve metadata from weblogs which are not actively pinging.

A highly desirable feature to incorporate into the BlogMD API and server architecture would be to allow for the transparent re-routing of API requests from clients in the case that the queried server does not have information on the requested weblog. Peer-to-peer networks such as Gnutella might serve as a model. The benefit of such an approach would be to truly enable a distributed network of servers which work collaboratively: each server can store locally the information it deems most relevant to its application, but it can always retrieve information outside its preferred domain from other servers.

One implementation of this concept: when a BlogMD server is attempting to retrieve metadata for a list of weblogs, rather than blindly crawling their pages, it could query the server network to see, for each blog, whether another server had recently crawled it, and if so, simply receive the metadata from the BlogMD network as opposed to re-crawling the page.
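
The lookup order described here (local store first, then peers, then a crawl of last resort) might look roughly like the sketch below. Everything in it is an assumption made for illustration: peers are modelled as plain callables, records as dicts with a `last_scanned` timestamp, and "recently" as a configurable four-hour window.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable, Optional

# A peer is modelled as a callable: weblog URL -> metadata dict, or None.
Peer = Callable[[str], Optional[dict]]


def fetch_metadata(
    url: str,
    local: Dict[str, dict],
    peers: Iterable[Peer],
    crawl: Callable[[str], dict],
    now: datetime,
    max_age: timedelta = timedelta(hours=4),
) -> dict:
    """Return fresh metadata for url, crawling only if nobody has it."""

    def fresh(record: Optional[dict]) -> bool:
        return record is not None and now - record["last_scanned"] <= max_age

    record = local.get(url)
    if fresh(record):
        return record
    # Ask the network before spending a crawl on the page.
    for peer in peers:
        record = peer(url)
        if fresh(record):
            local[url] = record
            return record
    # Nobody has recent data, so crawl the page ourselves.
    record = crawl(url)
    local[url] = record
    return record
```

In a real network the peer calls would be BlogMD query API requests, and some rule would be needed to stop requests from being re-routed in circles.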

Within the BlogMD architecture, therefore, data can be exchanged between servers via two methods: scheduled replication (via the bulk transfer APIs) or on-demand (via the rerouting capability of the standard query APIs).

The BlogMD server specification is not intended to place any restrictions or obligations on a server owner as to whether they make all, part, or none of the metadata they have gathered available through the BlogMD APIs. (Although if they're going to provide none, it is arguable that they are not technically a server). For example: a server owner might decide that they will focus on a particular sub-community of weblogs, and scan them every four hours, displaying the results immediately on the server's pages. The owner might decide, however, to only provide data via the BlogMD APIs which is at least twelve hours old, thereby preserving their 'competitive advantage' in that sub-community. (There would be, of course, nothing to stop another server owner from executing the same scan).
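
The twelve-hour example amounts to a simple filter applied before data leaves through the API. A minimal sketch, assuming records are dicts carrying a `last_scanned` timestamp (a hypothetical representation):

```python
from datetime import datetime, timedelta


def publishable(records, now, embargo=timedelta(hours=12)):
    """Keep only records old enough to be released through the API."""
    return [r for r in records if now - r["last_scanned"] >= embargo]
```

The server's own pages would read the full record list, while API responses would pass through `publishable` first.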

Finally, security concerns should be addressed seriously. The BlogMD APIs should support (but not require) authentication, so that BlogMD server owners can, if they choose, restrict access to some or all of their metadata.


BlogMD Clients: Conceptual Design

A BlogMD client is simply any application that uses the APIs to exchange data with a BlogMD server. In this way, of course, most BlogMD servers will also themselves be BlogMD clients.

Clients could also be anything from other web sites, to desktop applications, to JavaScript code to be embedded in web pages. And of course, CMS tools can also act as BlogMD clients: primarily by utilizing the ping API, but potentially leveraging other APIs as well.

Future generations of BlogMD-enabled CMS tools could actually use the APIs to incorporate functionality currently found scattered in various blog-tracking clients. For example, a CMS tool could allow a weblog author to specify their favorite other weblogs, and then present the author with those weblogs’ most recent entries upon startup: ready to be linked to and commented upon.

Technology Notes

The obvious choice for the definition of the metadata structure would seem to be XML.

Similarly, XML-RPC and SOAP would provide widely accepted protocols on which to base the BlogMD API set.
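
For illustration only, a weblog's core attributes might be written to and read from XML roughly as follows. The element names (`weblog`, `name`, `url`, `language`, `categories`, `category`) are invented for this sketch; the actual vocabulary would come out of the data model discussion.

```python
import xml.etree.ElementTree as ET


def weblog_to_xml(name, url, categories=(), language=None):
    """Serialize a few core weblog attributes to an XML string."""
    root = ET.Element("weblog")
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "url").text = url
    if language:
        ET.SubElement(root, "language").text = language
    cats = ET.SubElement(root, "categories")
    for category in categories:
        # A weblog may be listed under several categories.
        ET.SubElement(cats, "category").text = category
    return ET.tostring(root, encoding="unicode")


def weblog_from_xml(text):
    """Parse the XML produced by weblog_to_xml back into a dict."""
    root = ET.fromstring(text)
    return {
        "name": root.findtext("name"),
        "url": root.findtext("url"),
        "language": root.findtext("language"),  # None when omitted
        "categories": [c.text for c in root.iter("category")],
    }
```

Optional attributes are simply left out of the document, which matches the idea that authors decide what to share.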

Part III: Conceptual Design for the Metadata Application Reference Platform

Goals

- Create a freely distributable BlogMD server which implements basic server functionality such as web-page trawling, analysis of data and display via HTML, and full implementation of all BlogMD APIs
- Release the reference platform as open source, with the full expectation and desire for it to be further enhanced and extended by the weblogging community

We should measure the success of the reference platform as follows: a user with basic web development knowledge should be able to download, install, and configure the server to both trawl a desired list of websites and accept pings, displaying the results on their web page. Total time / effort involved should be no more than a few hours, and no coding should be required for a ‘vanilla’ implementation.

Technology

Technology decisions will be a key discussion point for the development of the reference platform, but the following is proposed as a starting point for discussion:

- UNIX-based platform using MySQL database
- Core functionality to be implemented using a common scripting language such as Perl, PHP, or Python
- Display functionality to be template-driven utilizing CSS


Architecture

The reference platform should be designed in a highly modular fashion. The following key modules are suggested for the initial implementation:

Trawler: Accepts a list of weblog names and URLs, and parses the HTML of each weblog, retrieving all metadata possible and storing it in the server database. Particular care should be taken to build the trawler as a well-behaved application: weblogs should be provided ways (via robots.txt or otherwise) to prevent their pages from being trawled, and trawling speed should be configurable to ensure servers are not flooded.
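
The two "well-behaved" requirements, honoring robots.txt and limiting speed, could be sketched with Python's standard library as below. The user-agent string and function names are hypothetical, the robots.txt rules are assumed to have been downloaded already and to belong to the same host as the URLs being checked, and the page fetcher is passed in so that the sketch itself makes no network requests.

```python
import time
import urllib.robotparser

USER_AGENT = "BlogMDTrawler"  # hypothetical user-agent name


def allowed_urls(urls, robots_txt_lines):
    """Filter out URLs that the site's robots.txt asks crawlers to skip."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_lines)
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def trawl(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            # Configurable pause so target servers are not flooded.
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

A full trawler would fetch and cache each host's robots.txt separately and apply the delay per host rather than globally.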

Parser: Takes the raw data retrieved by the trawler and processes it to refine it for further use. For example, the trawler might retrieve two URLs: “http://www.truthlaidbear.com” and “http://truthlaidbear.com”. The parser should follow well-defined rules to identify that these URLs in fact correspond to the same weblog.
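
One possible set of rules for the example above is sketched here. The rules themselves (ignore the scheme, lowercase the host, drop a leading "www.", drop a trailing slash) are assumptions chosen for illustration, not rules the proposal has settled on.

```python
from urllib.parse import urlsplit


def weblog_key(url):
    """Reduce a weblog URL to a key that is the same for trivial variants."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return host + path


if __name__ == "__main__":
    a = weblog_key("http://www.truthlaidbear.com")
    b = weblog_key("http://truthlaidbear.com")
    print(a == b)
```

Rules like these will occasionally merge sites that are in fact distinct, which is one reason they would need to be well defined and agreed upon.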

Analyzer: Performs analysis on the data gathered to generate summary and relationship statistics. The analyzer is the module which will count the number of ‘outbound’ and ‘inbound’ links for a weblog, for example, replicating the core logic of the existing Ecosystem metadata applications.
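
The counting step might be sketched as follows. This is not the logic of the existing Ecosystem applications, whose internals are not described here; it is a simple tally that assumes links arrive as (source weblog, destination weblog) pairs, counts each pair once, and ignores links from a weblog to itself.

```python
from collections import Counter


def link_counts(links):
    """Tally inbound and outbound links per weblog."""
    inbound = Counter()
    outbound = Counter()
    for source, destination in set(links):  # count each pair once
        if source == destination:
            continue  # a weblog linking to itself is not counted
        outbound[source] += 1
        inbound[destination] += 1
    return inbound, outbound
```

Ranking weblogs by inbound count then reduces to `inbound.most_common()`.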

API Layer: Implements the full set of BlogMD APIs; accepting inbound pings from weblogs and responding to data requests from BlogMD clients.

Search: Implements database search functionality to support queries via the BlogMD APIs.
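
As a rough illustration of attribute-based search, the sketch below filters in-memory records rather than querying a database. The record layout (dicts keyed by attribute name) is an assumption, and a real server would translate the same criteria into database queries.

```python
def search(records, **criteria):
    """Return records whose attributes match every given criterion."""

    def matches(record):
        for key, wanted in criteria.items():
            value = record.get(key)
            if isinstance(value, (list, tuple, set)):
                # Multi-valued attributes (e.g. categories) match on membership.
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True

    return [r for r in records if matches(r)]
```

For example, `search(records, language="en", categories="economics")` would return English-language weblogs listed under economics.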

Display: Generates web pages via HTML/CSS to display weblog metadata. For the initial implementation, supported display pages might include:
- Recently updated list (similar to www.weblogs.com)
- Inbound & outbound link lists (similar to the Ecosystem projects)
- Search (allowing a reader to search weblogs by any available metadata attribute)

The precise scope of functionality to be addressed in the initial release of the reference platform will be confirmed in the initial stages of the project.
 

Posted by DeLong at August 28, 2002 09:34 AM | Trackback
