April 06, 2004

Notes: Google Synergies

Let me, for one, be among the first to welcome our new Synergetic Google Overlords:

Delivered-To: dfarber+@ux13.sp.cs.cmu.edu
Date: Tue, 06 Apr 2004 13:56:09 +0530
From: Suresh Ramasubramanian
Subject: Interesting speculation on the tech behind gmail
To: Dave Farber


April 04, 2004: The Secret Source of Google's Power

Much is being written about Gmail, Google's new free webmail system. There's something deeper to learn about Google from this product than the initial reactions to its features suggest, however. Ignore for a moment the observations about Google leapfrogging their competitors with more user value and a new feature or two. Or Google diversifying away from search into other applications; they've been doing that for a while. Or the privacy red herring.

No, the story is about seemingly incremental features that are actually massively expensive for others to match, and the platform that Google is building which makes it cheaper and easier for them to develop and run web-scale applications than anyone else.

I've written before about Google's snippet service, which required that they store the entire web in RAM. All so they could generate a slightly better page excerpt than other search engines.

Google has taken the last 10 years of systems software research out of university labs, and built their own proprietary, production-quality system. What is this platform that Google is building? It's a distributed computing platform that can manage web-scale datasets on 100,000-node server clusters. It includes a petabyte-scale, distributed, fault-tolerant filesystem, distributed RPC code, probably network shared memory and process migration. And a datacenter management system which lets a handful of ops engineers effectively run 100,000 servers. Any of these projects could be the sole focus of a startup.

Speculation: Gmail's Architecture and Economics

Let's make some guesses about how one might build a Gmail.

Hotmail has 60 million users. Gmail's design should be comparable, and should scale to 100 million users, though it will only have to support a couple of million in the first year.

The most obvious challenge is the storage. You can't lose people's email, and you don't want to ever be down, so data has to be replicated. RAID is no good; when a disk fails, a human needs to replace the bad disk, or there is risk of data loss if more disks fail. One imagines the old ENIAC technician running up and down the aisles of Google's data center with a shopping cart full of spare disk drives instead of vacuum tubes. RAID also requires more expensive hardware -- at least the hot swap drive trays. And RAID doesn't handle high availability at the server level anyway.

No. Google has 100,000 servers. [nytimes] If a server/disk dies, they leave it dead in the rack, to be reclaimed/replaced later. Hardware failures need to be instantly routed around by software.

Google has built their own distributed, fault-tolerant, petabyte filesystem, the Google Filesystem. This is ideal for the job. Say GFS replicates user email in three places; if a disk or a server dies, GFS can automatically make a new copy from one of the remaining two. Compress the email for a 3:1 storage win, then store each user's email in three locations, and the raw storage need is approximately equivalent to the user's mail size.
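The compress-then-replicate arithmetic above can be sketched in a few lines. The 3:1 compression ratio and triple replication are the article's guesses about GFS, not published Google figures:

```python
# Storage arithmetic for triple-replicated, compressed mail. The 3:1
# compression ratio and 3 replicas are the article's assumptions.
def effective_storage(raw_mail_gb, replicas=3, compression_ratio=3.0):
    """Net disk consumed: compress first, then replicate the compressed copy."""
    return raw_mail_gb / compression_ratio * replicas

# 3:1 compression exactly cancels 3x replication:
print(effective_storage(1.0))  # ~1.0 GB of disk per 1.0 GB of raw mail
```

The cancellation is the point: fault tolerance comes essentially free once the data is compressed.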

The Gmail servers wouldn't be top-heavy with lots of disk. They need the CPU for indexing and page view serving anyway. No fancy RAID card or hot-swap trays, just 1-2 disks per 1U server.

It's straightforward to spreadsheet out the economics of the service, taking into account average storage per user, cost of the servers, and monetization per user per year. Google apparently puts the operational cost of storage at $2 per gigabyte. My napkin math comes up with numbers in the same ballpark. I would assume the yearly monetized value of a webmail user to be in the $1-10 range.
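That napkin math might look like the following. The $2/GB storage cost and the $1-10 monetization range are the article's figures; treating $2/GB as a yearly operational cost, and the 0.5 GB average mailbox, are illustrative assumptions of mine:

```python
# Napkin economics per webmail user, per year. The $2/GB storage cost and
# the $1-10/user/year monetization range are the article's figures; the
# assumption that $2/GB is a yearly cost is mine.
def yearly_margin_per_user(avg_storage_gb, cost_per_gb_year=2.0, revenue_per_year=5.0):
    """Revenue minus operational storage cost for one user."""
    return revenue_per_year - avg_storage_gb * cost_per_gb_year

# A user averaging 0.5 GB stored, monetized at $5/year:
print(yearly_margin_per_user(0.5))  # 4.0 -> a $4/year margin
```

Even at the bottom of the monetization range, the service only goes underwater once the average mailbox approaches the full gigabyte.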

Cheap Hardware

Here's an anecdote to illustrate how far Google's cultural approach to hardware cost diverges from the norm, and what that divergence means as a component of their competitive advantage.

In a previous job I specified 40 moderately priced servers to run a new internet search site we were developing. The ops team overrode me; they wanted 6 much more expensive servers, arguing that 6 machines would be easier to manage than 40.

What this does is raise the cost of a CPU second. We had engineers that could imagine algorithms that would give marginally better search results, but if the algorithm was 10 times slower than the current code, ops would have to add 10X the number of machines to the datacenter. If you've already got $20 million invested in a modest collection of Suns, going 10X to run some fancier code is not an option.

Google has 100,000 servers.

Any sane ops person would rather go with a fancy $5000 server than a bare $500 motherboard plus disks sitting exposed on a tray. But that's a 10X difference in the cost of a CPU cycle, and going cheap frees up the algorithm designers to invent better stuff.
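The 10X claim falls straight out of the hardware prices. The $5000 and $500 price points come from the article; the server lifetime and utilization below are illustrative guesses, and the ratio is independent of them anyway:

```python
# Rough cost of a CPU-second under each hardware strategy. The $5000 and
# $500 price points are the article's; lifetime and utilization are
# illustrative guesses (the 10x ratio doesn't depend on them).
SECONDS_PER_YEAR = 365 * 24 * 3600

def cpu_second_cost(server_price, lifetime_years=3, utilization=0.5):
    usable_seconds = SECONDS_PER_YEAR * lifetime_years * utilization
    return server_price / usable_seconds

fancy = cpu_second_cost(5000)   # the ops-friendly box
bare = cpu_second_cost(500)     # the bare motherboard on a tray
print(f"{fancy / bare:.0f}x")   # prints "10x"
```

This oversimplifies by assuming equal compute per box, but that only makes the gap conservative: the fancy server rarely delivers 10X the cycles.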

Without cheap CPU cycles, the coders won't even consider algorithms that the Google guys are deploying. They're just too expensive to run.

Google doesn't deploy bare motherboards on exposed trays anymore; they're on at least the fourth iteration of their cheap hardware platform. Google now has an institutional competence building and maintaining servers that cost a lot less than the servers everyone else is using. And they do it with fewer people.

Think of the little internal factory they must have to deploy servers, and the level of automation needed to run that many boxes. Either network boot or a production line to pre-install disk images. Servers that self-configure on boot to determine their network config and load the latest rev of the software they'll be running. Normal datacenter ops practices don't scale to what Google has.

What Are All Those OS Researchers Doing at Google?

Rob Pike has gone to Google. Yes, that Rob Pike -- the OS researcher, the member of the original Unix team from Bell Labs. This guy isn't just some labs hood ornament; he writes code, lots of it. Big chunks of whole new operating systems like Plan 9.

Look at the depth of the research background of the Google employees in OS, networking, and distributed systems: compiler optimization, thread migration, distributed shared memory.

I'm a sucker for cool OS research. Browsing papers from Google employees about distributed systems, thread migration, network shared memory, and GFS makes me feel like a kid in Tomorrowland wondering when we're going to Mars. Wouldn't it be great, as an engineer, to have production versions of all this great research?

Google engineers do!

Competitive Advantage

Google is a company that has built a single very large, custom computer. It's running their own cluster operating system. They make their big computer even bigger and faster each month, while lowering the cost of CPU cycles. It's looking more like a general purpose platform than a cluster optimized for a single application.

While competitors are targeting the individual applications Google has deployed, Google is building a massive, general purpose computing platform for web-scale programming.

This computer is running the world's top search engine, a social networking service, a shopping price comparison engine, a new email service, and a local search/yellow pages engine. What will they do next with the world's biggest computer and most advanced operating system?

Posted by DeLong at April 6, 2004 10:45 AM

Uh, Brad... Gmail was an April Fool's joke. Seriously.

Posted by: manyoso on April 6, 2004 10:47 AM


No, it wasn't. The lunar Google office was.


Posted by: V from VJ on April 6, 2004 10:57 AM


I think that damn thing will wake up and then we'll all be in trouble.

Adrian CF

Posted by: Adrian Spidle on April 6, 2004 11:07 AM


If all this is true, then why does Google's usenet news group service suck so much?

Posted by: Fabio on April 6, 2004 11:28 AM


Interesting take. I was thinking that the Google IPO would be an overblown costly hype event but considering the web based computing expertise you've described, I'm buying.

Posted by: jpmist on April 6, 2004 11:52 AM


So when do they take on the NSA?

I'm not sure the NSA can muster more cycles than Google.

Posted by: Charles M on April 6, 2004 12:46 PM


A few notes:

1. Gmail was not a joke.

2. Google Groups sucks because Usenet itself sucks. GIGO.

3. Suresh works for a service that competes with Gmail, and knows whereof he speaks.

Posted by: Doctor Memory on April 6, 2004 02:59 PM


My god, I don't know much about all of this technical stuff, but one take-away from the heavy R&D that Google has going into OS research leads me to believe that they could be positioning themselves to market their proprietary OS, if they can scale it down, for servers and workstations. This would be a bold and awesome challenge to Microsoft in that market that I think many would welcome. That would be wild. Go capitalism!

Posted by: Bubb Rubb on April 6, 2004 03:47 PM


I've recently had the [mis]fortune to manage a large modern data center that uses the "open computing" model after having spent most of my career in a mainframe environment.

My immediate reaction was that there is a HUGE opportunity for cost and quality improvement, along the lines of what Google appears to have achieved. Everything in this post makes perfect sense to me.

The "open" computing model brought with it many benefits. It allowed companies to concentrate on only one type of component in the data center: disk arrays, servers, switches, DBs, operations tools, etc. It allowed for revolutionary innovations such as appliances and SANs. It lowered the barriers to entry for vendors, thus increasing competition. And as a result it helped to greatly lower the cost of computing and drastically increase the quality of each component.

But on the down side, open computing meant that it was up to the user of the technology to integrate everything. As applications have grown exponentially, so have the complexities of integration. This creates two problems.

First, since no vendor owns the end-to-end solution, product optimizations like those that Google is described as implementing just don't happen. The optimizations require coordination of all end-to-end components, including the application. And the R&D costs are in the tens of millions if not hundreds of millions. No single component vendor could make profits from such a solution, and no consortium would forgo enough of their own R&D budget to contribute enough to create a combined large-scale solution.

Second, while in the open computing model cost has been driven down and component quality has been drastically improved, the overall quality of data center installations has suffered due to the inability of most data centers to manage the complexity required.

In the mainframe days most of the complex system integration was done by the mainframe vendor, and was included as part of the shipped solution. Yes, there was some complexity in configuring mainframe systems (as I used to do), but that was simplicity compared to today's applications. As a result, you needed just a handful of guru-level system architects at the vendor who really understood the end-to-end integration to make it work. Now in the open computing model every installation needs a guru-level architect to pull it all together. There just aren't enough of these people to go around. So compromises are made in hiring, and quality suffers. Frequently masses of technical people are thrown at a system upgrade or installation at the very end of the project to solve all the unforeseen problems and get the darn thing to work. In such an environment, people just don't have the time or energy to think of the optimizations that Google has invented.

(For a while the thought was that system integrators (like EDS) or application service providers (ASPs) would perform this work. And they did offer such services, but usually at very high cost and rarely meeting quality goals. During the dot.com era these services were popular, but their poor economics meant that most ASPs went the way of Marc Andreessen's Loudmouth after the dot.com bust.)

(The other common problem that Google has obviously solved is that most companies separate development and operations and the two usually develop antagonistic relationships. As Google's solution clearly combines contributions from both the development and operations sides, they apparently have avoided the feuding organizational model.)

Posted by: Z on April 6, 2004 03:51 PM


Bubb Rubb:

You're on to something, but slightly off track. The Google OS, as described, was designed to meet the unique goals of an application with ultra-high transaction volumes and disk search rates. Forget about it competing with desktop or small server operating systems, as those applications have completely different requirements.

However, Google's application MAY present a huge opportunity for selling to high-end operators like AOL, EBay, Visa, WalMart, NYSE, the travel industry, and of course DoD and DHS. I say MAY because a lot depends on whether it was designed with resale in mind. It's one thing to build an in-house application. It's a much bigger project to design the application and build the accompanying infrastructure necessary to package the application and install it and support it elsewhere. Even with the best software development practices, the current Google application probably has countless instances where the design assumes that it will only be used in Google. And if that is the case, the project necessary to make Google's OS reusable elsewhere would be massive.

Posted by: z on April 6, 2004 04:02 PM


Y'all might want to take a look at how Oracle has been quietly transforming itself from a database into a distributed OS/compute platform ... with support.

Posted by: cak on April 6, 2004 05:42 PM


I'm wowed.

I hope it's well fed, and doesn't get angry.

Posted by: andrew on April 7, 2004 06:56 AM


This looks to me like the first step in extending the cluster out to the internet at large.

If your management costs are really low, and you don't care if hardware goes offline periodically (or possibly even often), why should you pay for hardware at all? Why not just use all of the free space on everybody's machines connected to the internet already?

There's no reason why a Google OS couldn't automatically allocate all unused storage space on a desktop machine (which might be hundreds of GB) to the GoogleDistributedStorageCluster, and give it back when you ask for it. The Google Toolbar could conceivably be a testbed for this.

Posted by: Adam Fields on April 8, 2004 09:31 AM


Google doesn't have to package the OS and sell it as a product. Customers will just outsource their computing to Google. As Moore's law continues, the cost of computing power decreases exponentially, eventually reaching the point where Google can afford to own all the computing power the whole world can possibly use.

Yes, computer usage is observed to expand to utilize all available resources, but not at an exponential rate. The growth of the use of computers is limited by human factors: How fast can people learn to use the programs that are available? How fast can we create the software to do the things we aren't already doing on computers?

Posted by: Warren on April 8, 2004 04:02 PM


well google rox!!

Posted by: Mike Jayx on May 12, 2004 03:17 PM


i want my gmail

Posted by: anne on June 12, 2004 02:27 PM


Very informed and interesting comments!

Posted by: katty on July 21, 2004 06:33 AM

