Nicholas Weaver writes:
The price for Google's raw storage... for those who are curious: VIA EPIA-M motherboard: $130; small cube case + PSU: $70; memory: $50; 4x 200 GB OEM drives (7.2k RPM): $114 x 4, for a total system cost of $700 for 800 GB of disk, with the CPU and network to take advantage of it. I'm not including the cost of the switch itself, but with 48-port 100 Mbps switches in the $1000 range, internal network cost is almost 0.
Now Google has some price advantage here, but not THAT much. But for the sake of argument, let's assume Google is also at $1/GB for the raw disk and compute, rather than $.50/GB. Amortized over a 3-year lifespan, that's $.33/GB/year in hardware cost. The power budget is also low: assuming 50 W for the CPU/memory and 25 W for each disk gives a total power budget of 150 W/node. At $.20/kWh (power + cooling costs) that's $262/node/year, or again $.33/GB/year in operating power.
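For anyone who wants to check the arithmetic, here is a rough Python sketch. The function and parameter names are made up for illustration; the defaults are simply the figures quoted above ($1/GB, 3-year lifespan, 50 W + 4 x 25 W per node, $.20/kWh, 800 GB per node), not measured Google data.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def hardware_cost_per_gb_year(cost_per_gb=1.00, lifespan_years=3):
    """Purchase price per GB spread evenly over the hardware's lifespan."""
    return cost_per_gb / lifespan_years


def power_cost_per_node_year(cpu_watts=50, disk_watts=25, disks=4,
                             dollars_per_kwh=0.20):
    """Yearly electricity (plus cooling) bill for one node running 24x7."""
    node_watts = cpu_watts + disk_watts * disks      # 150 W with the defaults
    kwh_per_year = node_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * dollars_per_kwh


def power_cost_per_gb_year(node_gb=800):
    """Yearly power bill divided by the node's raw capacity."""
    return power_cost_per_node_year() / node_gb


if __name__ == "__main__":
    print(f"hardware: ${hardware_cost_per_gb_year():.2f}/GB/year")
    print(f"power:    ${power_cost_per_node_year():.2f}/node/year")
    print(f"power:    ${power_cost_per_gb_year():.2f}/GB/year")
```

With the defaults, both the hardware and the power term land at roughly a third of a dollar per raw GB per year, which is where the "less than $1/GB/year" figure comes from.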
Thus, it really DOES cost Google less than $1/GB/year for disk. But the cost for anyone who wants just lots of disk could be the same. So what's Google's secret?
Their secret is the ability to create reliable STORAGE out of such a system. It takes 3 GB of disk (plus a small amount extra for failover) for Google to extract a viable GB of reliable storage, as their filesystem is triply redundant. But this means it still only costs Google $2 per GB per year of reliable storage (3 x $.66). Another factor is that the minimum allocation size for Google's filesystem is 64 MB. Thus if Gmail uses a single file per user (a simple structure which has substantial benefits), and most users use a single block, that means it only costs $.12/user/year in storage.
Thus it only takes the average Gmail user clicking on ONE ad a year to pay for the cost of associated storage.
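The per-user figure can be reproduced with a few lines of Python. This is only a sketch of the estimate above: the constant names are invented here, the raw cost is the rounded $.33 + $.33 from the post, the "small amount extra for failover" is ignored, and 1 GB is taken as 1024 MB.

```python
RAW_COST_PER_GB_YEAR = 0.33 + 0.33   # hardware + power, rounded figures from the post
REPLICAS = 3                         # triply redundant filesystem
MIN_ALLOCATION_MB = 64               # minimum allocation size quoted in the post
MB_PER_GB = 1024                     # assumption: binary gigabytes


def reliable_cost_per_gb_year(raw=RAW_COST_PER_GB_YEAR, replicas=REPLICAS):
    """Every reliable GB occupies `replicas` GB of raw disk."""
    return raw * replicas


def cost_per_user_year(blocks=1, block_mb=MIN_ALLOCATION_MB):
    """Yearly storage cost for a mailbox that occupies `blocks` minimum-size blocks."""
    reliable_gb = blocks * block_mb / MB_PER_GB
    return reliable_gb * reliable_cost_per_gb_year()


if __name__ == "__main__":
    print(f"reliable storage: ${reliable_cost_per_gb_year():.2f}/GB/year")
    print(f"one-block user:   ${cost_per_user_year():.2f}/user/year")
```

A mailbox that fills its whole 1 GB quota would be `cost_per_user_year(blocks=16)`, about sixteen times the one-block figure.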
64 MB as the minimum filesize allocation in the Google filesystem? I remember when my entire hard disk was 5 MB... and I'm not that old.
Posted by DeLong at June 21, 2004 03:09 PM
You are too ;-} Tomorrow's me' birthday.
How about a 128k Mac? Believe it or not, I have a working IBM vt100 terminal with a working 300 baud modem. I love those green characters. So now I can have the entire Library of Congress in my own little array of storage for under 25k. Wow, information is getting cheap; thank heavens understanding is so expensive, or we would be out of our respective jobs!
I'm a whippersnapper, and I remember when the TRS-80 Model III that Dad brought home was choice supreme: it had 48k of RAM.
Momentarily staggered to think a pretty sweet little word processing program—Scripsit—ran in 48k...
Posted by: Kip Manley on June 21, 2004 04:13 PM
I remember my software development days, when I had a server with a 10 MB full-height RLL drive that had a bad stiction problem. We had a rubber mallet to *THUMP* the drive at the appropriate moment if the computer (an original XT) had to be rebooted. Total cost of this system (sans mallet) when new? $8,000. (It had a bunch of RAM. Maybe 2 megabytes? Remember when 16 kilobit chips were made by soldering two 8s together?)
Posted by: JDC on June 21, 2004 04:23 PM
I remember paying SunData $5K for a 13MB hard disk for an IBM S/34, and $400 for an external 40MB drive for a MacPlus.
Posted by: Linkmeister on June 21, 2004 04:48 PM
Google is clearly not allocating space in 64 MB chunks; that would mean that 16 messages would be enough to take them over the limit.
I strongly suspect that even if they are using 64 KB sectors, their mail storage system is more efficient than one message per file.
The most effective way to use the storage would be to simply store the incoming mail in a rolling log file format, with new posts being appended to the end in a continuous stream. The index would then be maintained as a separate entity. If you delete a post in the middle of the log you simply overwrite that part of the log with zeros.
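To make the idea concrete, here is a minimal Python sketch of such a log: messages are appended to one file, an index kept separately maps message ids to byte ranges, and a delete just zeroes the range. The class `MailLog` and its methods are invented for illustration (the index lives in memory here for brevity); nothing is known about how Gmail actually stores mail.

```python
import os
import tempfile


class MailLog:
    """Append-only mail log with a separate index (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.index = {}      # message id -> (offset, length) in the log file
        self.next_id = 0
        open(path, "ab").close()   # create the log file if it does not exist

    def append(self, body: bytes) -> int:
        """Write a message at the end of the log and record it in the index."""
        offset = os.path.getsize(self.path)
        with open(self.path, "ab") as f:
            f.write(body)
        msg_id = self.next_id
        self.next_id += 1
        self.index[msg_id] = (offset, len(body))
        return msg_id

    def read(self, msg_id: int) -> bytes:
        """Fetch a message by seeking straight to its recorded offset."""
        offset, length = self.index[msg_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def delete(self, msg_id: int) -> None:
        """Drop the index entry and overwrite the message bytes with zeros."""
        offset, length = self.index.pop(msg_id)
        with open(self.path, "r+b") as f:
            f.seek(offset)
            f.write(b"\x00" * length)


if __name__ == "__main__":
    log = MailLog(os.path.join(tempfile.mkdtemp(), "mailbox.log"))
    first = log.append(b"hello")
    second = log.append(b"lunch on Thursday?")
    log.delete(first)
    print(log.read(second))
```

Because nothing is ever moved, offsets of later messages stay valid after a delete; the zeroed space is simply wasted until some separate compaction pass reclaims it.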
The post leaves out the easiest way for Google to save on the cost of hardware: most users will not use anywhere near the 1 GB limit.
Posted by: Phill on June 21, 2004 06:49 PM
My assumption is probably one file per USER/MAILBOX, not message.
And my assumption IS that most users use the minimum capacity (one block) for their mailbox, which gets the $.12/user/year storage cost.
I'm not sure about Google; maybe they use the EPIA for Gmail, but I thought they mostly used rackmount boxes with more CPU and less disk for their core search service. Our new petabox design uses similar hardware. You can check it out at petabox.org or visit us in the Presidio.
We use larger disks, custom rackmount cases, and include an lcd. These make our costs a bit higher than you estimate. Also, you need to allow nearly 10% extra for failures because we run commodity hardware 24x7. Still much cheaper than any enterprise solution.
It draws less power, about 60 watts for the box plus 20 watts for cooling, and you can buy power for as little as $0.07/kWh if you play nice with PG&E.
Posted by: john berry on June 22, 2004 01:50 AM
AllenM,
An *ibm* vt100 terminal? Are you sure it doesn't say "Digital Equipment Corporation" somewhere on that cathode ray bottle?
Posted by: dennisS on June 22, 2004 05:56 AM
Does triple redundancy really require three times the storage? Couldn't some hybrid, with, say, 2 full copies and a parity file, be more efficient space-wise?
Posted by: Andrew Cholakian on June 22, 2004 10:12 AM
Parity needs to be maintained, which costs complexity. More importantly, Google wishes to be able to withstand 2 nearly simultaneous losses (PC hardware IS notoriously flaky), so a parity-type system, although saving space, would cost more overall: you would need more reliable hardware in order to drop the requirement of surviving 2 simultaneous failures, and Google's secret is the ability to extract reliability out of really CHEAP materials.
Posted by: Nicholas Weaver on June 22, 2004 10:52 AM
If Google wanted parity they could use software RAID on one of the disks. Actually, my guess is they probably do that for email, which requires a far higher level of reliability than their search engine. So for a small increase in disk costs, you increase your reliability immensely.
Posted by: Jon Juzlak on June 22, 2004 12:05 PM
Google actually stripes data across machines, not platters, Jon. That way if a machine goes down, the data remains available. Google's proprietary clustered filesystem technology operates at the application layer, not the OS layer, but is effective nonetheless at doing what it sets out to do -- provide a massively scalable, multiply redundant, highly searchable file store. (Think IBM AS/400 without being tied to a specific hardware platform).
Bad Tux -- I understand that, but there can still be catastrophic failure, even with 2 replicas. I think that using software RAID would be a low-cost way of increasing fault recovery even with Google's file system. RAID 5 can be done for < 20% additional cost. This is especially so for email, where the data is much more crucial than search data (search data can be rebuilt more easily).
Posted by: Jon Juzlak on June 22, 2004 01:25 PM
Some points...
1. Google's initial implementation might allocate a user's mailbox in 64 meg chunks (one user per file), but an obvious optimization is multiple users per file. This might not add much more complexity than is there already (from reading the GFS paper, their filesystem probably encourages a file structure in which applications re-parse large files to find the relevant records, so it wouldn't be much of an extension to ignore records not belonging to the current user). The file system cluster size probably should not be considered fundamentally relevant to the economics of GMail. After all, free mail services tend to see a high rate of "disposable" accounts: secondary accounts that people don't use much, temporary accounts that they throw away, etc. With data compression (see below) the median user will likely use considerably less than 64 megs of file space.
2. GMail can (and maybe does) benefit quite a bit from data compression. Email is primarily text, which is heavily compressible. Even with quite CPU-efficient algorithms it is easy to exceed 5:1 compression ratios. It does then cost more CPU to decompress the messages (and searches may have to decompress the whole mailbox), but remember that any given user is sucking up disk space 100% of the time but actually reading new messages, running searches, and so forth only a small fraction of the time. GMail could probably get a significant cost win by spending some CPU power for substantial disk space savings. (This is also compatible with their existing design from the GFS paper, in which the production file servers are dual Pentium 3 boards).
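The compression point is easy to try with Python's standard `zlib` module. This is a sketch, not a claim about what GMail does: the helper names are invented, and the ratio you get depends entirely on the text (the repetitive sample below compresses far better than real mail would).

```python
import zlib


def compression_ratio(text: str, level: int = 6) -> float:
    """Uncompressed size divided by compressed size for one mailbox's text."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, level)
    return len(raw) / len(packed)


def roundtrip(text: str, level: int = 6) -> str:
    """Compress and decompress again, as a read or search would have to."""
    packed = zlib.compress(text.encode("utf-8"), level)
    return zlib.decompress(packed).decode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample; substitute a real mailbox export to get a meaningful number.
    sample = "Hello, are we still on for lunch on Thursday?\n" * 200
    print(f"ratio: {compression_ratio(sample):.1f}:1")
    print("lossless:", roundtrip(sample) == sample)
```

Lower `level` values trade ratio for CPU time, which is exactly the disk-versus-CPU trade discussed in point 2.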
GMail may well be using a quite simple implementation right now - after all, their initial goal will simply be to get market share and make a big pre-IPO splash. The actual cost per user won't be a huge deal until they have tens of millions of users. At that point, some software optimization that gets back ten cents per user per year starts to be worth some real effort.
Knowing the real limits on GMail requires knowledge of how it's implemented that we just don't have, and knowledge of their other services. For example, maybe GMail boxes have a CPU use pattern that leaves a lot of free cycles to run tasks from unrelated services as well. This suggests itself to me because the GFS paper reports that for Google search, they used dual P3s with 80 gigs of disc, a fairly high CPU:disc ratio. Adding GMail might not be a cost in terms of X PCs of Y configuration, but some storage-heavy fraction of each PC.
Jon: With Google's replication scheme, they have a really very different setup from most servers so RAID probably would not be of much use to them (even at less than 20% additional cost).
For Google there is nothing special about a hard disc failure. The way their file system works, the moment storage is unavailable to the network for ANY reason it must be re-replicated to maintain availability. And the process of doing so is completely autonomous. When a box goes down it is marked as "dead" and sits there until it is convenient for someone to go and replace it, which might be a day later. The file system continues on its merry way.
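A toy Python sketch of that recovery loop: given which nodes hold each chunk and which nodes are still alive, forget the dead nodes and copy under-replicated chunks onto the least-loaded live nodes until each chunk is back at three copies. The function `re_replicate` and its data layout are assumptions made for illustration; this is not the GFS master's actual algorithm.

```python
TARGET_REPLICAS = 3  # replication factor discussed in this thread


def re_replicate(chunk_locations, live_nodes, target=TARGET_REPLICAS):
    """Return a new chunk -> set-of-nodes map restored to `target` replicas.

    chunk_locations: dict mapping chunk id to the set of nodes holding a copy.
    live_nodes: iterable of node names that are currently reachable.
    """
    live = set(live_nodes)
    load = {node: 0 for node in live}      # chunks held per live node
    result = {}

    # Step 1: forget replicas on dead nodes, whatever the reason they died.
    for chunk, nodes in chunk_locations.items():
        survivors = set(nodes) & live
        result[chunk] = survivors
        for node in survivors:
            load[node] += 1

    # Step 2: copy each under-replicated chunk to the least-loaded live nodes.
    for chunk, holders in result.items():
        if not holders:
            continue                        # every copy is gone; nothing to copy from
        candidates = sorted(live - holders, key=lambda n: (load[n], n))
        for node in candidates:
            if len(holders) >= target:
                break
            holders.add(node)
            load[node] += 1
    return result


if __name__ == "__main__":
    locations = {"c1": {"a", "b", "c"}, "c2": {"b", "c", "d"}, "c3": {"a", "c", "d"}}
    # Node "c" has died; node "e" is a fresh machine.
    print(re_replicate(locations, ["a", "b", "d", "e"]))
```

Note that the function never asks why a node is missing: a dead disc, a dead power supply, and a dead switch all look the same to it.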
What difference does RAID make to this scheme? Instead of going dead immediately on a disc failure, the machine will go dead when somebody replaces the disc, no real difference... unless they make all the discs on all their machines hot-swappable. Which they are probably not going to do because that would interfere with their plan of using custom-built ultra-cheap PCs with minimal hardware, rather than paying for the costs of actual server-style hardware.
Their entire scheme is to not bother to boost the reliability of the components, but to simply recover rapidly and reliably from failure (which can be much more than disc failure... all the way up to a failing switch taking down an entire network of machines). RAID with hot-swapping would increase protection against disc failure, but not against all the other failures caused by their cheap hardware. Their whole revelation was that protecting against all these kinds of failures is a big part of the reason why servers cost so much more than PCs.
Ian
Software RAID is free. There is no need for expensive server software, and the disk cost only goes up by 20% for RAID 5. What is the advantage? Well, given that the most common permanent failure mode is likely to be disk errors, software RAID essentially doubles your replication for these. Maybe instead of their default configuration of master + 2 replicas, they can use master with one replica, with RAID on both. Unless they're completely using up CPU on all replicas, it would seem it is advantageous to reduce the number of replicas.
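A quick Python sketch of the disk arithmetic behind this trade-off. The function names are invented, and it assumes a single-parity RAID 5 array where one disk's worth of capacity out of n goes to parity. Under that assumption the 20% figure corresponds to a six-disk array; on the four-drive node from the original post, parity costs one disk in four.

```python
def raid5_extra_cost(disks: int) -> float:
    """Extra raw disk bought per usable disk: one parity disk per (disks - 1) data disks."""
    if disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return 1 / (disks - 1)


def raw_per_usable_gb(copies: int, raid5_disks=None) -> float:
    """Raw GB consumed per reliable GB for `copies` full copies, optionally each on RAID 5."""
    per_copy = 1.0 if raid5_disks is None else 1.0 + raid5_extra_cost(raid5_disks)
    return copies * per_copy


if __name__ == "__main__":
    print(f"3 plain copies:           {raw_per_usable_gb(3):.2f} raw GB per GB")
    print(f"2 copies, 4-disk RAID 5:  {raw_per_usable_gb(2, 4):.2f} raw GB per GB")
    print(f"2 copies, 6-disk RAID 5:  {raw_per_usable_gb(2, 6):.2f} raw GB per GB")
```

So two RAID 5 copies do use less raw disk than three plain copies, but they survive the loss of only one whole machine rather than two, which is the failure case the replication scheme is built around.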
The requirements of a search engine are different from email. In the search engine case, if data is lost, no biggie, we can recreate it. For an email service to lose data is a major disaster. Hence, I suspect that Google may have put in more redundancy for mail. The GFS paper is probably out-of-date in this respect.
Again, the issue is not transient errors, but permanent errors with disks. We all know that major failure cases can happen, no matter how well something is replicated. For email, I would definitely use software RAID on the master and potentially also on the replicas to guard against the very rare possibility of complete failure on all.
Posted by: Jon Juzlak on June 22, 2004 03:13 PM