" Posted by DeLong at April 23, 2004 07:41 AM | TrackBack | | Other weblogs commenting on this postACM Queue - A Conversation with Matt Wells - When it comes to competing in the search engine arena, IS bigger always better?: MW Gladly. Search engine spam is not exactly the same as e-mail spam. Taken as a verb, when you spam a search engine, you are trying to manipulate the ranking algorithm in order to increase the rank of a particular page or set of pages. As a noun, it is any page whose rank is artificially inflated by such tactics.
Spam is a huge problem for search engines today. Nobody is accountable on the Internet and anybody can publish anything they want. A large percentage of the content on the Web is spam. The spammers are highly motivated to garner as many high-ranking positions as they can. It translates directly into more financial rewards. One of the simplest ways to subvert a search engine, or at least try to, is to repeat keywords in the Web page. This tactic worked well in the mid-90s but not so much anymore. One of the more common methods today is link spamming. Webmasters and SEOers [SEO stands for search engine optimization] exchange links with each other to fool the link analysis algorithms that all the big engines employ. Some of the more evil Webmasters will purchase hundreds of IPs supporting thousands of domains hosting millions of randomly generated pages. So you get this entirely artificial Web community that boosts itself to the top of the results.
Each page has something like 1,000 random words. These guys are aiming for the tail end of the normal distribution. So anytime someone searches for a somewhat unpopular combination of terms, these spammers come up on top. Banning their IP addresses is not enough, because they move their domains around on a monthly basis...
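The link-exchange tactic Wells describes can be illustrated with a small sketch. This is not Gigablast's method, which the interview does not disclose; it is a minimal example, assuming the crawler has already produced a list of (source URL, target URL) pairs. The function name `reciprocal_host_pairs` and the `.example` hostnames are hypothetical. It flags pairs of hosts that link to each other, which is at most one weak signal, since many legitimate sites also link reciprocally.

```python
from collections import defaultdict
from urllib.parse import urlparse


def host(url):
    """Return the lowercased host part of a URL (empty string if none)."""
    return urlparse(url).netloc.lower()


def reciprocal_host_pairs(links):
    """Find pairs of distinct hosts that link to each other.

    `links` is an iterable of (source_url, target_url) tuples.
    Returns a sorted list of (host_a, host_b) tuples with host_a < host_b.
    """
    # Map each host to the set of other hosts it links out to.
    outgoing = defaultdict(set)
    for src, dst in links:
        s, d = host(src), host(dst)
        if s and d and s != d:  # ignore links within the same host
            outgoing[s].add(d)

    pairs = set()
    for s, targets in outgoing.items():
        for d in targets:
            # .get() avoids creating new keys while we iterate.
            if s in outgoing.get(d, ()):
                pairs.add(tuple(sorted((s, d))))
    return sorted(pairs)


if __name__ == "__main__":
    crawl = [
        ("http://a.example/x", "http://b.example/y"),
        ("http://b.example/", "http://a.example/"),
        ("http://a.example/", "http://c.example/"),
    ]
    print(reciprocal_host_pairs(crawl))
```

Because the spammers in question spread pages over thousands of domains and move them monthly, a hostname-level check like this would miss most of them; grouping hosts by owner or network is outside the scope of the sketch.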
SK Any other quality issues?
MW Yes, there's more. Duplicate content. Mirrored sites. When you do a query, you don't want to see the same content listed a hundred times. In a lot of cases the sites in the search results will all be mirrors of each other. So a good engine will have mirror detection, or duplicate page removal algorithms that will excise the dupes from the search results. But if your duplicate detection parameters are too loose, you might end up removing important pages. It's a touchy subject. Also, you now have businesses, such as Amazon and Reuters, encouraging people to incorporate their XML feeds into their Web pages. Everybody will dress the content up in different ways so that no two pages are exactly alike, but really—content-wise—they are the same and you'll expect a search engine to detect this and cluster the duplicates out of sight.
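The trade-off Wells mentions, where loose duplicate-detection parameters remove important pages, can be made concrete with a small sketch. This is one well-known approach (word shingles compared by Jaccard similarity), not something the interview attributes to Gigablast. The function names, the shingle size `k=3`, and the default `threshold=0.8` are assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text1, text2, threshold=0.8, k=3):
    """True if the shingle sets overlap at least `threshold`."""
    return jaccard(shingles(text1, k), shingles(text2, k)) >= threshold


if __name__ == "__main__":
    a = "the quick brown fox jumps"
    b = "the quick brown fox leaps"
    print(jaccard(shingles(a), shingles(b)))    # 0.5
    print(is_near_duplicate(a, b))              # False at threshold 0.8
    print(is_near_duplicate(a, b, threshold=0.4))  # True once loosened
```

The last two calls show the "touchy" part: the same pair of pages is kept or dropped depending only on where the threshold sits. Pages that wrap an identical XML feed in different templates would additionally need the boilerplate stripped before shingling, which this sketch does not attempt.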
SK How do you specifically address some of these issues in Gigablast, such as taking the big intersection of doc IDs or dealing with spam?
MW I would really love to get into some of the other technical challenges and ways I address them through Gigablast, but if I go too far I may end up jeopardizing some of my trade secrets. When making public disclosures, I have to balance my technical side with my business side. Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing. When you run a search engine, all you have is the code. It's just software. So it's all about algorithms. You have to protect what you have, and the only way to do that is to keep your mouth shut sometimes.... On the technical side of things, I would say that we are just beginning to realize the full potential of the massive data store on the Internet. Companies are struggling to sort and filter it all out.
SK I'm interested to know what you think of Google. Can you tell us something about that?
MW Google is definitely the one to beat. It has a near monopoly on the search market, but that's because it wisely focused on quality search results when everybody else was too busy turning into a portal and neglecting their search departments. Ahem....
I think it is important to point out the guy interviewed runs a tiny search engine out of his house off of a handful of servers. Google has some amazing things going on under the hood, though. Their biggest coup has been to manage such a HUGE distributed system. I mean huge.
Posted by: heet on April 23, 2004 09:26 AM
Here is a useful how-to site on searching the web:
http://www.searchengineguide.com/howtosearch.html
The first link takes you here:
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html#LinkwithQ
(Just a little plug for Berkeley in deference to our host.)
And then there is this site with a critical view of Google (there is a history here, just so you know, but some interesting stuff nonetheless).
http://google-watch.org/
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"I think it is important to point out the guy interviewed runs a tiny search engine out of his house off of a handfull of servers."
Well, what of it? I thought we were evaluating the quality of his ideas here, not the depth of his wallet; besides, everyone has to start somewhere, and since when have "liberals" preferred to root for those with deep pockets over the small guy?
Matt Wells makes important points about the need to deal with link spamming and repetitive content, and as he hints, there are already ways available of dealing with both problems. The difficulty with discussing these solutions, as Wells points out, is that short of resorting to patents, would-be entrepreneurs in the field of search have little protection against the established players stealing their ideas, so maintaining trade secrecy is paramount.
What this translates into is that people like Wells can stick to raising the problems bedevilling search on public forums, but when pressed on possible solutions, they either have to keep mum, or else face accusations of "empty boasting" from those who cannot imagine that anyone outside the dynamic duo of Brin and Page without millions in the bank could possibly have something interesting to bring to the table - as has been my own experience on this particular topic.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32) - GPGshell v3.10
Comment: My Public Key is at the following URL:
Comment: http://www.alapite.net/pgp/AbiolaLapite.txt
iD8DBQFAiX62OgWD1ZKzuwkRAll/AJ9q/1+gQKRbVVpFiD8wB7pxe/2q8gCfa/5r
IQSpew5vLd626GZu3yrVXb4=
=/OLy
-----END PGP SIGNATURE-----
Abiola sez:
"I think it is important to point out the guy interviewed runs a tiny search engine out of his house off of a handfull of servers."
Well, what of it? I thought we were evaluating the quality of his ideas here, not the depth of his wallet; besides, everyone has to start somewhere, and since when have "liberals" preferred to root for those with deep pockets over the small guy?
-----------
I'm not sure what your point is. Pinging me for being a "liberal" (nice scare quotes my man) or rambling about the little guy? My point, however, was to show that this person may not be exactly in the know regarding the cutting edge search engine techniques.
Posted by: heet on April 23, 2004 05:09 PM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"My point, however, was to show that this person may not be exactly in the know regarding the cutting edge search engine techniques."
And *my* point was to show that the size of the guy's wallet tells you absolutely *nothing* about whether or not he's "in the know." How does his possible lack of knowledge follow from the fact that he's operating with very little by way of resources?
Your reasoning is a classic case of assuming that only the big guys can have anything worthwhile to say, which leads to the obvious question of how they went from small to big in the first place.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32) - GPGshell v3.10
Comment: My Public Key is at the following URL:
Comment: http://www.alapite.net/pgp/AbiolaLapite.txt
iD8DBQFAijuBOgWD1ZKzuwkRAqajAJ9K1Ky4GPJCMk1UHO8j8AyKDs9wBwCcDxTW
EJbkferorUJzWZSr1YsJiiE=
=Mmcj
-----END PGP SIGNATURE-----