
January 18, 2005

Chuq Van Rospach on Comment Spam

We cannot defeat the comment spammers, he writes, but we must fight them nonetheless:

Teal Sunglasses: Why 'rel="nofollow"' isn't the answer. Or is at best only a partial one.

I want to applaud everyone involved for coming together and trying to deal with comment spam. Nofollow is an interesting idea, and it may well help. I don't wish to be overly negative about things, but I just can't convince myself that this is really going to solve the problem. (I will happily settle for 'make it better', though.)

Problem 1: this only works if people upgrade their systems and use the new feature. That's not a problem at TypePad, where the upgrade is managed by the company -- but if you look at the history of security upgrades in the general user community, it's not pretty. Just look at how well we've solved the zombied-PC problem (heck, look at MY blog, where I'd expected to upgrade months ago -- and I DO upgrade stuff). The internet is littered with "install and forget" installations that never get upgraded, never get patched. So I'm immediately skeptical that we're going to get a critical mass of usage that causes the spammers to decide it's not worth it any more and go annoy someone else's technology.

Problem 2: what does it cost them to comment spam? If you assume this is an attempt to make comment spam uneconomical (monetarily, or simply "not worth their time") -- look how successful that's been in the e-mail spam world. The costs are likely smaller for comment spam, because the spammer doesn't bear the high network traffic volume that e-mail spam carries -- yet even when 60% or 70% or 80% or more of the spam is blocked, you don't see spammers giving up, and I haven't noticed spammers (or worm writers) giving up on domains that are good at blocking spam or protecting PCs from being infected. The open relays and unprotected servers and the spam that does get through are what they care about, so they hammer away at everyone, because there's no reason not to.

So I guess, based on what we see elsewhere, my worry is that we'll implement this and we'll still get hammered, even if the comment spam that does get through is useless to them, because they don't care. The smart spammers will likely teach their tools to look for the flag and go elsewhere rather than waste their time -- but if the "install and forget" mentality exists in OS installs and blog software installs, it exists as well among spammers who download the scripts and use them without understanding them or really knowing what they do. So those script-kiddie spammers will likely hang around no matter what....

But don't get me wrong: I'm all for this. I don't want people to think that because anti-spam measures haven't stopped spam we shouldn't use them -- the opposite is true. This is a good step towards getting the problem under control; I just don't think it'll have the intended effect of making the spammers go away. If we're lucky, those of us who implement it will find the spammers going off to annoy other people (the "Club" effect -- we're not out to stop car theft, we're out to make them steal your car, not mine), but even if not, it's a long-overdue step towards dealing with the problem, and I'm really happy to see the groups involved get together and take it. So reservations or no, I fully support it and want to do what I can to make it work.

And we should look further into what we can do next....

If I were Google, I would have taken a different approach. I would not have said, "if you make sure links have a 'nofollow' tag, we won't include them in our index." Instead, I would have said, "links that are created by somebody other than the page author--i.e., links inside blocks labeled as comments--will *not* be included in our index unless they have a 'follow' tag inside them." Given where we are now, I think the second approach would have done more good and also improved the quality of Google's indexes.
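Roughly, the two policies differ like this (a toy sketch, not anything Google shipped; the "inside a comment block" test is the hypothetical part, since no standard markup labeled comment blocks):

    # Toy comparison of the two indexing policies. The in_comment_block
    # flag is assumed to come from markup labeling comment blocks --
    # markup that did not exist as a standard, which is the hard part.

    def google_policy(rel: str) -> bool:
        """Shipped scheme: follow every link unless somebody opted it
        out with rel="nofollow"."""
        return "nofollow" not in rel.split()

    def proposed_policy(in_comment_block: bool, rel: str) -> bool:
        """Alternative scheme: links inside a labeled comment block are
        ignored by default unless opted in with rel="follow"."""
        if in_comment_block:
            return "follow" in rel.split()
        return True

The difference is opt-out per link versus opt-in per comment block: under the second policy, an abandoned, never-upgraded blog leaks no PageRank to spammers by default.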

Posted by DeLong at January 18, 2005 09:08 PM

Comments

[Be polite.]

Posted by: at January 18, 2005 09:49 PM


[And a comment spam makes it through...]

Posted by: at January 18, 2005 10:39 PM


[Be polite.]

Posted by: at January 18, 2005 10:45 PM


How would that have helped? Wouldn't comment spammers be the ones who know all about rel=follow? So wouldn't that mean only their links get PageRank, and not the links in good comments?

Posted by: Aaron Swartz at January 19, 2005 09:25 AM


How exactly would Google's spiders know what a comment box is? Last time I checked there wasn't a <comment> HTML tag.

The problem is one of semantics... Just by looking at an anchor tag, Google couldn't tell whether it should follow the link and add it to its mystery PageRank stew. That semantic didn't exist in HTML before, so you had to add it in some way... i.e., the rel="nofollow" attribute.
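To make that concrete (a toy illustration; example.com is just a placeholder):

    # Before rel="nofollow", these two anchors looked identical to a
    # spider deciding what should feed PageRank:
    endorsed = '<a href="http://example.com/">a link the author vouches for</a>'
    opted_out = '<a href="http://example.com/" rel="nofollow">a comment link</a>'

    def feeds_pagerank(anchor_tag: str) -> bool:
        # Crude substring check, for illustration only; a real spider
        # parses the rel attribute properly.
        return 'rel="nofollow"' not in anchor_tag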

We're a long way from Google being able to *infer* general meaning from content. The comments you made earlier about PageRank (you wanted your rank to apply to things like Economic History, not book reviews) are a wish for Google to infer meaning from content. PageRank is Google "inferring" page importance via link structure, but it is based on just that, link structure (i.e., little bits of HTML code). You're asking for, and we all want, a full-blown inference engine (a la the human brain).

Posted by: Will Ambrosini at January 19, 2005 09:26 AM


Jeez, I got my webpage wrong on that last comment... Maybe I should go back to being a lurker.

Posted by: Will Ambrosini at January 19, 2005 09:40 AM


So what is this person's recommendation? Sounded kind of resigned to failure. I liked DeLong's idea.

Posted by: tre at January 19, 2005 01:38 PM


I think his recommendation was to say that Google's solution is okay.

I like DeLong's idea too, but given that there is no clear way to determine what was generated by comments, Google can't exactly have such a strict policy. If they could, I bet they would.

Google does use heuristics about page structure to weight links, so they might already give less weight to links they think are in comments, or are spam.

The other key thing I've noticed is that older, less-updated pages get less clout on Google. This is key, because most of the sites this applies to are dynamically generated. That means the nofollow attribute can be applied retroactively very easily. That leaves spam on sites that are old, stale, and not updated; any links on those sites will slowly die off in relevance.

Anyhow, the change is gradual, and it will take a while to have any effect. But it is a good step, and one of many that will help fight comment spam. If anything, I wish there were a tag to let spammer bots know I have protected my site. People who use group services like TypePad will probably benefit from rel="nofollow" long before individual sites do.

Posted by: Brad at January 19, 2005 03:39 PM


Aaron, comment users don't add the rel="nofollow" (or, if Dr. DeLong's idea were somehow implemented, rel="follow") attribute; the blog software does. (View source on this page and you'll see that Movable Type added the rel="nofollow" attribute to tre's website link.)
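Roughly like this (a toy sketch; Movable Type's actual implementation surely differs, and a real engine would use a proper HTML sanitizer rather than a regex):

    import re

    def add_nofollow(comment_html: str) -> str:
        """Rewrite each opening <a ...> tag in a stored comment to carry
        rel="nofollow" before the page is served."""
        def patch(match: re.Match) -> str:
            tag = match.group(0)
            if "rel=" in tag:  # don't double up an existing rel attribute
                return tag
            return tag[:-1] + ' rel="nofollow">'
        return re.sub(r"<a\b[^>]*>", patch, comment_html)

    # add_nofollow('<a href="http://example.com/">hi</a>')
    # returns '<a href="http://example.com/" rel="nofollow">hi</a>'

Because the rewrite happens when the page is generated, the commenter never sees or controls the attribute.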

The Professor was not suggesting that Google stop following all links by default and only follow links that have the rel="follow" attribute. He wants Google to recognize comments for what they are (i.e., generated by comment users and thus susceptible to comment-spam abuse) and then not follow links in comments by default.

I'm suggesting that Google's solution is the only viable one. The Professor's solution amounts to implementing an inference engine that can understand any context to tweeze meaning out of any content. Google isn't magical... how would it know what a comment was? I suggest the brain is the only known implementation of such an inference engine, and it will be very hard to create a software implementation (some philosophers say it will be impossible... see Searle's Chinese Room).

Posted by: Will Ambrosini at January 19, 2005 03:41 PM


rel=follow wouldn't work. Anyone can add that to a link, including spammers. Comment links don't really need to enter into the PageRank formula at all. Google said themselves that this isn't the solution, but it is a pretty good one. And the PageRank situation was their fault, so they should have been the ones to come up with a workaround (which is all this is).

Posted by: danny at January 19, 2005 07:15 PM


[Pointless]

Posted by: at January 19, 2005 10:16 PM


Brad, for your suggestion to work, Google's spider would have to parse every Web page it encountered to determine where comment blocks began and ended. Given the speed at which Google is trying to keep up with the Web, and given how much badly formed HTML is out there, it's much easier to search for <a> tags, ignore the higher-level structure of the page, and then pull out the information inside the tag.
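Something like this toy scan, in other words (a real spider would be far more forgiving about quoting and attribute order):

    import re

    # The cheap approach: find anchor tags anywhere in the raw HTML, read
    # the attributes inside them, and never work out where comments begin.
    ANCHOR = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>', re.IGNORECASE)

    def links_to_crawl(page_html: str) -> list:
        urls = []
        for match in ANCHOR.finditer(page_html):
            if 'rel="nofollow"' not in match.group(0):  # per-link opt-out
                urls.append(match.group(1))
        return urls

No structural parse, no guessing at comment boundaries -- just the information inside each tag.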

Posted by: Seth Gordon at January 20, 2005 08:59 AM


Well, magic isn't necessary. The problem is that there are so many different comment systems, with no standard way of specifying which part of a web page is comments and which isn't.

I think it's probably also easier for the spider to make the follow/no-follow decision from the information in the link itself, rather than from the context around the link.

Not sure how to address the problem that bloggers don't update. Even Microsoft's automatic updates haven't fixed this. Maybe Google should implement a spammer blacklist?

Posted by: fling93 at January 20, 2005 06:01 PM


["comment spam" is not good as a post title, is it?]

Posted by: at January 21, 2005 05:31 PM