Well! This is genuinely weird:
whitehouse.gov robots.txt: Why is whitehouse.gov (the official White House website) disallowing "Iraq" directories from search engine crawling?.... As of Oct 24, 2003 the robots.txt file at whitehouse.gov... has 1,620 "Disallow" statements.... There are 783 instances of the term "iraq" in this file... appended to paths that already exist in the file. These appear to have been added haphazardly, since the term appears in many path names for which no such terminal "iraq" directory exists.... However, this robots.txt file does exclude external search engine robots from some 75 directories that actually exist on whitehouse.gov.
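Counts like these are easy to reproduce. Here is a minimal Python sketch, assuming the text of the robots.txt file is already in hand. The sample rules are illustrative stand-ins (the /cgi-bin entry is made up), not the actual whitehouse.gov file, and the function counts paths whose last segment is "iraq" rather than every occurrence of the term:

```python
def summarize_robots(text, term="iraq"):
    """Return (number of Disallow rules, number whose path ends in `term`)."""
    disallowed = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            disallowed.append(value.strip())
    # A path "ends in" the term when its final segment equals it exactly.
    ending_in_term = [
        p for p in disallowed if p.rstrip("/").split("/")[-1] == term
    ]
    return len(disallowed), len(ending_in_term)


# Illustrative sample only; not the real file.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /agencycontact/text
Disallow: /agencycontact/iraq
Disallow: /holiday/2002/petsculptures/iraq
"""

total, with_term = summarize_robots(SAMPLE)
print(total, with_term)  # prints: 4 2
```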
Michael Froomkin has a number of reactions:
Discourse.net: I had a small cascade of reactions to this (via Eschaton).
First thought: It’s disgusting that the White House is trying to relegate its statements about Iraq to the Memory Hole.
Second thought: It’s great to live in a free country where this doesn’t work.
Third thought: This demonstrates the same level of technical (in)competence we see in so many things this Administration does.
Fourth thought: Maybe it does work more often than not — many people have come to rely on Google. Efforts like this often won’t get spotted most of the time.
Fifth set of thoughts: How do we prevent, or at least identify and publicize and warn about, this sort of activity in the future? Will this mean that commercial databases which keep pristine copies of things and promise not to sanitize still have a place? Can something like archive.org overcome this sort of attack on our online history? Is there anything Congress could or should do about this? (Needn’t ask “would”; we know the answer to that.)
Update: Sixth thought: Well, they just made it much less accessible (although people who rely on Google might get the idea the statements didn’t exist); as far as we know, they didn’t actually delete them. It could be worse. But it’s also more deniable.
Seventh thought: If I ran Google, would I now instruct my spiders to ignore the robots.txt file at whitehouse.gov?
I think the answer to this last question is clearly a yes...
Posted by DeLong at October 27, 2003 12:52 AM
Funny thing is, a lot of the paths in that
robots.txt aren't even valid URLs. For example,
my browser can fetch:
http://www.whitehouse.gov/agencycontact/text/
just fine, but
http://www.whitehouse.gov/agencycontact/iraq/
returns code 404.
So while it appears that the White House webmaster
is trying to prevent indexing of information about
Iraq on whitehouse.gov, it also appears that that
very same webmaster isn't doing a very good job of
managing his robots.txt.
Nicholas
I should read more carefully before posting. Keith
Spurgeon already noted that fact on his site.
Nicholas
Posted by: Nicholas Dronen on October 26, 2003 09:40 PM
To me, this looks as if the White House techies are reasonably honest on the one hand. On the other hand, some boss ordered them to censor the website, which they did. But their hearts weren't in it, so they did it in a way that was mechanical, semi-functional, and guaranteed to be spotted by some vigilant mind on the internet.
Why do I think the White House techies are "reasonably honest"? Because if you really, really want to hide content on your website, you don't use a file system as a (Google-searchable) database. You directly hook your webserver to a real database, let the database manage the files, and have the database's access control decide which article it churns out to whom. That way, you can censor your content, and nobody can tell from the outside.
This censorship attempt ought to create an outrage. But I don't think an outrage would cause the White House to stop censoring. It will simply cause it to make its website database-driven "for efficiency reasons" and continue its censoring in secret. If the outrage materializes, I expect this to happen within a few months.
Posted by: Thomas Blankenhorn on October 27, 2003 01:44 AM
OK OK I'm paranoid or, as Michael would say, an ideological positivist, but when I tried to open the forbidden links and nothing happened I assumed that whitehouse.gov had deleted the files.
Slowly, slowly I realised that the problem is that they are jammed with traffic. Eventually they open to show smiling Iraqis greeting the US armed forces (remember that it was just a few months ago).
Actually, speaking of Iraqis greeting the coalition forces, remember the people in Baghdad who showered the arriving troops with flower petals? Well, just think: they had to have considerable courage to buy and store the flower petals. I mean, let's say the local Ba'ath party guy finds out that you have 10 pounds of flower petals in your apartment just as the infidel invaders are approaching Baghdad. What do you say? "I am a very incompetent florist"? "I was waiting to celebrate the approaching Iraqi army victory"?
Well, now I have my new paranoid theory. No one was reading the whitehouse.gov spin on Iraq, so they decided to pretend they were trying to hide it. That way everyone would surf over to see what they were hiding and be reminded of how glad the Iraqis were to see us.
So you see maybe they are incompetent after all.
Posted by: Robert on October 27, 2003 05:07 AM
You keep asking "why oh why are we ruled by these fools?"
After resisting the obvious for quite a while, I'm afraid I now think the answer is this:
http://www.scoop.co.nz/mason/stories/HL0310/S00211.htm
--and a bunch of other stories like it. Sorry about that.
Posted by: Patrick Nielsen Hayden on October 27, 2003 05:33 AM
I can think of two scenarios that aren't entirely driven by motivations of a vile sort.
The site managers may be adding things to robots.txt that are older than some threshold so that searches are more likely to return fresh material. That's not unusual at a site that is trying to draw traffic to new material rather than old. It may be that those iraq pages all appeared as a group exactly N months ago.
The other scenario, quite common, is that the site managers have moved those documents to another venue. At that point, to avoid broken links, they have two choices: they can tinker with the site so that automatic redirects send old linkers to the new venue, or they can just leave the old documents where they were. Not wanting to encourage further linking, they add them to robots.txt.
This business of manipulating search engines is getting frequent enough that it amounts to a form of systematic censorship. There are a number of stories that I can cite in which quite recent stories vanish not only from a website, but from Google's cache. An interesting case occurred with a protest in front of CNN. The matter was duly reported, and appeared in a Google search. Then it vanished from the website, but could be found in Google's cache. Within 24 hours, even that was gone.
Were it an isolated incident, I'd shrug it off as tech-naw-low-gee. But other incidents have occurred. One is nicely documented in several Salon articles.
Posted by: Charles on October 27, 2003 06:01 AM
It looks like an accident, or maybe a prank. The WH site has text versions of most pages as an alternative to the graphics-and-text pages. Someone wanted to set up a robots file to block the text version from indexing so that the search engine wouldn't find two hits for each page, a fairly common practice. Whatever script they used to generate this also generated the same URLs but with "iraq" on the end rather than "text". I doubt if any of the URLs with "iraq" on the end ever worked.
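To make that guess concrete, here is a purely hypothetical Python sketch of how a generator script could emit an "iraq" rule for every directory alongside the intended "text" rule. Nothing is known about the script the White House actually used; the function name and the suffix list are assumptions for illustration only:

```python
def build_disallow_lines(directories, suffixes=("text", "iraq")):
    """Emit one Disallow rule per (directory, suffix) pair.

    Hypothetical: if "iraq" were added to the suffix list by mistake,
    every directory would get an "iraq" rule whether or not such a
    subdirectory exists on the server.
    """
    lines = ["User-agent: *"]
    for directory in directories:
        base = directory.rstrip("/")
        for suffix in suffixes:
            lines.append(f"Disallow: {base}/{suffix}")
    return lines


for rule in build_disallow_lines(["/agencycontact/"]):
    print(rule)
```

A script like this never checks that a path exists, which would be consistent with "iraq" rules appearing for directories that return a 404.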
Posted by: dc on October 27, 2003 06:27 AM
uh, dc, i just clicked on a bunch of the links in the robots.txt file and they took me straight to nice graphics-filled pages.
not saying you're wrong, but...
"I think the answer to this last question is clearly a yes..."
Google's credibility in dealing with web site operators is important. You should know that life, for those who have one, isn't a 24/7 no-holds-barred political struggle.
I'm glad you're not running Google, or ever likely to. Thankfully the twenty-somethings over there have more sense.
Posted by: JK on October 27, 2003 09:07 AM
Which links did you try, Ross?
As an experiment I used a script to load the first 200 links from the robots.txt file that end in "iraq". Without exception they redirected to
http://www.whitehouse.gov/error-404.html, which means they do not link to any real content on the site. This is hardly surprising for URLs like
http://www.whitehouse.gov/holiday/2002/petsculptures/iraq/
I would also like to know which links "Robert" opened to find "smiling Iraqis greeting the US armed forces".
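A check along those lines can be sketched in a few lines of Python. This is not the script used for the experiment above, only an assumption about what such a check might look like: the helper names are hypothetical, and the error-page URL is the one reported in this comment.

```python
from urllib.request import urlopen

# Error page reported above as the redirect target for missing pages.
ERROR_PAGE = "http://www.whitehouse.gov/error-404.html"


def final_url(url, timeout=10):
    """Follow redirects and return the URL finally served, or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except OSError:  # covers URLError, HTTPError, and timeouts
        return None


def find_dead_links(paths, base="http://www.whitehouse.gov",
                    resolve=final_url, limit=200):
    """Check the first `limit` paths ending in /iraq; return (checked, dead)."""
    candidates = [p for p in paths if p.rstrip("/").endswith("/iraq")][:limit]
    dead = []
    for path in candidates:
        landed = resolve(base + path)
        if landed is None or landed == ERROR_PAGE:
            dead.append(path)
    return candidates, dead
```

Passing the resolver in as a parameter lets the filtering logic be tried against a stub before any real requests are sent.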
Posted by: dc on October 27, 2003 09:07 AM
I found this while perusing the White House Website:
Steve Friedman, Director of the National Economic Council, will take your questions today at 2:30 p.m. (ET) at www.whitehouse.gov/ask.
http://www.whitehouse.gov/ask/question.html
I'm sure you folks can come up with some good questions for Steve!
Posted by: Kosh on October 27, 2003 09:17 AM
My first thought was this was an attempt to thwart *archiving of previous versions*. Google has the "cached" link which shows the version of the page that was actually indexed -- behaving a bit like an archive. Without that archiving, the website managers can freely alter the text on those pages as much as they want and claim that that was the original text from x-days ago.
But as has already been mentioned, if it's an actual attempt to do that, it's a lousy one.
Posted by: Ken Overton on October 27, 2003 09:23 AM
Subj.: "Technical (In)competence" hypothesis
http://www.whitehouse.gov.crazy.sytes.org/
subj.: third thought
http://www.whitehouse.gov.crazy.sytes.org/
If I ran Google, I'd configure the spiders to spend extra time searching the very directories described by the robots.txt file.
It's not often someone tells you what they don't want you to look at.
Posted by: Jon H on October 27, 2003 12:28 PM