Abiola Lapite comments:
Foreign Dispatches: Google and Metadata: [What] I'd like to discuss is why exactly it is Brad finds it so much easier to discover things he's interested in by using Google to search his own website than it is for him to do so by trawling through his own machine, and it's a point that has a great deal of bearing on the likelihood of success of ambitious proposals like Microsoft's WinFS. As Brad himself notes, the real reason why Google is able to do such a good job with his data is rather mundane: by leaving his notes on the web for all to see, Brad DeLong makes it possible for others to link to and comment on his posts, and in so doing they create for him the metadata required for him (and anyone else) to do efficient searches of his own website. Metadata creation can be tedious in the extreme if one has to do it all oneself, but given the combination of a web of millions, each of whom need only do a tiny bit of the job, and a high-traffic website which gets thousands of reads on a daily basis, all the ingredients are in place for Google to return results of far better quality than any desktop tool can currently offer.
What does all this mean for WinFS? Well, for one thing, given what we know about the (lack of) advances in AI over the last few decades, it means that the only way WinFS could possibly come close to living up to Microsoft's promises for it would be for it to expose the private datastores of Windows users to the rest of the world, a prospect that is as uninviting as it is unlikely, especially for a company with as poor a record on security as Microsoft's. Google needn't worry too much about competition from Redmond just yet.
It certainly sounds like Abiola is making a lot of sense. I share his sense that the key problem lies in creating the metadata about files that appears to greatly improve the quality of searching based on full-text indexing. I believe that those who have the mental discipline to create their own metadata about files on a regular basis are few and far between. I recognize his--important--point that my strategy of "posting things that I want to find later" works only if one has a website with high enough traffic that Google can serve as an effective metadata aggregator. And I share his skepticism about whether Longhorn's WinFS is going to prove to be solving the right problem.
I want to think about this and related issues more systematically. Let's take Donald Rumsfeld's four categories: the known knowns, the known unknowns, the unknown knowns, and the unknown unknowns:
The known knowns: The things that you know, and that you know that you know. Here there is no information retrieval problem at all.
The known unknowns: These are the things that you know are on your hard disk someplace, but you're not sure where they are or what, exactly, they say. Your recollection needs to be refreshed. Here is where search based on full-text indexes plus high-quality metadata shines. We know how to make full-text indexes. We know how to search such indexes plus metadata. The only potential problem is a social engineering one: how to make sure that high-quality metadata about files is created and maintained.
The unknown knowns: Once you have found your known unknown, you then want to find what other files on your hard disk are related to it. The same keyword and text search won't necessarily pick them up. This is what subdirectories--folders--are supposed to be for: one of the benefits of grouping related files in subdirectories is that one can then thrash about and get hold of related information. And, because one file may well belong to more than one possible group of unknown knowns, we have symbolic links--aliases. Once again, however, there is a social engineering problem: how to make sure that files are sorted into the right folders and that the right symbolic links are created, for this task can also be "tedious in the extreme." And we are vain and lazy infovores.
The unknown unknowns: These are things that one would search for if one remembered enough about what was on one's hard disk (or knew enough about what was on the web) to know that one should look for them. Here we have a very difficult problem: how do you jog someone's memory or tell them enough about what is known so that they can figure out what kinds of things they can search for? I think that this is a very hard problem indeed.
The "known knowns" problem is not a problem for anyone. Exposing a large chunk of one's hard disk to Google via the Jon Udellesque strategy of "posting things so that you can find them later" solves the "known unknowns" problem for those of us lucky to have high-traffic websites that Google can sink its fangs into. But what about those of us who aren't lucky enough to have high-traffic websites? The only advice I can give is to build a time machine, travel back to 1999, and begin blogging then under an assumed name...
And the "unknown knowns" problem (for those of us who lack the mental discipline to maintain a coherent subdirectory-and-symbolic-link organization and whose hard disks look like a librarian's worst nightmare)? I have no good answer. And the "unknown unknowns" problem? Ha!
Posted by DeLong at June 18, 2004 04:29 PM
"how do you jog someone's memory or tell them enough about what is known so that they can figure out what kinds of things they can search for? I think that this is a very hard problem indeed."
As far as your computer (or department file collections) goes, what might be useful would be a book-style index for your files, where you can see the terms in the index, which you could browse through.
You'd just see a list of words, in alphabetical order. Click on one, and you'd see a list of files with that term in the title or contents, perhaps with a text summary or the first paragraph if possible.
It would probably require a huge stoplist, to exclude words that wouldn't show up in a book's index because they aren't significant enough. You don't want to lose important words because you're skimming past words like "after". This would probably require a way for the user to tune the list.
It'd be useful to have tools with which users (or interest groups) could create domain-specific termlists, which could be shared and used to create domain-specific views into the filesystem.
You might run a bunch of economics articles, or an economics glossary, through a tool, which would create a corpus of economics-related terms, which could then be used to narrow down the filesystem index to show only files containing those terms.
Once you have the set of terms in all your files, you could just find the intersection of that set with the set of economics terms you built, and show the resulting list, which would be the list of economics terms which are found in some file. Click on the term, get the list of files.
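A minimal sketch of the index-and-intersect idea, assuming the files have already been read into memory as plain text. The stoplist, the file names, and the economics termlist here are all made up for illustration; a real stoplist would be far larger and user-tunable.

```python
import re
from collections import defaultdict

# Hypothetical stoplist; a real one would be far larger and user-tunable.
STOPLIST = {"a", "after", "and", "in", "is", "of", "on", "the", "to"}


def build_index(files):
    """Map each significant term to the sorted list of files containing it.

    `files` is a dict of {filename: text}; reading from disk is left out.
    """
    index = defaultdict(set)
    for name, text in files.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            if term not in STOPLIST:
                index[term].add(name)
    # Alphabetical order, like the index at the back of a book.
    return {term: sorted(names) for term, names in sorted(index.items())}


def domain_view(index, termlist):
    """Keep only the index entries whose term is in the domain termlist."""
    return {term: index[term] for term in sorted(set(index) & set(termlist))}


files = {
    "notes.txt": "Inflation rose after the tariff on steel.",
    "memo.txt": "The tariff and the deficit.",
}
index = build_index(files)
economics_terms = {"inflation", "tariff", "deficit", "elasticity"}
view = domain_view(index, economics_terms)
```

Here `view` holds only the economics terms that actually occur in some file, each with its list of files; "elasticity" drops out because no file mentions it, and "after" never makes it into the index at all.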
I suppose similar term lists could be built over time; perhaps there could be an OS X Service that would add the selected words to a "priority termlist", of terms you're most interested in. Or emails you send could be scanned for terms, or there could be a Service for scanning terms off the current webpage.
Posted by: Jon H on June 18, 2004 06:26 PM
Isn't most of the meta-data about a web page that is created by google just the product of a simple indexing of the words contained within an article? Even Brad doesn't get so many post-specific links that the relative page rank of different posts will be a robust guide to where that phone number is?
If that is so, the bulk of the value of Brad's procedure would be gained just by indexing e-mails, pdfs, word files, power point files, address book entries and so on in the manner of the programme DEVONThink mentioned on here recently but more pervasively and automatically, like grep but faster and more easily configurable.
Much of what remains could be replaced by importing some similarity measures, possibly tailored to specific purposes. The large news distribution services, Factiva and so on, have such things and companies like Verity and Autonomy can do a lot too.
Combined with disks large enough to store every page you view and the order in which you view them, your weblife bits as it were, your files on computer could be made substantially more useful without having to be exposed to the outside world.
Posted by: Jack on June 18, 2004 07:18 PM
Being active in the area of web search I can't resist adding my 2 cents. The meta data created by other users, such as links, is important for web search quality, but also should not be overestimated in its impact. It makes for a great story (e.g., Google's Pagerank) but probably only accounts for part of the observed difference in quality - other aspects are the different data types and search tasks (what are you trying to achieve? Finding a homepage?) in the two scenarios.
There is however another source of data that is more easily available on your desktop than on the web due to privacy issues: personal access patterns - your personal history of accessing data and documents. This is potentially a very valuable source of information, and there is an interesting research project called "Stuff I've Seen" by Susan Dumais at Microsoft Research that tries to harness this. Also, search engine manipulation is maybe less of a problem on your desktop. So, there is a lot of hope for search of personal data even without any new AI techniques, but many tools have not arrived in broad circulation yet.
Abiola is overstating the difficulty of gathering metadata, and there is an overstatement of the scope. Google scope is all Websites.
WinFS scope is a single computer -- eventually to a larger network.
Currently Windows XP has pretty good indexing functionality (hidden, and it must be activated) that would work better than Google does now for Brad.
Obviously, I can't look under the hood of Google, but I think that my Google searches are most successful when I can hit unique identifying keywords or keyphrases. I don't think this is a solution that changes much because of user comments in a blog...
Grep in a big directory of all my memos has worked out reasonably well so far...
Posted by: ArC on June 18, 2004 07:44 PM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"the bulk of the value of Brad's procedure would be gained just by indexing e-mails, pdfs, word files, power point files, address book entries and so on in the manner of the programme DEVONThink mentioned on here recently but more pervasively and automatically, like grep but faster and more easily configurable"
Proffy,
First of all, as I explicitly mentioned in my post, Windows 2000 and its successors already have the capacity to carry out file indexing, provided the necessary filters are hooked into the Indexing Service API. The problem with this is that it's a gigantic resource hog, and it isn't widely supported by third-party vendors; see
http://msdn.microsoft.com/library/en-us/indexsrv/html/ixufilt_912d.asp
in the Platform SDK for more information.
The big problem with all of the above functionality is that most of the really interesting stuff on our hard drives is stored in binary formats that traditional filters aren't good at indexing in a meaningful manner: for instance, no filter I know of will help you when you're looking for that 1 photo out of 15,000 all stored in your "My Pictures" directory. Then we've got files that are either stored in proprietary formats, or, like Postscript, are written in Turing-complete languages, which means there's no feasible algorithm that can ever be guaranteed to search through all of them in a finite time period (Halting Problem).
"There is however another source of data that is more easily available on your desktop than on the web due to privacy issues: personal access patterns - your personal history of accessing data and documents. This is potentially a very valuable source of information, and there is an interesting research project called "Stuff I've Seen" by Susan Dumais at Microsoft Research that tries to harness this."
But this profiling of access patterns does nothing to help in just that scenario Brad mentions as being most vexing, i.e., when you don't even know you have something stored on your machine because it's been ages (if ever) since you last examined it. This is going to be an ever more pressing problem as storage costs continue to plummet.
A distributed system where one can call on the access profiles of potentially millions of individuals is likely to be much more comprehensive than anything based on a single individual's activity record; note that filesystem caching wouldn't be effective were it not the case that typical usage patterns are highly predictable.
"Abiola is overstating the difficulty of gathering metadata, and there is an overstatement of the scope."
theCoach,
Au contraire. There is no single more difficult problem in information retrieval and knowledge representation that I'm aware of. Try following the discussions that have raged in the W3C over the years with respect to RDF, OWL, XML Schemas and the like and you'll get a better feel for just how tough the problem is.
"Google scope is all Websites.
WinFS scope is a single computer -- eventually to a larger network."
Sometimes one has to expand a problem's domain the better to solve it. If you've got 15,000 images stuck on your hard drive, do you really think you'd be better off waiting for some super new piece of technology to annotate them for you, or just letting random visitors on the Web link to them according to their interests, with helpful information like "an amazing picture of Victoria Falls at sunset" where the filename is just "12586.JPG"?
The best thing about PageRank is that it also weights links by the importance of the linkers, so the most important metadata is likely to be precisely that which weighs most heavily in the balance.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32) - GPGshell v3.10
Comment: My Public Key is at the following URL:
Comment: http://www.alapite.net/pgp/AbiolaLapite.txt
iD8DBQFA07g7OgWD1ZKzuwkRAr7lAJ9x5iEIerNhiAcnnkNR0CSw0KGZCACfXGAR
r3RT3RFe1X28cC3R4T3x1Yo=
=RvJE
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Amusingly enough, here's a quotation I've just discovered while searching Google for the term "distributed annotation"; I say "amusingly" because the quotation happened to be located at the following URL
http://www.j-bradford-delong.net/movable_type/2003_archives/000032.html
which in fact I only stumbled upon due to someone else's annotation of Brad's post, at the following URL:
http://www.klynch.com/archives/000030.html
What's that about the unimportance of third-party links again?
Anyway, here's the quote:
Larry Page: "It wasn't that we intended to build a search engine. We built a ranking system to deal with annotations. We wanted to annotate the web--build a system so that after you'd viewed a page you could click and see what smart comments other people had about it. But how do you decide who gets to annotate Yahoo? We needed to figure out how to choose which annotations people should look at, which meant that we needed to figure out which other sites contained comments we should classify as authoritative. Hence PageRank."
I fail to see how any locally-confined system like WinFS could be relied upon to pull such an unexpectedly relevant quote to the top of the search list in the manner that Google just did. What if the post above were just one of several thousand notes on Brad's hard drives which happened to mention annotation?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32) - GPGshell v3.10
Comment: My Public Key is at the following URL:
Comment: http://www.alapite.net/pgp/AbiolaLapite.txt
iD8DBQFA07raOgWD1ZKzuwkRAimqAJ0dTkrisnOsWYjmXcFOOGqbsLm56QCeMQ9w
QQ6sLP6fCIvGDCdjo0riz1Q=
=I7ym
-----END PGP SIGNATURE-----
Hmm...I think there are really two problems here (at least). Brad is right that it's easier to use Google to search one's own website than to do it, er, "manually". Also note that you don't have to have a huge amount of traffic for this to work; my course websites all only have a smallish number of links incoming, and they seem to be indexed more completely than I might prefer at times. :-) But I'm not sure that this is easier because of metadata, or because of good old-fashioned flattening of the representations. Suppose I want to find out where the heck I used the word "pulvinar" in my last course; was it in the lecture notes? The slides? A PDF version of some of that? Where? In this case, all I need to know is that the (rarish) keyword was web-visible in my domain, and a *one line query* can reliably find it. Clearly, this isn't quite as useful if I'd used the word 50 times in 30 places, but even then, Google hands me back enough context to help me generate a better query. (I guess this is related to the known/unknown stuff DeLong was talking about).
But that's just a small, literal-minded problem. The value that Google adds above that is that it has some notion of which mentions of a given word are "more important" on the web in general, and that's huge given how many "hits" you get with even a non-trivial query. But even this won't save you from the fact that the "best" resource for you might be one that didn't use the term you searched on exactly, but rather an abbreviation or an alternative label or...any of the things you would need real knowledge to uncover. That's sickeningly hard.
But me? I just want a gmail account so I can find stuff in my email inbox as easily as I can find it on the web. (Grumble, grumble; I've been sending and getting email since 1986 and have the complete archive and poorer search tools for this stuff than the average teenager has for pirated music.)
Posted by: Jonathan King on June 18, 2004 09:16 PM
I think you've mis-identified the last group - the unknown unknowns. Let me re-tag the whole to try to clarify, and incidentally point to possible solutions.
The array is [subject area] by [item]. [SA] may (and probably is) subdivided - example of a typical US library is that the economics texts are going to be found in the 330s, of which macroeconomics is going to be in the 339s, and Brad's textbook is going to be about 339.076. Item can be equally subdivided depending on whether it's a relevant phrase, sentence, paragraph, and so on up to sheer body of work.
In this array, subject area is "know it exists" while item is "know details". I might know the subject area of economic geology (Dewey 553) - exists, but I know absolutely nothing that fits within it. (for the curious, mostly gems and minerals, though my local library has a book about water in that category.) Complementing this, I might know of Robert Heinlein's "Starship Troopers" but be completely ignorant that there are commentaries and discussions on both the book and the views espoused in the 813s (American Fiction) and the 321s (Political Science - systems of governments and states), and possibly a couple of other places. Unknown-Unknown is, at its broadest, everything else. But what you're really interested in among all these is a subset - that is, what out of these are of interest to you. OK, that's the start point, let's see where we go.
What Brad described as unknown-unknown is really known-unknown - you know the subject, you're just hoping to find more that meets the subject - perhaps slightly spun-off (and so falling under multiple subjects - see the item above - which will conveniently lead you to another subject which moves you into a new unknown-known region.)
In a library, the KK, UK, and KU are easily resolved these days - go to the catalog, be it card or electronic - and follow the subjects. Don't pass by the "see also" and "related term" and "near term" sections - and while you're at it take some time to check the superior (that is, if you're at 321 go look at the 320s) and adjacent (the 322, 323, ..., 329) classifications. And if you've an author you deem relevant, search by his or her name. Don't fail to do the same with the publisher. Now we've got a manual - a non-web - method of accomplishing part of what we're trying to do. We can try to use this method directly on the web - well, sort of.
I'd expect to see it in two techniques electronically - and I've seen both in some form at one time or another in other search engines. The first is a thesaurus search and the other is a chained search.
The thesaurus search is: "search for similar terms". In a perfect system all possible similar terms, phrases and concepts would be set out for you to choose amongst. In a more likely world the more commonly used terms, phrases and concepts are all you'd see. Ideally there'd be some sort of sub-cuing system to cope with the times that different subject areas use the same term or phrase in different ways - the so-called "false hits".
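A minimal sketch of what such a thesaurus search might look like, assuming a hand-built table of similar terms. The table, the document names, and the whitespace tokenizing are illustrative assumptions, not how any particular search engine does it, and the sketch makes no attempt at the sub-cuing that would weed out false hits.

```python
# Hypothetical thesaurus: each term maps to terms treated as similar.
THESAURUS = {
    "car": {"automobile", "auto"},
    "recession": {"downturn", "slump"},
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus any similar terms the thesaurus lists."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= thesaurus.get(term, set())
    return expanded


def thesaurus_search(terms, documents, thesaurus=THESAURUS):
    """Return names of documents containing any expanded query term."""
    wanted = expand_query(terms, thesaurus)
    return sorted(
        name for name, text in documents.items()
        if wanted & set(text.lower().split())
    )


documents = {
    "a.txt": "the slump of 1991",
    "b.txt": "a recession indicator",
    "c.txt": "gardening notes",
}
hits = thesaurus_search(["recession"], documents)
```

A plain keyword search for "recession" would find only b.txt; the expanded search also picks up a.txt, which never uses the word.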
The chained search is, well, "If you like this you might like that." Given an initial list of responses, I check a "Useful?" box and then a "search spinoffs" button, which gives me things that weren't really in my first search but which are related to the things I used - the "see also, near term, related term, other things same author, other things same publisher" equivalents. Many blogs, by the way, accomplish much of this with the combination of trackbacks and links. It's not perfectly meeting the desires of Brad and Abiola, but it does exist.
What? too much work - that brings too many items for you to peruse? Exactly - and that's the typical flaw with comprehensive engines and indexes. The significant gain, though, is that it moves items from UU to UK or KU - depending on whether you stop with knowing the subject (or author or even title) is there or if you actually read something from your searches.
But that leads into the unspoken desire - to eliminate the work we'd like all those links, but prefiltered. Sorted by usefulness, by desirability, by how interesting we'll find it. It's possible, of course, but far less easily done. Google's mechanism fulfills one of these issues - how frequently has it been touched by others. But if what you want is on a rarely touched site in the backwaters, it may be completely missed (as already noted). Google (for example) has a modified engine out that somewhat touches on Proffy's point - their "personalized interests" search [ http://labs.google.com/personalized ] - but it's not yet comprehensive, and worse suffers from "what I'm generally interested in at the time I set it up" over "what I'm particularly and peculiarly interested in this time."
Still, I think what Brad and company want is coming - the early tools are there, and indeed if you're willing to stretch what is there you can get quite a bit (booleans, manual chaining, personalized search profiles). Part of it's just realizing that you do have to do some work - both to get anything and to avoid being buried in garbage. Well, that and recognizing that Sturgeon's law still applies.
Oh, last point. There is really only one other way to find Unknown-Unknown materials -- sheer serendipity. You stumble across it when you mistype a URL or browse a shelf, or you overhear it in a conversation, or it reaches out and slaps you in the face of your expectations. Functionally, this is true even if you got there by extending from earlier connections - you didn't know till now, but lo and behold there it is.
Just my opinions of course.
Posted by: Kirk_Spencer on June 18, 2004 09:19 PM
Abiola - just a few quick responses. I don't really disagree with most of what you are saying and most is a matter of degree. But here are a few remarks.
Abiola: "First of all, as I explicitly mentioned in my post, Windows 2000 and its successors already have the capacity to carry out file indexing, provided the necessary filters are hooked into the Indexing Service API. The problem with this is that it's a gigantic resource hog, and it isn't widely supported by third-party vendors"
Yes, agree. But these issues will be resolved in a little while and resource consumption can be reduced.
Abiola: "The big problem with all of the above functionality is that most of the really interesting stuff on our hard drives is stored in binary formats that traditional filters aren't good at indexing in a meaningful manner: for instance, no filter I know of will help you when you're looking for that 1 photo out of 15,000 all stored in your "My Pictures" directory. Then we've got files that are either stored in proprietary formats, or, like Postscript, are written in Turing-complete languages, which means there's no feasible algorithm that can ever be guaranteed to search through all of them in a finite time period (Halting Problem)."
Here I have to somewhat disagree. Images and other media formats are a big problem, but postscript etc IS NOT. While in theory some of these data formats may be hard to process in extreme cases, in practice Google already does this using reasonable filters. Same for most other non-media formats, at least in the common cases. Supporting adapters for all these is a hassle but not a really hard problem since a few formats make up most of the data.
Abiola: "But this profiling of access patterns does nothing to help in just that scenario Brad mentions as being most vexing, i.e., when you don't even know you have something stored on your machine because it's been ages (if ever) since you last examined it. This is going to be an ever more pressing problem as storage costs continue to plummet."
Actually, I think it will help a lot because human consumption of data is limited. There just aren't that many images and files that we can look at even over several years, compared to the amount of data on disk or on the web. OK, if you have never seen it, that is a different story, but otherwise the fact that it has been accessed, and in what context, can provide a good filter, particularly for images where no other usable data is available. (And I was also thinking about the case of finding stuff on the web that you have previously seen, but can't remember where, which is another case.)
Abiola: "A distributed system where one can call on the access profiles of potentially millions of individuals is likely to be much more comprehensive than anything based on a single individual's activity record"
Yes, but comprehensiveness is not the issue in this case, at least in my approach. Yes, aggregating information from many users for prediction is another approach that is useful for public documents. For personal files, prior accesses are very useful in ranking. Simple heuristics will work quite well here.
Regards,
Proffy
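One way to picture a "simple heuristic" of this kind is a score that counts prior accesses and discounts older ones. The sketch below is only an illustration under assumed inputs: the access log, the file names, and the 30-day half-life are invented for the example and are not taken from any existing tool.

```python
import math


def access_score(access_times, now, half_life_days=30.0):
    """Sum one exponentially decayed point per prior access.

    `access_times` and `now` are timestamps in days; the 30-day half-life
    is an arbitrary illustrative choice.
    """
    return sum(
        math.pow(0.5, (now - t) / half_life_days) for t in access_times
    )


def rank_by_access(candidates, access_log, now):
    """Order matching files by decayed access score, highest first."""
    return sorted(
        candidates,
        key=lambda name: access_score(access_log.get(name, []), now),
        reverse=True,
    )


access_log = {
    "old_photo.jpg": [10.0],
    "recent_photo.jpg": [95.0, 99.0],
}
ranked = rank_by_access(
    ["never_opened.jpg", "old_photo.jpg", "recent_photo.jpg"],
    access_log,
    now=100.0,
)
```

Files that were opened often and recently float to the top, and a file that was never opened scores zero, which is exactly the case where this heuristic has nothing to offer.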
"Then we've got files that are either stored in proprietary formats, or, like Postscript, are written in Turing-complete languages, which means there's no feasible algorithm that can ever be guaranteed to search through all of them in a finite time period (Halting Problem)."
Abiola, do you understand the Halting Problem at all? Do you understand that it's a theoretical result about undecidability, not a practical fact about computer programming? Also, do you understand that the Halting Problem doesn't state that it's impossible to determine whether some specific algorithm terminates, but instead that it's impossible to devise a *general* method for determining whether some arbitrary algorithm will halt? And, please oh please tell me, why would whether a string-matching algorithm terminates depend *in any way* on whether its input represents a program written in some Turing-complete language?
Brad, Abiola clearly does not know what he's talking about. He's mentioning the Halting Problem because it sounds flashy and cool, not because it's in any way relevant to the topic at hand. It makes little sense for you to rely on him in the future for thoughts of any kind on computing. With so many outstanding minds among your colleagues in Berkeley's absurdly excellent CS department, I hope you'll skip Abiola's musings on computing in the future. (Or, at least, that you'll spare your loyal readers from having to skip over block quotes of his writing.)
Posted by: N on June 19, 2004 02:05 AM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"Abiola, do you understand the Halting Problem at all?"
To say yes would be putting it mildly ...
"Do you understand that it's a theoretical result about undecidability, not a practical fact about computer programming?"
You're the one who's lost if you say this ...
"And, please oh please tell me, why would whether a string-matching algorithm terminates depend *in any way* on whether its input represents a program written in some Turing-complete language?"
Because Postscript is a *Turing-complete programming language*, not just a markup language like HTML? Have you ever bothered to look at the raw content of a Postscript file? Ever wondered why Postscript viewers have so much trouble with, say, skipping to a particular page? A little preliminary homework would have saved you some embarrassment.
"Brad, Abiola clearly does not know what he's talking about."
Pot. Kettle. Black.
"He's mentioning the Halting Problem because it sounds flashy and cool, not because it's in any way relevant to the topic at hand."
It may sound "cool" to you, but it's nothing of the sort to anyone with a basic undergraduate background in CS or mathematics, and it was enough of a problem for John Warnock (ever heard of him?) and company to go back to the drawing board and come up with Postscript.
To show that you have absolutely no clue what you're on about, here's a quote from a site that ought to know a little bit about these things,
http://www-cdf.fnal.gov/offline/PostScript/AdobePS.html
"There are no foolproof, omnipotent, programmable, programmed or universally simple methods for fixing Adobe PostScript files! There is a mathematical proof of the Halting Problem which is the reason why such methods do not exist. `Fixing' includes changing PS into EPS, making a file universally printable or viewable, rearranging or removing graphic elements from an image, etc."
Also see the following discussion:
http://lists.slug.org.au/archives/slug-chat/2004/05/msg00140.html
So much for the Halting Problem being merely "theoretical" in its problematic nature. If you knew what you were on about, you'd have been aware that the problems that arise from Postscript's Turing-completeness are a widely known issue amongst those who care about such things.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32) - GPGshell v3.10
Comment: My Public Key is at the following URL:
Comment: http://www.alapite.net/pgp/AbiolaLapite.txt
iD8DBQFA1CQEOgWD1ZKzuwkRAgwpAJ9xQfM1XZAcLLBJnBy08uRIu/aYBgCcD+TS
k8hDqdnPUPOQaeXV36IDmuQ=
=ssE4
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Err, that should have read "it was enough of a problem for John Warnock (ever heard of him?) and company to go back to the drawing board and come up with PDF."
The fact that PDF is *not* Turing-complete constitutes a big part of its appeal, as is explicitly recognized in the PDF standard documentation.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32) - GPGshell v3.10
Comment: My Public Key is at the following URL:
Comment: http://www.alapite.net/pgp/AbiolaLapite.txt
iD8DBQFA1CTtOgWD1ZKzuwkRAiHfAJ9nTYXGR0zWTvG5Qhun1POTp1baYgCfSaZQ
cX7mbqBjmJVT4b15GoughCE=
=9am0
-----END PGP SIGNATURE-----
Abiola Lapite writes: ""There are no foolproof, omnipotent, programmable, programmed or universally simple methods for fixing Adobe PostScript files! There is a mathematical proof of the Halting Problem which is the reason why such methods do not exist. `Fixing' includes changing PS into EPS, making a file universally printable or viewable, rearranging or removing graphic elements from an image, etc.""
But that's a very different problem from merely stripping out text strings.
Yes, it is possible to write obfuscated Postscript that places glyphs algorithmically, and so doesn't contain any of the words that appear in the document.
But that's not how real-world Postscript-format documents are generated. In real-world applications where the user just wants the Postscript to look right, and for it to be generated quickly, the Postscript isn't going to bother with fancy obfuscation algorithms.
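As a rough illustration of what "merely stripping out text strings" could look like, here is a minimal Python sketch. It assumes the document keeps its words in ordinary parenthesised string literals, and the function name `extract_ps_strings` is made up for this example. It does not interpret the PostScript at all, so hex strings, octal escapes, and algorithmically placed glyphs are all out of its reach.

```python
def extract_ps_strings(source: str) -> list[str]:
    """Collect the contents of parenthesised PostScript string literals.

    Simplifications (assumptions of this sketch): escape sequences are
    reduced to the escaped character itself, and <hex> strings are ignored.
    """
    strings = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == "%":
            # A comment runs to the end of the line; skip it.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "(":
            depth, buf = 1, []
            i += 1
            while i < n and depth > 0:
                c = source[i]
                if c == "\\" and i + 1 < n:
                    # Keep the escaped character literally.
                    buf.append(source[i + 1])
                    i += 2
                    continue
                if c == "(":
                    depth += 1  # balanced inner parentheses are allowed
                elif c == ")":
                    depth -= 1
                    if depth == 0:
                        break
                buf.append(c)
                i += 1
            strings.append("".join(buf))
            i += 1  # step past the closing parenthesis
        else:
            i += 1
    return strings


if __name__ == "__main__":
    sample = "72 720 moveto\n(Hello, world) show\n"
    print(extract_ps_strings(sample))
```

Anything the program computes at run time never appears as a literal, which is exactly the case the Halting Problem argument is about; the sketch only covers the straightforward documents described above.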
Abiola,
It makes no sense to say that gathering metadata is the "single [most] difficult problem in information retrieval and knowledge representation that I'm aware of" unless you define what metadata we are talking about.
Let's say that the metadata I want to understand is the last time a file was accessed -- not difficult at all to gather automatically.
Let's say you wanted to look at all cached HTML pages viewed when listening to a certain song -- not too difficult.
Let's say you want to view all photographs with your brother in them -- manually, this is easy but tedious. Automatically, using facial recognition, this is possible but fairly error-prone. So this would still be a difficult but solvable problem.
What scenario are you thinking of, on a local computer, and what kind of metadata do you require to get what result?
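To make the first case concrete, here is a minimal Python sketch of gathering last-access times for every file under a directory. The function name `last_access_times` is invented for this example, and it assumes the file system actually records access times; some systems are configured to update them lazily or not at all, in which case the values will be stale.

```python
import os


def last_access_times(root: str) -> dict[str, float]:
    """Map each file path under root to its last-access time (Unix seconds)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                result[path] = os.stat(path).st_atime
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return result


if __name__ == "__main__":
    import time

    for path, atime in sorted(last_access_times(".").items()):
        print(time.ctime(atime), path)
```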
Posted by: theCoach on June 19, 2004 09:36 AM
To make Abiola's point about PostScript easier to grasp, note that there is a PostScript web server, so a seemingly innocent PostScript document might actually be passing along a completely separate document of any size, not necessarily constant over time, and possibly in a different format. To handle PostScript as it would an HTML document, the indexer would have to take that in its stride.
In practice the indexer would most likely do as Jon says and just index it as a text document in the way that Google itself does.
I think that the major disagreement here is in interpreting the ambition of the original question. If Brad wants to recover a known unknown, will indexing and improved storage capacity sort him out? Most likely. Will it find unknown unknowns on someone else's web site? Of course not.
Of course annotating files is a difficult problem. Even working out what it means is very difficult, in the sense of there not being a good universal answer. Google certainly does a useful job, but one more impressive in quantity than quality. It certainly is not that robust, as you find out when you try to use it to find the web site of a specific hotel, for example.
I can't agree that comprehensive indexing is currently impractically resource intensive on a PC. PCs simply don't generate enough data, and they have so many spare cycles, that in my experience of using indexing switched on in Windows and DEVONThink on the Mac it is already not an issue.
Not having a filter for PDF as standard is probably more significant. On a corporate file server, for example, that would change radically, as the number of files to be indexed will be completely different, as will managing permissions on indexes, but storing index-related metadata as a standard part of the file could well overcome that.
I think the difficult part is actually making it happen transparently. I imagine tools like the Kenjin bar that Autonomy used to distribute, or the Mono Dashboard project. These provide more sophisticated search functionality plus context-sensitive related information from the computer and the internet. A well indexed and bemetadatad file system will have the data and resources to examine when and by whom something was created or viewed, what format it was in, what else was being looked at around the same time, and what documents are similar as measured by location, classification, linking and direct content. It will be able to make use of personal annotation and very specific search criteria, and to make use of subject fingerprints gleaned from much larger data sets, as well as looking for results on the internet and refining and previewing the results. It will have an interface that can handle this, and the performance available to make it happen.
Microsoft can or will soon be able to do most of what Google does, but initially won't be the default choice for people to use. Some kind of integration with on-PC searching, for example constantly updated context-related search, might be able to change that. The Microsoft threat would never just be WinFS on its own, even if content-based search, as opposed to annotation-based search, is underexploited by Google and other search engines.
On the other hand backing up all the data will become very challenging as will making it all available.
Posted by: Jack on June 19, 2004 12:36 PM
Looks like a marketing opportunity for Google or someone: backup and find your own stuff.
Posted by: James on June 20, 2004 01:46 AM
James,
Microsoft is going to invade Google's turf (internet search), and Google will inch toward Microsoft's (OS utility that searches).
One of the intriguing things about WinFS is that they will have some standard 'schemas' for many concepts, and there is a mechanism for allowing a domain consensus to emerge with regard to different types of documents.
Mining context data from existing formats becomes a cottage industry if there is a mechanism to leverage it.
Suggestion: Consider turning content indexing on for the disks on your Mac. Type "index" into Finder Help for instructions.
Posted by: Ben Hyde on June 20, 2004 12:16 PM