Wednesday, June 23, 2004
This is getting interesting...
Nassib Nassar, the author of Amberfish, an indexing/search engine, wrote a fascinating primer on the field of Information Retrieval (IR), which is the study and application of data indexing and searching. As I mentioned earlier, Mark Bucciarelli is planning to use this engine to provide a search facility for KHelpCenter. Here are some snippets from Nassib's email:
IR is a major field and goes back to the 1950s. Relevance ranking and related work on similarity goes back almost that far. Similarity is measured using statistical methods. The common approaches are usually based on either vector space models or probability theory. There is also "latent semantic indexing" which gives really good results. It creates huge matrices of term-by-document relationships like you mentioned; it's very slow and those people seem to work often on things like dimensionality reduction.
Networked IR made a lot of progress in the early 90's. Before the web, most text data were at large libraries, repositories, and vendors, like the Library of Congress, Lexis-Nexis, etc. These organizations got together and developed an open standard called ANSI/NISO Z39.50 (ISO 23950), which defines a client-server protocol for networked/distributed IR. Z39.50 is *very* comprehensive and well thought out. Unfortunately it was a complex standard and had the misfortune of maturing at the moment TimBL's work took off. So it was overshadowed by web searching. Most people forgot about "deep searching" of document collections, being so amazed by web searching, though the big libraries and repositories still use Z39.50. A small group of us from those IETF working groups are still involved with distributed searching standards, trying to get it into the grid infrastructure, for one thing. I think this is an area where Linux could leapfrog Microsoft, because we are less paranoid about owning everything and may be able to look slightly longer term (if we are willing to). Implementing search on the desktop is a step in this direction, because it would put most of the software infrastructure in place.
There is much more.
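To get a feel for what "vector space model" means in the email above, here is a rough sketch I put together in Python. It is not how Amberfish works (I have not looked at its internals); it is just the textbook idea of turning documents into term-weight vectors and ranking them by cosine similarity against a query. The function names, the whitespace tokenizer, and the smoothed IDF formula are all my own assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed inverse document frequency (one common variant, not the only one).
    idf = {t: math.log((1 + n) / (1 + count)) + 1.0 for t, count in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    docs = [
        "kde help center search",
        "cvs commit log for kdelibs",
        "search engine indexing",
    ]
    for score, i in rank("search indexing", docs):
        print(f"{score:.3f}  {docs[i]}")
```

A real engine would use an inverted index instead of scoring every document, plus stemming and a proper tokenizer, but the ranking idea is the same.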
Now to start trying out the software. I've got an archive of kde-cvs posts which I am indexing right now. I want to explore a little to see what is possible.