Thursday, April 14, 2005
This isn't Tenor; I haven't looked at the code. Search and pattern matching is a fascinating intellectual exercise, and here is the product of my feeble ruminations.
First, some definitions. A token is a word, URL, filename, MIME type, time, or any other piece of metadata that you would like to know. A document is a collection of tokens: most obviously an email, a web page, a PDF, a file that you composed, an appointment, etc. The list is large and a field for much thought.
So you have a large collection of documents on your hard drive, coming in and going out over network interfaces. The first step is to build an index, token -> document. A token query would return a list of documents. Here some magic to deal with verb tenses, plurals and possibly synonyms would be helpful. In the end, you would provide a query with a token or list of tokens and get back a list of documents. Finding the exact document would require replicating it in the query.
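To make the token -> document index concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names tokenize, normalize, build_index and query are made up, the sample documents are invented, and the plural stripping is a deliberately crude stand-in for the "magic" a real system would need for tenses, plurals and synonyms.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (crude; ignores URLs, dates, etc.)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(token):
    """Naive plural stripping; a real system would use a proper stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_index(documents):
    """Build the inverted index: normalized token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return index


def query(index, tokens):
    """Return the ids of documents that contain every token in the query."""
    postings = [index.get(normalize(t.lower()), set()) for t in tokens]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical sample documents, keyed by made-up ids.
documents = {
    "mail-1": "Meeting notes about the search index",
    "page-1": "Desktop search finds documents and emails",
    "appt-1": "Dentist appointment on Thursday",
}

index = build_index(documents)
print(query(index, ["search"]))            # documents mentioning "search"
print(query(index, ["search", "emails"]))  # narrowed by a second token
```

Each extra token in the query shrinks the result set, which is why pinning down one exact document this way would mean repeating most of its tokens in the query.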
Getting to this point means drawing on two or three decades' worth of research into document retrieval systems. It is still not enough for our purposes.
Take this interlude to upgrade storage and processor capacity. You'll need it.
To be continued...