Thursday, April 14, 2005

Rudimentary context, cont'd

Ah work, the curse of the drinking man. Back to the interesting stuff.

I see Scott has responded to my blog. When I said that the query had to replicate the precise document, what I meant was that getting a one-hit response to a query on a large data set takes a very specific query. Adding words to a query returns a diminishing number of responses, until at some point the query is specific enough to return the one document that matches. This characteristic makes a simple index-and-search function essentially useless for most desktop purposes. What Google does is assign an importance or relevance score that orders the large number of results. I assume the goal isn't to provide many matches, but to provide a few useful ones.
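
The narrowing effect can be sketched as intersecting posting lists: each extra query term ANDs in another list of matching documents. A minimal sketch in Python, with made-up data (the index contents and document IDs are purely illustrative):

```python
# Hypothetical token -> document-ID index; each extra query term
# intersects another posting list, shrinking the result set.
index = {
    "farrah":  {1, 2, 3, 4},
    "poster":  {2, 3, 4},
    "bedroom": {3, 4},
    "1977":    {4},
}

def search(index, query_terms):
    """Return the documents containing every query term."""
    results = None
    for term in query_terms:
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return results or set()

# Each added term narrows the hits, until one document remains.
print(sorted(search(index, ["farrah"])))                        # [1, 2, 3, 4]
print(sorted(search(index, ["farrah", "poster", "bedroom", "1977"])))  # [4]
```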

As for computer resources, I once indexed my email collection. It took forever. By definition an indexing and context scheme will be most useful to those who have large amounts of data. That being said, a machine two years old has enormous computing power, most of it unused.

Again I profess no special knowledge in this stuff. I'll continue using my terminology to avoid getting confused.

So back to the mechanics. Once there exists an index of tokens, some interesting analysis can be done. Two goals. First, to have a way of identifying common tokens. Say you are a young man, and your interests are evident in the fact that almost all of your documents have 'Farrah' in them. Any query on 'Farrah' will return all or almost all documents. So for our purposes 'Farrah' is a common word, at least until the fashion changes and 'Angelina' shows up in all the documents. So tokens are scored by how often they occur across documents. An arbitrary cutoff could be chosen to give us a list of common words. All the other tokens are similarly scored, with rarely seen words or tokens having a low score, and vice versa. Structured documents give us a problem and an advantage here. HTML or email files have tags that occur in every document, so in the parsing these tags would need to be excluded from the scoring.
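
The scoring step might look something like this. A minimal sketch, with hypothetical documents; the cutoff value is the arbitrary one mentioned above:

```python
from collections import Counter

# Hypothetical documents, already parsed into token sets
# (structural tags assumed stripped during parsing).
docs = {
    "d1": {"farrah", "poster", "shop"},
    "d2": {"farrah", "car", "shop"},
    "d3": {"farrah", "recipe", "soup"},
}

def doc_frequency(docs):
    """Score each token by the fraction of documents it appears in."""
    df = Counter()
    for tokens in docs.values():
        df.update(tokens)
    return {tok: n / len(docs) for tok, n in df.items()}

def common_words(docs, cutoff=0.9):
    """Tokens above an arbitrary cutoff are treated as 'common'."""
    return {tok for tok, score in doc_frequency(docs).items() if score >= cutoff}

# 'farrah' occurs in every document, so it crosses the cutoff;
# rare tokens like 'recipe' get a low score.
```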

Now you would create a document -> token index. From that index you can begin to discern a way to identify a document. One way is to select the tokens that have the lowest scores in the token->document index. Remember that tokens could include synonyms. The document becomes identified by the smallest number of low-score tokens that differentiate it from all other documents. Hence a third index. (Imagine using this to automatically create a document name instead of prompting for a filename.) The advantage of structured documents can be used to nudge up the importance of specific tokens, e.g. subject-line text.
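
That fingerprinting idea can be sketched as a greedy loop: add the document's tokens from rarest to most common until only that document matches. Again, the documents and the `fingerprint` helper are my own illustration, not part of any actual system:

```python
# Hypothetical documents, parsed into token sets.
docs = {
    "d1": {"farrah", "poster", "shop"},
    "d2": {"farrah", "car", "shop"},
    "d3": {"farrah", "recipe", "soup"},
}

# token -> document-frequency score (higher = more common)
df = {}
for tokens in docs.values():
    for tok in tokens:
        df[tok] = df.get(tok, 0) + 1

def fingerprint(doc_id):
    """Add tokens from rarest to most common until only doc_id remains."""
    chosen = []
    candidates = set(docs)
    for tok in sorted(docs[doc_id], key=lambda t: df[t]):
        chosen.append(tok)
        candidates = {d for d in candidates if tok in docs[d]}
        if candidates == {doc_id}:
            break
    return chosen

# fingerprint("d1") -> ["poster"]: one rare token suffices to
# differentiate d1, so that token could double as an automatic name.
```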

Now to find similar documents and discern contexts. Starting with the lowest-scoring token in a given document, and using the links in the token->document index, you come up with a list of similar documents. Then go to the next-higher-scoring token, get its documents, and the list gets a bit smaller. Rinse, lather, repeat. Bingo: you have similar documents that form a context. Some arbitrary top and bottom thresholds would help avoid seeing patterns where there are none, and missing obvious ones.
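
A rough sketch of that walk, including the top and bottom thresholds (tokens unique to one document can't link anything, and near-universal tokens link everything). The data and the `context_for` helper are hypothetical:

```python
# Hypothetical documents, parsed into token sets.
docs = {
    "d1": {"tenor", "context", "index", "search"},
    "d2": {"tenor", "context", "index", "blog"},
    "d3": {"tenor", "recipe", "soup"},
    "d4": {"tenor", "context", "music"},
}

# Invert into the token -> document index.
postings = {}
for d, toks in docs.items():
    for t in toks:
        postings.setdefault(t, set()).add(d)

def context_for(doc_id, lo=2, hi=3):
    """Walk the document's tokens from rarest to most common, skipping
    tokens below the bottom threshold (too rare to link documents) or
    above the top threshold (too common to mean anything), and
    intersect their posting lists."""
    result = None
    for tok in sorted(docs[doc_id], key=lambda t: len(postings[t])):
        n = len(postings[tok])
        if n < lo or n > hi:
            continue
        result = postings[tok] if result is None else result & postings[tok]
    return result or set()

# d1 and d2 share the rare-ish tokens 'index' and 'context',
# so they fall into the same context; d3 stands alone.
```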

Remember that tokens include other metadata such as time, MIME type, and the actual subject of the document. Preliminary matches can then be checked against this other metadata.

Realistically the similarities in most of what goes through a multipurpose desktop will be somewhat contrived, and of limited interest. But we do have patterns of activity. And we all go off on tangents, doing research on a particular subject or working on a specific project. For our machine to recognize such things as contexts, and to allow applications to use this information to assist us, would be very useful.

I'm sure there are serious holes in my approach. How would you deal with the common situation where similar work is prepared regularly? My scheme tends towards recognition of the unusual or differentiating elements. And of course there's the question of whether all this could be done in a reasonable amount of time and space. I do agree with Scott: I don't think this is impossible, or even impractical. The devil will indeed be in the details.

Now I must go and read what Tenor actually is :)


Comments:
you're describing the classic "term frequency / inverse document frequency" ranking scheme that is very common in modeling techniques.

it's actually covered right near the beginning of Chapter 2 of the book Scott referenced in his blog "Modern Information Retrieval".

statistical modeling based on content will certainly be useful and interesting, just remember that it's augmented by non-content based contextual information.

i actually consider the latter to be of even higher value than the former.
 