Saturday, April 30, 2005

The Importance of Being Eloquent

This Safari khtml saga has brought a smile to my face. At my workplace we take great pleasure in abusing sales and marketing people, and to think of the discomfort at Apple, who should be basking in praise for their upcoming software release... Zack does have impeccable timing. And then there are the suggestions that it isn't so much the availability of the patches as the quality of the code. Tee hee.

Some of the misconceptions may be due to my infernal persistence in publishing the commit logs. People have the privilege of viewing the work in progress and reading the comments from the developers. Take advantage of the opportunity to make the impression you want. As your faithful sycophant, I will publish your pronouncements to the teeming masses.

Unless of course this was a brilliantly executed strategy hatched at kde-dirtytricks.org.


Tuesday, April 26, 2005

Digest test server

Almost everything works as expected. A minor issue with the diffs, but otherwise all is well.

Here is an issue that has about 1/4 of the commits referring to svn: http://cvs-digest.org:8080/?issue=mar122005. It is otherwise the same as the March 11 issue. The trunk module holds all of the svn commits, and it looks almost the same as the cvs one.

The digest and diff classes call a query class that instantiates either the cvs or svn class depending on the revision number. The scm descendant classes do the necessary magic to fetch data from either repository, returning an array with the same structure. The diff class was refactored to simplify things. The various kinds of diff, such as graphics, sound and unified, now all inherit from a parent that defines the common behaviour, which has helped keep the views consistent. The image diff used to check out the image, cache it to a file, and point the img src at the cache file. Now the img src is a call to the transfer class, which checks out the image on demand. A few other cleanups were done to tidy up some messes and remove old code.
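
For what it's worth, here is a rough sketch of the shape of that class hierarchy. It is written in Python purely for illustration, it is not the actual digest code, and all of the names are mine.

    from abc import ABC, abstractmethod

    class Diff(ABC):
        """Parent defining the behaviour common to every kind of diff view."""

        def __init__(self, path, revision):
            self.path = path
            self.revision = revision

        def header(self):
            return f"{self.path} @ {self.revision}"

        @abstractmethod
        def render(self):
            ...

    class UnifiedDiff(Diff):
        def render(self):
            return self.header() + "\n--- old\n+++ new\n..."

    class ImageDiff(Diff):
        def render(self):
            # Instead of caching the checked-out image to a file, the img src
            # points at a transfer script that checks the image out on demand
            # ("transfer" is a stand-in name here).
            return f'<img src="transfer?path={self.path}&rev={self.revision}">'

    print(ImageDiff("icons/ok.png", "443245").render())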

With subversion support in place, I can recommend this to anyone else who wants to produce a commit digest for their project. Working with cvs required substantial hacks, a local repository and a bunch of headaches. With subversion, the repository can be remote or local.

The test server has caching disabled, so I've disabled the all in one page. It takes too long to generate it, and no use trashing the server too badly. Expect some delays as the scripts fetch the data from svn.kde.org. Once this code is in production, caching will make a huge difference.


Friday, April 22, 2005

KDE CVS|SVN-Digest

I guess I've got to rename the whole mess. KDE Commit Digest. There.

The scripts that generate the digest have been almost completely rewritten to accommodate subversion. The scm is determined by the revision number (1.2 vs 443245). Either scm can be read to generate the digest lists or diffs, which means older issues will still be available. I will keep a local copy of the cvs repository for that purpose. I have set up a test server and am in the process of bringing it all together. This gives me a chance to clean up some cruft that had accumulated. The statistics script isn't written yet, nor have I set up a local mirror for performance.
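
The gist of that revision-number test is small enough to show; this is a minimal sketch under my own naming, not the real script:

    def scm_for_revision(revision: str) -> str:
        """CVS revisions look like '1.2'; subversion revisions are plain integers."""
        return "cvs" if "." in revision else "svn"

    assert scm_for_revision("1.2") == "cvs"
    assert scm_for_revision("443245") == "svn"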

Once I've got something working, I will post the URL. There won't be any visible difference for readers, just a different backend.


Thursday, April 14, 2005

Rudimentary context, cont'd

Ah work, the curse of the drinking man. Back to the interesting stuff.

I see Scott has responded to my blog. When I said that the query had to replicate the precise document, what I meant was this: to get a single-hit response to a query over a large data set, the query has to contain most of the document. Adding words to the query shrinks the number of responses, until at some point the query is large enough to return the one document that matches. This characteristic makes a simple index and search function essentially useless for most desktop purposes. What google does is assign an importance or relevance score which orders the large number of results. I assume that the goal isn't to provide many matches, but to provide a few useful ones.

As for computer resources, I once indexed my email collection. It took forever. By definition an indexing and context scheme will be most useful to those who have large amounts of data. That being said, a machine two years old has enormous computing power, most of it unused.

Again I profess no special knowledge in this stuff. I'll continue using my terminology to avoid getting confused.

So back to the mechanics. Once there exists an index of tokens, some interesting analysis can be done. There are two goals. The first is to have a way of identifying common tokens. Say you are a young man, and your interests are evident in the fact that almost all of your documents have 'Farrah' in them. Any query on 'Farrah' will return all or almost all documents. So for our purposes, 'Farrah' is a common word, at least until the fashion changes and the word 'Angelina' shows up in all documents. So tokens are scored by how often they occur across documents. An arbitrary cutoff could be chosen to give us a list of common words. All the other tokens are similarly scored, with rarely seen words or tokens having a low score, and vice versa. Structured documents give us both a problem and an advantage here. HTML or email files have tags that occur in every document, so in the parsing these tags would need to be excluded from the scoring.
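
To make the scoring concrete, here is a toy sketch in Python. The names and the cutoff value are arbitrary choices for the example, and real code would also have to strip the structural tags just mentioned.

    from collections import defaultdict

    def token_document_counts(documents):
        """documents: mapping of doc id -> set of tokens."""
        counts = defaultdict(int)
        for tokens in documents.values():
            for token in tokens:
                counts[token] += 1
        return counts

    def common_tokens(documents, cutoff=0.8):
        """Tokens appearing in at least `cutoff` of all documents."""
        counts = token_document_counts(documents)
        total = len(documents)
        return {t for t, c in counts.items() if c / total >= cutoff}

    docs = {
        "mail1": {"farrah", "poster", "subject"},
        "mail2": {"farrah", "meeting", "subject"},
        "mail3": {"farrah", "invoice", "subject"},
    }
    print(common_tokens(docs))  # {'farrah', 'subject'} -- 'subject' is really a structural tag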

Now you would create a document -> token index. From that index you can begin to discern a way to identify a document. One way is to select the tokens that have the lowest scores in the token -> document index. Remember that tokens could include synonyms. The document becomes identified by the smallest number of low-score tokens that differentiate it from other documents. Hence a third index. (Imagine using this to automatically create a document name instead of prompting for a filename.) The advantage of structured documents can be used to nudge up the importance of specific tokens, e.g. subject line text.
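
A naive illustration of that identification step, again a sketch with made-up names rather than anything existing: walk a document's tokens from rarest to most common and stop as soon as the combination is unique.

    from collections import Counter

    def identifying_tokens(doc_id, documents):
        """Smallest prefix of rarest tokens that singles out doc_id.

        documents: mapping of doc id -> set of tokens.
        """
        counts = Counter(t for tokens in documents.values() for t in tokens)
        ordered = sorted(documents[doc_id], key=lambda t: counts[t])
        chosen, candidates = [], set(documents)
        for token in ordered:
            chosen.append(token)
            candidates = {d for d in candidates if token in documents[d]}
            if candidates == {doc_id}:
                break
        return chosen

    docs = {
        "mail1": {"farrah", "poster", "subject"},
        "mail2": {"farrah", "meeting", "subject"},
        "mail3": {"farrah", "invoice", "subject"},
    }
    print(identifying_tokens("mail2", docs))  # ['meeting']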

Now to find similar documents to discern contexts. Starting with the lowest scoring token in a given document, and using the links in the token -> document index, you come up with a list of similar documents. Then go to the next higher scoring token, get its documents, and the list gets a bit smaller. Rinse, lather, repeat. Bingo: you have a set of similar documents that form a context. Some arbitrary top and bottom thresholds would help avoid seeing patterns where there are none, and missing obvious ones.
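
The walk itself might look something like this toy version, with an arbitrary bottom threshold standing in for the safeguards just mentioned; as before, all names are mine.

    from collections import Counter

    def context_for(doc_id, documents, min_size=2):
        """Documents sharing the rarest tokens with doc_id.

        documents: mapping of doc id -> set of tokens.
        """
        counts = Counter(t for tokens in documents.values() for t in tokens)
        ordered = sorted(documents[doc_id], key=lambda t: counts[t])
        context = set(documents) - {doc_id}
        for token in ordered:
            narrowed = {d for d in context if token in documents[d]}
            if len(narrowed) < min_size:
                break  # bottom threshold: stop before the context vanishes
            context = narrowed
        return context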

Remember that tokens include other metadata such as time, mimetype, and the actual subject of the document. Preliminary matches can then be checked using the other metadata.

Realistically the similarities in most of what goes through a multipurpose desktop will be somewhat contrived, and of limited interest. But we do have patterns of activity. And we all go off on tangents, doing research on a particular subject, or doing a specific project. For our machine to recognize such things as contexts and allow the applications to use this information to assist us would be very useful.

I'm sure there are serious holes in my approach. How would you deal with the common situation where similar work is prepared regularly? My scheme tends towards recognizing the unusual or differentiating elements. And of course there is the question of whether all this could be done in a reasonable amount of time and space. I do agree with Scott: I don't think that this is impossible or even impractical. The devil will indeed be in the details.

Now I must go and read what Tenor actually is :)


Rudimentary context.

This isn't Tenor; I haven't looked at the code. Search and pattern matching is a fascinating intellectual exercise, and here is the product of my feeble ruminations.

First, some definitions. A token is a word, url, filename, mimetype, time, or any of the other metadata that you would like to know about. A document is a collection of tokens, most obviously an email, a web page, a pdf, a file that you composed, an appointment, etc. The list is large and a field for much thought.

So you have a large collection of documents on your hard drive, coming in and going out over network interfaces. The first step is to build an index, token -> document. A token query would return a list of documents. Here some magic to deal with verb tenses, plurals and possibly synonyms would be helpful. In the end, you would provide a query with a token or list of tokens and get back a list of documents. To find one exact document would require replicating it in the query.
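
A bare-bones version of that first index might look like the sketch below. The normalise() function is a stand-in for the verb tense, plural and synonym magic, and everything here is illustrative rather than any particular engine.

    from collections import defaultdict

    def normalise(word):
        # Stand-in for real stemming and synonym handling: lowercase,
        # strip punctuation and a trailing 's'. Good enough for a sketch.
        word = word.lower().strip(".,!?")
        return word[:-1] if word.endswith("s") else word

    def build_index(documents):
        """documents: mapping of doc id -> raw text. Returns token -> set of doc ids."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.split():
                index[normalise(word)].add(doc_id)
        return index

    def query(index, words):
        """Documents containing every (normalised) query word."""
        hits = [index.get(normalise(w), set()) for w in words]
        return set.intersection(*hits) if hits else set()

    index = build_index({"note1": "Buy Farrah posters", "note2": "Farrah poster arrived"})
    print(query(index, ["farrah", "posters"]))  # {'note1', 'note2'}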

Getting to this point is to use two or three decades worth of research into document retrieval systems. Still not enough for our purposes.

Take this interlude to upgrade storage and processor capacity. You'll need it.

To be continued...


Tuesday, April 12, 2005

Reality intrudes

I heard a funny story the other day. It is true; I heard it from the person himself. This fellow owned a car shop that specialized in transmissions. He had a transmission mechanic who didn't seem to cooperate with his ideas for growing the company, so he got rid of him. His transmission business died, because he didn't have anyone who could fix them. He now owns a fast food restaurant.

This otherwise bright and motivated individual lacked the wisdom to see that the value of his business wasn't his marketing brilliance, the building or even a client base. It was his technically skilled employees, who by definition (I speak from experience here) are hard to deal with, independent, notional, and have respect for no man. His mechanic probably had half a dozen current job offers; in other words, he needed his mechanic more than his mechanic needed him. Reality intruded.

Coding is fun. Technical challenges are fun. Making a living coding or solving technical problems is too often not fun. Many have opined on the motivations behind those who contribute to free software. I firmly believe that the greatest motivation comes from developers taking control of their industry.

It is interesting to see the efforts to pull the control of free software projects away from the developers. They will fail miserably. There are hundreds of interesting and challenging projects that would appreciate skilled developers.

On another subject. I installed Kubuntu last week, replacing a three year old Gentoo installation that had grown to fill 25 gigs of hard drive space. I needed something quickly, and indeed the Ubuntu install works well. A few comments. Ubuntu is still an administrator's installation. No one should be expected to fiddle with permissions to get sound working. Offering a limited range of supported software is doomed to failure. I suspect almost every Ubuntu installation has an expanded sources.list file. And why in blue blazes, in this day and age, when I install a package, doesn't it show up in the menu?

I lived with a raw KDE installation for a long time. Every failing I find in Kubuntu is either outside of KDE or due to a change to the KDE defaults.

I am seriously contemplating going back to Gentoo. If only it could be done in a couple of hours...

