Friday, June 25, 2004
Insane week
A very crazy week. We've had record warm temperatures, so I've been very busy fixing air conditioners.
Saw an interesting application at the Castlegar Fire Center, which is a tanker base and the dispatch and logistics center for the forest fire suppression effort in southeastern British Columbia. Their a/c quit, which makes me popular when I show up to fix it. Anyway, they have an application that superimposes the active fires and lightning strikes over a map of BC. We seem to be in a weather pattern that produces thousands of lightning strikes each evening. It is raining and thundering right now. Great. A good start to the summer.
Wednesday, June 23, 2004
This is getting interesting...
Nassib Nassar, the author of Amberfish, an indexing/search engine, has written a fascinating primer on the field of Information Retrieval, the study and application of indexing and searching data. As I mentioned earlier, Mark Bucciarelli is planning to use this engine to provide a search facility for KHelpCenter. Here are some snips from his email:
IR is a major field and goes back to the 1950s. Relevance ranking and related work on similarity goes back almost that far. Similarity is measured using statistical methods. The common approaches are usually based on either vector space models or probability theory. There is also "latent semantic indexing" which gives really good results. It creates huge matrices of term-by-document relationships like you mentioned; it's very slow and those people seem to work often on things like dimensionality reduction.
Networked IR made a lot of progress in the early 90's. Before the web, most text data were at large libraries, repositories, and vendors, like the Library of Congress, Lexis-Nexis, etc. These organizations got together and developed an open standard called ANSI/NISO Z39.50 (ISO 23950), which defines a client-server protocol for networked/distributed IR. Z39.50 is *very* comprehensive and well thought out. Unfortunately it was a complex standard and had the misfortune of maturing at the moment TimBL's work took off. So it was overshadowed by web searching. Most people forgot about "deep searching" of document collections, being so amazed by web searching, though the big libraries and repositories still use Z39.50. A small group of us from those IETF working groups are still involved with distributed searching standards, trying to get it into the grid infrastructure, for one thing. I think this is an area where Linux could leapfrog Microsoft, because we are less paranoid about owning everything and may be able to look slightly longer term (if we are willing to). Implementing search on the desktop is a step in this direction, because it would put most of the software infrastructure in place.
There is much more.
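To make the vector space model a little more concrete, here is a minimal sketch of my own (it has nothing to do with Amberfish's internals): it ranks a few documents against a query by cosine similarity over raw term counts. A real engine would add tf-idf weighting, stemming, and an inverted index.

```cpp
// Minimal vector-space ranking sketch: score documents against a query
// by cosine similarity over raw term counts. Purely illustrative.
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using TermVector = std::map<std::string, double>;

// Count term occurrences in whitespace-separated text.
TermVector termCounts(const std::string& text) {
    TermVector v;
    std::istringstream in(text);
    std::string term;
    while (in >> term)
        v[term] += 1.0;
    return v;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
double cosine(const TermVector& a, const TermVector& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& [term, w] : a) {
        na += w * w;
        auto it = b.find(term);
        if (it != b.end())
            dot += w * it->second;
    }
    for (const auto& kv : b)
        nb += kv.second * kv.second;
    return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    std::vector<std::string> docs = {
        "kde help center search index",
        "forest fire dispatch center",
        "search engine index documents ranking"
    };
    TermVector query = termCounts("search index");
    for (const auto& doc : docs)
        std::cout << cosine(query, termCounts(doc)) << "  " << doc << '\n';
}
```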
Now to start trying out the software. I've got an archive of kde-cvs posts which I am indexing right now. I want to explore a little to see what is possible.
Tuesday, June 22, 2004
Further search
Mark Bucciarelli wrote in response to my Useful indexing entry.
I wanted to let you know about the AmberFish project, which I discovered last week. I'm hoping to hack a kpart wrapper this summer. I've had some discussion with the author, and he is a KDE user and willing to do what he can to help.
It looks like a very good start. It's C/C++, fast, and allows incremental adding of documents to the index.
My original thought is to integrate it into KHelpCenter: first for the KDE docs, and then expand the search domain to include other directories, for example /usr/share/doc and user-specified directories.
Ah. In the spirit of free software, find something someone has written, and use that. As I thought about indexing schemes, I could see it becoming a huge endeavor.
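As a rough illustration of what expanding the search domain could look like, here is a sketch that walks a few documentation trees and hands each file to a placeholder indexOne() call. The directory names and indexOne() are assumptions for illustration only; the real integration would call into Amberfish itself.

```cpp
// Sketch: walk KDE doc directories plus user-specified ones and hand
// each documentation file to an indexer. indexOne() is a placeholder.
#include <filesystem>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;

// Placeholder for whatever call actually adds a document to the index.
void indexOne(const fs::path& file) {
    std::cout << "indexing " << file << '\n';
}

int main() {
    // KDE docs first, then other documentation trees and user choices.
    std::vector<fs::path> roots = {
        "/usr/share/doc/kde/HTML", "/usr/share/doc", "/home/user/docs"
    };
    for (const auto& root : roots) {
        std::error_code ec;
        for (fs::recursive_directory_iterator it(root, ec), end;
             it != end && !ec; it.increment(ec)) {
            if (!it->is_regular_file())
                continue;
            const auto ext = it->path().extension();
            if (ext == ".html" || ext == ".docbook" || ext == ".txt")
                indexOne(it->path());
        }
    }
}
```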
Saturday, June 19, 2004
Useful indexing
Jon Udell writes about the possibilities of full-scale indexing of the desktop computer.
Context reassembly: When writing a report, you’re likely to refer to a spreadsheet, visit some Web pages, and engage in an IM chat. Using its indexed and searchable event stream, the system would restore this context when you later read or edited the document. Think browser history on steroids.
Screen pops: When you receive an e-mail, IM, or phone call, the history of your interaction with that person would pop up on your screen. The message itself could be used to automatically refine the query.
He is speculating on the possibilities if Google crawled and indexed all the data on our machines. Ignoring the privacy issues for a moment, having an indexed database of all the data that has come and gone on our machines would be very useful. When preparing the Digest, I regularly think back to conversations on ICQ, or some email I saw, that I wish I could find again. Essentially, a drill-down Google search on data I have already seen and read.
The Dashboard ideas that have been floating around for a while are an attempt at this. The difficulty with indexing large amounts of data is being able to ignore the fluff and pick out the important bits to highlight. I do that by hand when reading through the kde-cvs list emails.
The unified environment of KDE presents an opportunity to do this right. Each application could have hooks into an optional indexing library, along with some way of presenting the found data that is unobtrusive but useful.
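Purely as a thought experiment, such a hook might look something like the interface below. None of this exists in KDE today; the class and method names are made up to illustrate the shape of the idea.

```cpp
// Rough sketch of an optional per-application indexing hook: applications
// push documents and events they handle into a shared desktop index, and
// query it later to restore context. Illustrative only; not a KDE API.
#include <ctime>
#include <string>
#include <vector>

struct IndexedItem {
    std::string url;      // where the data lives (file, mail id, chat log)
    std::string title;
    std::string text;     // extracted plain text
    std::time_t seen;     // when the user last touched it
};

class DesktopIndex {
public:
    virtual ~DesktopIndex() = default;

    // Called by an application whenever the user views or edits something.
    virtual void record(const IndexedItem& item) = 0;

    // Free-text query across everything the user has already seen.
    virtual std::vector<IndexedItem> search(const std::string& query) = 0;

    // "Context reassembly": everything touched around the same time as url.
    virtual std::vector<IndexedItem> context(const std::string& url) = 0;
};
```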
Of course, it all comes back to machine intelligence.
Free speech
As someone who regularly says outrageous things, I value the right to free speech. I often say things others vehemently disagree with. Makes for good fun.
So now it seems Microsoft has a corporate strategy for stifling negative comments about itself. No fooling, either: let's get a judge to shut the critics up.
Quoting from Lawrence Lessig's Blog,
Apparently Microsoft has taken the first steps to filing a criminal defamation action against a Brazilian government official who was quoted criticizing Microsoft in a magazine article. Sergio Amadeu, head of the agency responsible for spreading free software within the Brazilian government, is reported to have accused “the company of a 'drug-dealer practice' for offering the operational system Windows to some governments and cities for digital inclusion programs. 'This is a trojan horse, a form of securing critical mass to continue constraining the country.'”
Maybe Microsoft would like to hear a little bit of free speech. Here is their corporate feedback page.
Friday, June 18, 2004
Waterloo
Sunday, June 13, 2004
Clean up
I publish the KDE CVS-Digest.
Quite productive yesterday. I cleaned up the issue loading and put in some error handling. It is also now possible to view the older issues of the Digest. I wrote a routine that changes keywords in the summary into URLs, a task that takes me about 10 minutes each time I publish. Now to edit the older issues and have them regenerated automatically.
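For the curious, keyword-to-URL substitution can be as simple as a table of words and link targets run through a regex replace. The sketch below is not the actual Digest code; the keyword table and URLs are invented for illustration.

```cpp
// Minimal sketch of turning known keywords in a summary into links.
// The keyword table and URL targets are made up for illustration.
#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

std::string linkify(std::string text) {
    // keyword -> target URL (illustrative entries only)
    static const std::vector<std::pair<std::string, std::string>> table = {
        { "KHelpCenter", "http://docs.kde.org/" },
        { "Konqueror",   "http://konqueror.kde.org/" },
    };
    for (const auto& [word, url] : table) {
        std::regex pattern("\\b" + word + "\\b");
        text = std::regex_replace(text, pattern,
                                  "<a href=\"" + url + "\">" + word + "</a>");
    }
    return text;
}

int main() {
    std::cout << linkify("Konqueror views of the KHelpCenter docs") << '\n';
}
```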
The ability to view images is quite popular. I would say 4 out of 5 diff calls are to an image; approximately every fifth viewer looks at a diff. I'm definitely preaching to the converted, as most of the views are from Konqueror.
I'm at a bit of a loss as to what to do next. Some ideas I've been kicking around for a while include:
- Long (whole-file) diffs. This will probably happen soon.
- Graphing of some statistical data.
- Breaking the rather lengthy digest into smaller sections.
- Repository browsing.
- Some site statistics would be interesting.