Saturday, May 26, 2007

Delving into libkscan

Well, I've got the caller ID module almost done. I haven't written the plugin and dbus stuff yet, though. My mind keeps straying to scanning and OCR, so I'd better follow.

Soon after posting the last entry, I was contacted by a fellow who is working on KTiny, a KDE frontend to TinyERP. It seems there is some interest in a way of scanning and OCRing invoices. Gamera is also working on this.

Since libkscan is already written, I figured I should use it. I am now building a KDE4 setup so I can link to the libraries. For my purposes I don't need a complicated scanning application, just something that scans at predetermined settings and saves the image. Preferably it would just be a matter of loading the scanner and pressing a button. Tesseract will do the OCR reliably, but it does not yet return the coordinates of the text. Someone has done a DLL for Windows, but has not released the source. I'm hoping that Tesseract will be fixed by the time I need to start experimenting with it.

My specific needs for document recognition are reasonably well defined. Invoices have things in common: a date, a number of some kind, a vendor identification, terms and shipping stuff, then a list of items showing quantity, description, shipped or back ordered, price, discount, total. Or simply a description and total. My user audience would typically use 5 or 6 major vendors, and maybe twice that many minor vendors. In other words, most of the paper going through would be very similar. I'm not sure if this makes sense, but if you had the text, the coordinates of the text, and a couple of examples from a vendor to show what changes and what stays the same across invoices, some logic could probably extract the desired information. We shall see.
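As a rough illustration of that idea, here is a minimal sketch in Python. It assumes OCR output is available as (text, x, y) tuples, which Tesseract does not provide yet. The `Word` type, the function names, the pixel tolerance and the sample invoices are all made up for the example. Words that sit at the same spot in every sample are treated as fixed labels, and whatever appears to the right of a label on a new invoice is taken as its value.

```python
from collections import namedtuple

# Hypothetical OCR result: a recognized word and the position of its box.
Word = namedtuple("Word", "text x y")

def find_anchors(samples, tolerance=5):
    """Return words found at (nearly) the same position in every sample."""
    first, rest = samples[0], samples[1:]
    return [
        w for w in first
        if all(
            any(o.text == w.text
                and abs(o.x - w.x) <= tolerance
                and abs(o.y - w.y) <= tolerance
                for o in other)
            for other in rest
        )
    ]

def extract_fields(samples, invoice, tolerance=5):
    """Map each fixed label to the nearest non-label word on its right."""
    anchors = find_anchors(samples, tolerance)
    labels = {a.text for a in anchors}
    fields = {}
    for a in anchors:
        candidates = [
            w for w in invoice
            if w.text not in labels
            and abs(w.y - a.y) <= tolerance
            and w.x > a.x
        ]
        if candidates:
            fields[a.text] = min(candidates, key=lambda w: w.x).text
    return fields

# Two made-up sample invoices from the same vendor.
invoice_a = [Word("Acme", 50, 20), Word("Tools", 110, 20),
             Word("Date:", 400, 50), Word("2007-05-01", 470, 50),
             Word("Invoice:", 400, 70), Word("1001", 470, 70),
             Word("Total:", 400, 600), Word("149.50", 470, 600)]
invoice_b = [Word("Acme", 51, 21), Word("Tools", 111, 21),
             Word("Date:", 402, 51), Word("2007-05-14", 471, 51),
             Word("Invoice:", 401, 71), Word("1002", 471, 71),
             Word("Total:", 401, 601), Word("80.25", 471, 601)]

# A new invoice from that vendor, scanned slightly off.
invoice_new = [Word("Acme", 49, 19), Word("Tools", 109, 19),
               Word("Date:", 399, 49), Word("2007-05-26", 469, 49),
               Word("Invoice:", 399, 69), Word("1003", 469, 69),
               Word("Total:", 399, 599), Word("75.00", 469, 599)]

fields = extract_fields([invoice_a, invoice_b], invoice_new)
print(fields)
```

Real invoices would need more than this: the total moves down the page as the item list grows, and OCR errors mean labels will not always match exactly. It is only meant to show that fixed positions plus a couple of samples go a long way.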

Monday, May 21, 2007


Seems that things are moving along quite quickly. OCRopus is an open source document analysis and OCR system. It uses Tesseract for OCR, plus a bunch of other stuff for statistical analysis, aspell for spell checking, etc. It's not even alpha yet, though.

I'm building it right now.

Hmm. How hard would it be to write a scan -> pdf generator?

I think Google is paying people to work on this stuff.

Glory Be! or Finally Working Free OCR!

I hope this post doesn't spam planetkde. If it does, enjoy my enlightened opinions once again. And blame Google.

The project I'm working on will gather information from numerous sources with the goal of tracking tasks from inception to invoicing. That is the secondary goal, the primary being learning Qt and C++, and keeping an interesting challenge floating in the back of my mind. So I've recently worked on reading caller ID data from a modem, using threads and mutexes, communicating this to the mother ship using dbus, plugin interfaces and all the neat stuff that Qt 4 provides. Great fun. Next, I wanted to set something up that would scan and OCR supplier invoices. Scanning is the first challenge, although there is a libkscan in KDE. Maybe that will be the impetus to migrate from Qt to KDE4, which has been my intention all along.

I have been watching gocr and ocrad for a while. They are quite a ways from being useful. I started considering using wine and some Windows tools. Ugh. Then I ran across Tesseract, the OCR tool originally from HP that was freed and that Google picked up. It works. I have a few scans I was using for tests, in PNM format. Tesseract requires TIFF, so I did a conversion and tried the OCR. Very nice. There are a few errors, mostly in areas where the font was small and blurry. But it definitely works. So now I can scan documents, OCR them, and use QScript to grab the important data.
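For anyone wanting to try the same two steps, here is a minimal sketch in Python. The post doesn't say which tool did the PNM to TIFF conversion, so netpbm's `pnmtotiff` (which writes the TIFF to standard output) is an assumption, as are the function names and the file name. The Tesseract call uses its usual `tesseract image outputbase` form, where Tesseract adds the `.txt` extension itself.

```python
import subprocess
from pathlib import Path

def ocr_commands(pnm_path):
    """Build the two command lines: PNM -> TIFF, then TIFF -> text."""
    pnm = Path(pnm_path)
    tif = pnm.with_suffix(".tif")
    out_base = pnm.with_suffix("")  # tesseract appends ".txt" to this
    # Assumption: netpbm's pnmtotiff, which writes the TIFF to stdout.
    convert = ["pnmtotiff", str(pnm)]
    recognize = ["tesseract", str(tif), str(out_base)]
    return tif, convert, recognize

def run_ocr(pnm_path):
    """Convert the scan, run Tesseract on it, return the text file path."""
    tif, convert, recognize = ocr_commands(pnm_path)
    with open(tif, "wb") as out:
        subprocess.run(convert, stdout=out, check=True)
    subprocess.run(recognize, check=True)
    return Path(recognize[2] + ".txt")

# Show what would be run for a hypothetical scan, without running it.
tif, convert, recognize = ocr_commands("invoice1.pnm")
print(" ".join(convert), ">", tif)
print(" ".join(recognize))
```

Calling `run_ocr("invoice1.pnm")` needs both tools installed and should leave the recognized text in `invoice1.txt`.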

This really means I don't have any more excuses. I've got to get this thing to the point where I can begin using it.

In the May 14, 2007 LugRadio podcast, there was a discussion of what needed fixing in the Linux desktop. Someone suggested that it was already there. I spluttered and fumed as I listened, thinking, where is OCR?! No longer. With working OCR, the next level of tools, such as OCR to PDF and other neat stuff, will come along. Great.
