1.1 The Plan

If you speak German, you might want to read the following thread and especially the linked entries:

The plan is to have a scanner connected to a server. A service running on that server scans pages, OCRs them, packs each set of pages into a PDF, and indexes the result so a web interface can access it. Storing special keywords in the index engine makes faceted search easy.
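To make the facet idea concrete, here is a minimal sketch in plain Python. It is not Lucene and not a decision about the index engine; the class name `TinyIndex`, the keyword fields `sender` and `year`, and the file names are all made up for illustration. Each document is stored with free text plus keyword fields, and a facet is simply a count of keyword values over a set of search hits.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Toy in-memory index: full-text lookup plus keyword facets (illustration only)."""

    def __init__(self):
        self.keywords = {}                  # doc_id -> dict of keyword fields
        self.postings = defaultdict(set)    # lowercased word -> set of doc_ids

    def add(self, doc_id, text, **keywords):
        # Store the keyword fields and index every whitespace-separated word.
        self.keywords[doc_id] = keywords
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word):
        # Return the ids of all documents containing the word, sorted for stable output.
        return sorted(self.postings.get(word.lower(), set()))

    def facets(self, doc_ids, field):
        # Count how often each value of `field` occurs among the given documents.
        return Counter(
            self.keywords[d][field] for d in doc_ids if field in self.keywords[d]
        )


index = TinyIndex()
index.add("a.pdf", "Invoice for electricity", sender="utility", year="2024")
index.add("b.pdf", "Invoice for rent", sender="landlord", year="2024")
index.add("c.pdf", "Insurance letter", sender="utility", year="2023")

hits = index.search("invoice")
by_sender = index.facets(hits, "sender")
by_year = index.facets(hits, "year")
```

A real index engine would do the same thing at scale: the web interface shows the hit list next to the facet counts and lets the user narrow the result by clicking a value.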

Example: you come home, open your mail, feed all business mail through the scanner, and are done with it. Dream: search the OCR'ed text for dates and create reminders in Google Calendar or similar.
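The date part of that dream could start out as simple as the sketch below. It assumes the OCR text contains either German-style dates (15.03.2024) or ISO dates (2024-03-15); the function name `find_dates` and the sample text are made up, and creating the actual calendar reminder is left out entirely.

```python
import re
from datetime import date

# Either DD.MM.YYYY (German style) or YYYY-MM-DD (ISO style).
DATE_PATTERN = re.compile(
    r"\b(?:(\d{1,2})\.(\d{1,2})\.(\d{4})|(\d{4})-(\d{2})-(\d{2}))\b"
)


def find_dates(text):
    """Return all valid calendar dates found in the text, in order of appearance."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        if match.group(1):
            day, month, year = (int(match.group(i)) for i in (1, 2, 3))
        else:
            year, month, day = (int(match.group(i)) for i in (4, 5, 6))
        try:
            found.append(date(year, month, day))
        except ValueError:
            # Looks like a date but is not one (e.g. OCR noise such as 99.99.2024).
            continue
    return found


sample = "Rechnung vom 2024-02-28, zahlbar bis 15.03.2024. Ref 99.99.2024"
dates = find_dates(sample)
```

Validating through `datetime.date` throws away number sequences that only look like dates, which matters because OCR output tends to contain a fair amount of garbage.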

Another plan might be to have a universal (Java?) GUI client with OCR and multi-page PDFs. Or SANE support. Or the well-known cam scanner apps could incorporate scanning from the scanner (since they already have OCR and PDFs).

1.2 Requirements

What is required for this:

  1. Get the scanner into our network. On demand. So you can turn it on, scan, and turn it off.
    Since the scanner is a WiFi AP, a WiFi dongle is required that will try to connect to the scanner. The scanner connection should not disable normal network traffic, though. I wrote a low-level guide for Linux. Network managers should make it easier. Windows should work, too (todo: metric). No matter how, it's necessary to firewall the connection against incoming traffic because the AP is easy to spoof.
  2. Have a service running that connects to the scan service, checks whether a document is inserted, and scans it.
  3. Check if the image is a special image that triggers a "delete last page" or "document finished" event. This can be done via special barcodes that might be laminated for durability. Barcode libraries are available.
  4. Improve the image. Contrast. Rotation. While contrast is easily done, ImageMagick and rotation-fixing scripts etc. might come in handy. The scanner has a not-so-nice auto-crop function that cuts borders, but it is initialized on the first few lines of the image, so it usually cuts too much :(
  5. OCR can be done with free cloud services or free programs like Tesseract. Tesseract needs some ASCII garbage filter, though ;)
  6. PDF it. Libraries are available.
  7. Index it. Lucene? Dunno yet.
  8. Web interface. Dunno yet.
  9. Feedback. How?
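
The control-page logic from step 3 can be sketched without committing to a barcode library. The sketch below assumes that some decoder has already turned each scanned image into either a barcode payload string or None. The payload strings `CTRL:DELETE_LAST_PAGE` and `CTRL:DOCUMENT_FINISHED` and the function name `assemble_documents` are made up for illustration.

```python
# Hypothetical barcode payloads printed on the laminated control sheets.
DELETE_LAST_PAGE = "CTRL:DELETE_LAST_PAGE"
DOCUMENT_FINISHED = "CTRL:DOCUMENT_FINISHED"


def assemble_documents(scans):
    """Group scanned pages into documents.

    `scans` is an iterable of (page, barcode) pairs, where `barcode` is the
    decoded payload of a control sheet or None for a normal page.
    Returns (finished_documents, pages_of_the_unfinished_document).
    """
    documents = []
    current = []
    for page, barcode in scans:
        if barcode == DELETE_LAST_PAGE:
            # Drop the most recent page, if there is one; the control sheet itself is not kept.
            if current:
                current.pop()
        elif barcode == DOCUMENT_FINISHED:
            # Close the current document; an empty document is ignored.
            if current:
                documents.append(current)
                current = []
        else:
            current.append(page)
    return documents, current


scans = [
    ("page1.png", None),
    ("page2_crooked.png", None),
    ("control.png", DELETE_LAST_PAGE),
    ("page2.png", None),
    ("control.png", DOCUMENT_FINISHED),
    ("page3.png", None),
]
finished, pending = assemble_documents(scans)
```

Returning the pending pages separately lets the service keep them around until the next "document finished" sheet arrives, which fits the turn on, scan, turn off usage from step 1.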