I was looking for a way to digitally archive incoming paper bills, receipts, letters, etc. and make them easily searchable via OCR. Ideally, this solution would be as simple as putting a document on the scanner and pressing a button: the digitized document would then be stored on a network drive. I really don’t like having to turn on my computer, open the scanner software, start the OCR, save the file, etc.
Existing Solutions
There are a few interesting scanning solutions out there, but I found them either too expensive or missing some functionality. Here are some examples that might work for you:
- Doxie Go: looks promising (“scan anywhere”), but I couldn’t figure out whether it can scan directly to a network drive; it is also somewhat pricey (> 200 EUR for the Wi-Fi version)
- Fujitsu ScanSnap iX500: even more expensive (close to 400 EUR)
Solution Based on NAS + All-In-One Scanner-Printer
In the end, I decided to use my existing NAS (a Synology DS213) to both OCR and store documents. I bought an HP Officejet Pro 8160 as the scanner because it can scan directly to network drives, and at about 130 EUR it is cheaper than the other scanners (while also being a printer and fax).
Usage
The workflow goes like this:
- Put a document on the scanner glass (or use the scanner’s document feeder)
- Using the scanner’s touch screen, select “scan to network” and the destination drive (pre-configured via the web interface)
What happens in the background:
- The scanner saves the document as a JPG in a shared network folder on the NAS
- A script on the NAS watches this shared folder:
  - Once a new scan appears, it runs OCR and saves the scan as a searchable PDF
  - It also tries to extract a date (e.g., from an invoice) to append to the file name
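The date-extraction step could look roughly like the following. This is only a minimal sketch, not the script actually running on the NAS: the function name `extract_date` is made up for illustration, and it assumes that the OCR text is available as a plain text file and that dates appear in the German DD.MM.YYYY form.

```shell
#!/bin/sh
# Hypothetical helper: print the first German-style date (DD.MM.YYYY)
# found in a text file as YYYY-MM-DD. Prints nothing if no date is found.
extract_date() {
    grep -oE '[0-3][0-9]\.[01][0-9]\.(19|20)[0-9]{2}' "$1" \
        | head -n 1 \
        | sed 's/^\([0-9][0-9]\)\.\([0-9][0-9]\)\.\([0-9][0-9][0-9][0-9]\)$/\3-\2-\1/'
}
```

The result can then be appended to the file name, e.g. `mv outfile.pdf "outfile_$(extract_date outfile.txt).pdf"`. Taking the first date in the text is a simplification; an invoice may also contain due dates or delivery dates.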
Installation of OCR Package on the NAS
Setting up OCR was not as easy as I had hoped, but it worked out in the end. Here is a how-to:
- log in via SSH on the DiskStation
- install the IPKG package manager (instructions)
- install tesseract-OCR from source (to get the latest version)
  - install GCC via the package manager
    ipkg install gcc
  - install leptonica, autoconf, automake, libtool (use 2.4.5; 2.4.6 did not work) from source
  - install tesseract from source
    - run
      ./autogen.sh
    - dirty workaround to make it compile: in ccutils/helpers.h, put smaller numbers in lines 64 and 65 (e.g., 63641362 and 14426950). I did not investigate, but using different constants for the random generator does not seem to be critical for an OCR application…
    - fix pthread issue (original post)
      - back up the pthread libraries found in /opt/arm-none-linux-gnueabi/lib/
        mkdir /opt/arm-none-linux-gnueabi/lib_disabled
        mv /opt/arm-none-linux-gnueabi/lib/libpthread* \
          /opt/arm-none-linux-gnueabi/lib_disabled
      - copy the pthread libraries found in /opt/lib
        cp /lib/libpthread.so.0 /opt/arm-none-linux-gnueabi/lib/
        cd /opt/arm-none-linux-gnueabi/lib/
        ln -s libpthread.so.0 libpthread.so
        ln -s libpthread.so.0 libpthread-2.5.so
    - run
      ./configure
      make
      make install
    - download tessdata to /usr/local/share/tessdata
    - edit /etc/profile and add
      export TESSDATA_PREFIX=/usr/local/share/tessdata/
- configure tesseract to also create text files (used for date extraction below) by editing /usr/local/share/tessdata/configs/pdf as follows
  tessedit_create_txt 1
  tessedit_create_pdf 1
  tessedit_pageseg_mode 1
- test with a sample JPG file (this example uses a German dictionary)
  tesseract scan.jpg outfile -l deu /usr/local/share/tessdata/configs/pdf
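Before running the test, it can help to confirm that tesseract actually finds the language data. The snippet below is a small sketch under the assumption that the installed tesseract supports the `--list-langs` option; the helper name `has_lang` is made up for illustration. Older 3.x builds print the list on stderr, hence the redirect.

```shell
#!/bin/sh
# Hypothetical helper: succeed if tesseract reports the given language
# code (e.g. "deu") as installed. stderr is merged into stdout because
# some versions print the language list there.
has_lang() {
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

For example, `has_lang deu || echo "German language data not found, check TESSDATA_PREFIX" >&2` prints a warning if the German data is missing.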
Setting up Folder Watch
I’m using inotifywait to detect when new files are added to the watched folder. I created a shell script to start and stop watching. Note that everything is configured for German documents and the German date format.
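A watch loop along these lines might look as follows. This is a simplified sketch rather than the actual script: the folder paths, variable names, and function names are placeholders, and it only covers the OCR step (no date extraction, no start/stop handling).

```shell
#!/bin/sh
# Placeholder paths (assumptions); adjust them to your own shares.
WATCH_DIR="${WATCH_DIR:-/volume1/scans}"
OUT_DIR="${OUT_DIR:-/volume1/documents}"
TESS_CONFIG="${TESS_CONFIG:-/usr/local/share/tessdata/configs/pdf}"

# OCR a single scan. With the config shown earlier, tesseract writes
# <name>.pdf and <name>.txt next to each other in OUT_DIR.
process_scan() {
    scan="$1"
    base=$(basename "${scan%.*}")
    tesseract "$scan" "$OUT_DIR/$base" -l deu "$TESS_CONFIG"
}

# Block and run OCR on every JPG that has been completely written
# (close_write) or moved into the watched folder (moved_to).
watch_folder() {
    inotifywait -m -q -e close_write -e moved_to --format '%f' "$WATCH_DIR" |
    while IFS= read -r name; do
        case "$name" in
            *.jpg|*.JPG|*.jpeg|*.JPEG) process_scan "$WATCH_DIR/$name" ;;
        esac
    done
}
```

Calling `watch_folder &` starts watching in the background, and killing that background job stops it. Waiting for `close_write` instead of `create` avoids running OCR on a file the scanner is still writing.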
Additional Configuration Steps
- Configuring the scanner to scan to a specific shared network folder is straightforward using the scanner’s web interface.
- Likewise, sharing a folder from the NAS is straightforward using its excellent web interface.