Scan and automatically OCR receipts, bills, letters, etc. without turning on your computer

I was looking for a way to digitally archive incoming paper bills, receipts, letters, etc. and make them easily searchable via OCR. Ideally, the solution would be as simple as putting a document on the scanner and pressing a button: the digitized document would then be stored on a network drive. I really don’t like having to turn on my computer, open the scanner software, start the OCR, save the file, etc.

Existing Solutions

There are a few interesting scanning solutions out there, but I found them either too expensive or missing some features. Here are two examples that might work for you:

Doxie Go: looks promising (“scan anywhere”), but I couldn’t figure out whether it can scan directly to a network drive; it is also rather pricey (> 200 EUR for the Wi-Fi version)
Fujitsu ScanSnap iX500: even more expensive (close to 400 EUR)

Solution Based on NAS + All-In-One Scanner-Printer

In the end, I decided to use my existing NAS (a Synology DS213) to both OCR and store documents. I bought an HP Officejet Pro 8160 as the scanner because it can scan directly to network drives, and at about 130 EUR it is cheaper than the other scanners (while also being a printer and fax).

Usage

The workflow goes like this:

Put a document on the scanner glass (or use the scanner’s document feeder)
Using the scanner’s touch screen, select “scan to network” and the destination drive (pre-configured via the web interface)

What happens in the background:

The scanner saves the document as a JPG in a shared network folder on the NAS
A script on the NAS watches this shared folder:
- Once a new scan appears, it runs OCR and saves the scan as a searchable PDF
- It also tries to extract a date (e.g., from an invoice) to append to the file name
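
The date extraction step could look roughly like the following. This is only a sketch, not the script I actually run: it assumes the OCR text sits in a plain text file, that the first DD.MM.YYYY pattern in it is the document date, and the function names `extract_date` and `dated_name` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes German-style dates (DD.MM.YYYY) in the OCR text.

# Print the first DD.MM.YYYY date found in the given text file,
# rewritten as YYYY-MM-DD so that file names sort chronologically.
# Prints nothing if no date is found.
extract_date() {
    grep -oE '[0-9]{2}\.[0-9]{2}\.[0-9]{4}' "$1" \
        | head -n 1 \
        | sed -E 's/([0-9]{2})\.([0-9]{2})\.([0-9]{4})/\3-\2-\1/'
}

# Build "<base>_<date>.pdf", or "<base>.pdf" when no date was found.
dated_name() {
    base="$1"
    found="$2"
    echo "${base}${found:+_$found}.pdf"
}
```

For example, `dated_name scan "$(extract_date scan.txt)"` gives the name under which the PDF would be stored. Taking the first date is a simplification; an invoice may list a due date or delivery date before the invoice date.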

Installation of OCR Package on the NAS

Setting up OCR was not as easy as I had hoped, but it worked out in the end. Here is a how-to:

log in to the DiskStation via SSH
install the IPKG package manager (instructions)
install tesseract-OCR from source (to get the latest version)
- install GCC via package manager
```
ipkg install gcc
```
- install leptonica, autoconf, automake, and libtool from source (use libtool 2.4.5; 2.4.6 did not work)
- install tesseract from source
  - run
```
./autogen.sh
```
  - dirty workaround to make it compile: in ccutil/helpers.h, put smaller numbers in lines 64 and 65 (e.g., 63641362 and 14426950). I did not investigate further, but using different constants for the random number generator does not seem to be critical for an OCR application.
  - fix pthread issue (original post)
    - backup the pthread libraries found in /opt/arm-none-linux-gnueabi/lib/
```
mkdir /opt/arm-none-linux-gnueabi/lib_disabled
mv /opt/arm-none-linux-gnueabi/lib/libpthread* \
   /opt/arm-none-linux-gnueabi/lib_disabled
```
    - copy the pthread library found in /lib and create the matching symlinks
```
cp /lib/libpthread.so.0 /opt/arm-none-linux-gnueabi/lib/
cd /opt/arm-none-linux-gnueabi/lib/
ln -s libpthread.so.0 libpthread.so
ln -s libpthread.so.0 libpthread-2.5.so
```
  - run
```
./configure   # only needed if you have not already run it after autogen.sh
make
make install
```
  - download tessdata to /usr/local/share/tessdata
  - edit /etc/profile and add
```
export TESSDATA_PREFIX=/usr/local/share/tessdata/
```
configure tesseract to also create text files (used for the date extraction) by editing /usr/local/share/tessdata/configs/pdf as follows:
```
tessedit_create_txt 1
tessedit_create_pdf 1
tessedit_pageseg_mode 1
```

test with a sample JPG file (this example uses the German dictionary):
```
tesseract scan.jpg outfile -l deu /usr/local/share/tessdata/configs/pdf
```
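
With the config above, the test run should leave both a PDF and a text file next to each other. A quick way to check this might look as follows; it is a sketch that assumes the output base name `outfile` from the test command, and `check_ocr_output` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: succeeds if both <base>.pdf and <base>.txt exist and are non-empty.
check_ocr_output() {
    base="$1"
    for ext in pdf txt; do
        if [ ! -s "$base.$ext" ]; then
            echo "missing or empty: $base.$ext" >&2
            return 1
        fi
    done
    echo "OCR output looks complete: $base.pdf, $base.txt"
}
```

Running `check_ocr_output outfile` after the test command reports which of the two files is missing, which usually points at the config file not being picked up.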

Setting up Folder Watch

I’m using inotifywait to detect when new files are added to the watched folder. I created a shell script to start and stop watching. Note that everything is configured for German documents and German date format.
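
A start/stop script along these lines could look roughly like the sketch below. It is not my exact script: the folder paths, the PID file, the FIFO location, and all function names are assumptions for illustration, and it leaves out the date-based renaming. It reacts to `close_write` so that a scan is only processed once the scanner has finished writing it.

```shell
#!/bin/sh
# Sketch only: all paths below are assumptions, adjust them to your NAS.
WATCH_DIR="${WATCH_DIR:-/volume1/scans}"      # hypothetical share the scanner writes to
OUT_DIR="${OUT_DIR:-/volume1/documents}"      # hypothetical archive folder
PID_FILE="${PID_FILE:-/tmp/scanwatch.pid}"
FIFO="${FIFO:-/tmp/scanwatch.fifo}"
TESS_CONFIG="${TESS_CONFIG:-/usr/local/share/tessdata/configs/pdf}"

# OCR one scan; tesseract writes <name>.pdf and <name>.txt into OUT_DIR.
process_scan() {
    scan="$1"
    base=$(basename "$scan")
    base="${base%.*}"
    tesseract "$scan" "$OUT_DIR/$base" -l deu "$TESS_CONFIG"
}

# Start inotifywait in monitor mode and feed its events through a FIFO
# to a background loop. The PID file holds the inotifywait process itself.
start_watch() {
    [ -p "$FIFO" ] || mkfifo "$FIFO"
    inotifywait -m -q -e close_write -e moved_to --format '%f' "$WATCH_DIR" > "$FIFO" &
    echo $! > "$PID_FILE"
    while read -r file; do
        case "$file" in
            *.jpg|*.JPG) process_scan "$WATCH_DIR/$file" ;;
        esac
    done < "$FIFO" &
}

# Stop inotifywait; the reader loop then sees end-of-file and exits on its own.
stop_watch() {
    [ -f "$PID_FILE" ] || return 0
    kill "$(cat "$PID_FILE")" 2>/dev/null || true
    rm -f "$PID_FILE" "$FIFO"
}

# With no argument the script only defines the functions (e.g. when sourced).
case "${1:-}" in
    start) start_watch ;;
    stop)  stop_watch ;;
esac
```

Saved as, say, `scanwatch.sh`, it would be called with `start` or `stop`. Routing the events through a FIFO keeps the PID of `inotifywait` available, so stopping the watcher does not leave an orphaned process behind.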

Additional Configuration Steps

Configuring the scanner to scan to a specific shared network folder is straightforward using the scanner’s web interface.
Likewise, sharing a folder from the NAS is straightforward using its excellent web interface.
