Open Source: Cheap Open Source Document Imaging

Like many companies and individuals, I still have to deal with paper. Even though the world is becoming increasingly digital, paper still reigns supreme in some areas. When I worked at another company I looked at document imaging systems and found them quite expensive. Below is my open source solution to this common problem; the only cost was the hardware.

Step 1: Find a good scanner

Sometimes it’s hard to find good hardware that is supported in Linux. Many major manufacturers now offer Linux drivers, and Brother is one of them. They have a line of multi-function machines that work very well in Linux with their supplied driver (at least in my experience). The scanner I use is the MFC5440CN. I chose this particular model because it has a 35-page automatic document feeder (ADF). That may not seem like much capacity, but if you have a longer document you can add pages to the ADF while it’s scanning. The scanner is also inexpensive; Amazon has it listed for $123.49 with free shipping. A USB cable is not included with the scanner, so you will have to buy one separately if you don’t already have one.

Step 2: Install drivers & software

This part is incredibly simple compared to some other devices I’ve tried to use, partly because of the simple package management in Debian Linux and partly because Brother supplies a good open source driver. You can simply navigate to their page that contains Linux drivers and download the appropriate one for your distribution. I used the brscan2 Debian installer package.

After the driver is downloaded, it can be installed by typing (as root) “dpkg -i brscan2-0.0.2-1.i386.deb” (without quotes). The only other software required is SANE, ImageMagick, and pdftk. In Debian Linux these are easily installed by running “apt-get install sane imagemagick pdftk” (again, without quotes).
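
Before going further it can be worth confirming that the programs used later in this article are actually on the PATH. The following is a minimal sketch, not part of the original setup; missing_tools is a helper name I made up for illustration. It prints the name of every command it cannot find and prints nothing if all are present.

```shell
#!/bin/sh
# Sketch: report which of the given commands are not on the PATH.
# missing_tools is an illustrative helper name, not a standard tool.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The scan script in Step 3 calls these three programs:
# missing_tools scanadf convert pdftk
```

An empty result from the commented-out call means all three programs were found.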

Now that the driver and software are installed, some changes to /etc/fstab may be necessary, depending on your system. If you already have a line for /proc/bus/usb in your /etc/fstab then you will need to modify it with your favorite text editor (VIM) to read:

(users with 2.4.x kernels use usbdevfs instead of usbfs)
none /proc/bus/usb usbfs auto,devmode=0666 0 0

If you don’t have a line like that in your /etc/fstab you can add it by typing as root:

(users with 2.4.x kernels use usbdevfs instead of usbfs)
echo 'none /proc/bus/usb usbfs auto,devmode=0666 0 0' >> /etc/fstab
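
The two cases above differ only in whether a /proc/bus/usb entry already exists, so a small check can avoid appending a duplicate line. This is a minimal sketch under that assumption; ensure_usbfs_line is a hypothetical helper name, and it only adds a missing entry (an existing one still has to be edited by hand).

```shell
#!/bin/sh
# Sketch: append the usbfs entry only if no /proc/bus/usb entry exists yet.
# ensure_usbfs_line is an illustrative name, not part of any tool.
ensure_usbfs_line() {
    # $1 = path to the fstab file to check
    if grep -q '[[:space:]]/proc/bus/usb[[:space:]]' "$1"; then
        echo "entry for /proc/bus/usb already present; edit it by hand"
    else
        # users with 2.4.x kernels would use usbdevfs instead of usbfs
        echo 'none /proc/bus/usb usbfs auto,devmode=0666 0 0' >> "$1"
        echo "entry added"
    fi
}

# Usage (as root):
# ensure_usbfs_line /etc/fstab
```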

Next, change USB access control:

umount /proc/bus/usb; mount /proc/bus/usb; mknod -m 666 /dev/usbscanner c 180 48

The scanner should now work properly.
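
A quick way to check the result of the mknod step is to test that the device node exists and is readable and writable. The sketch below assumes the /dev/usbscanner path used above; check_scanner_node is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: confirm the scanner device node exists with usable permissions.
# check_scanner_node is an illustrative name, not part of SANE.
check_scanner_node() {
    # $1 = device node path, defaulting to the one created by mknod above
    node="${1:-/dev/usbscanner}"
    if [ -c "$node" ] && [ -r "$node" ] && [ -w "$node" ]; then
        echo "$node is a readable, writable character device"
    else
        echo "$node is missing or not accessible" >&2
        return 1
    fi
}

# Usage:
# check_scanner_node
# scanimage -L    # SANE's own listing of the devices it can see
```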

Step 3: A Small Script

I wrote a small script to do all the processing for me. This way, I can just load some pages into the ADF and run the script. It will scan everything in, convert the images to PDFs, concatenate the individual PDFs, and then delete all the temporary files. It is pasted below:


#!/bin/sh
#Automatic scan/conversion script
#Requires sane, imagemagick, and pdftk

#Scan in the pages (scanadf names its output files image-0001, image-0002, ...)
scanadf --mode "Black & White" --resolution 200

#Convert each page to a pdf file and delete the original image file
for file in image-*
do
    convert "$file" "$file.pdf"
    rm "$file"
done

#Concatenate all the individual pdf files into one single file and delete the original pdf files
pdftk image-*.pdf cat output "$1.pdf"
rm image-*.pdf

exit 0

I have it configured to scan in black and white at 200 dpi. This works fine for the majority of things that I do and results in comparatively small files. If you want color scanning or a higher resolution, change the scanadf options accordingly. I run the script by typing ./scan.sh filename, where filename is whatever I want the output file to be called. The script adds the .pdf extension automatically.
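
The script as written does not check its argument, so running it without a name, or with the name of an existing file, can produce a surprising result. The following is a minimal sketch of a guard that could run before the scanadf line; check_output is a hypothetical helper name and is not part of the script above.

```shell
#!/bin/sh
# Sketch: validate the output name before scanning starts.
# check_output is an illustrative helper, not part of the original script.
check_output() {
    # $1 = output name without the .pdf extension
    if [ -z "$1" ]; then
        echo "usage: scan.sh filename" >&2
        return 1
    fi
    if [ -e "$1.pdf" ]; then
        echo "$1.pdf already exists, refusing to overwrite" >&2
        return 1
    fi
    return 0
}

# In scan.sh this would go before the scanadf line:
# check_output "$1" || exit 1
```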


In all, the installation process (on Debian) takes approximately 10 minutes and the scanner only costs about $125. I don’t know how many pages per hour can be imaged with this setup, but with my settings it takes approximately 2-3 seconds to scan a page and just a couple of minutes to scan a whole semester’s worth of calculus notes. I’m not sure whether this could replace larger document imaging systems used in some companies because of the ADF size, but for personal and small business purposes it’s a cheap and easy open source solution.
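
The 2-3 seconds per page figure does give a rough upper bound on throughput. The arithmetic below is only a sketch: it counts scanning time alone and ignores reloading the 35-page ADF and the PDF conversion step, so real numbers will be lower. pages_per_hour is an illustrative name.

```shell
#!/bin/sh
# Sketch: upper bound on pages per hour from whole seconds per page.
# Ignores ADF reloads and PDF conversion; pages_per_hour is an illustrative name.
pages_per_hour() {
    # $1 = seconds needed to scan one page
    echo $((3600 / $1))
}

pages_per_hour 2   # prints 1800
pages_per_hour 3   # prints 1200
```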

Originally posted at politicalapathy.com.

About Adam Drake

  • Interesting approach. At what point are these files Named so they can later be retrieved?

  • James:

    The file name is supplied as an argument to the script. For example, if you were scanning notes from a class or something you would load them in the scanner and type “./scan.sh classnotes”

    That would create a file called classnotes.pdf in your current directory.

    I hope that answered your question, let me know if it didn’t.

  • Adam:

    Thank you. Don’t computers need a file name to be unique?

  • Frank Russo

    I realize that this may be out of the article’s scope, but what do you use for document storage, organization, and retrieval? I mean, any scanner with a SANE driver and an auto sheet feeder can do what you have described above, but where is the FOSS replacement suite for Paperport/Omnipage/etc.? I have been looking for one for quite a while now. Currently, I am using krusader to stay organized.

    Thanx Much,
    Frank Russo

  • James:
    Filenames must be unique. I don’t have any checks in the script to confirm that a file with the same name doesn’t already exist but that could be easily added. I didn’t need it for my purposes although I may expand this script later, at which time it will be added.

    I don’t use anything special. If the document is to be network-accessible then it will be put on the network drive. Otherwise, I just move it wherever I want it to reside. I don’t have any experience with either of the two commercial packages that you mentioned, so I can’t really comment on them directly.

  • Brian C

    I do something similar, but instead of converting each file to pdf then putting them together, ImageMagick can do that for you in one step.

    convert -adjoin image-* $1.pdf

    There is no need for pdftk either. Of course, this will fail if you have too many files for the command line to handle, but that’s another issue. I’ve done documents of about 10 pages this way.

  • Tom R

    We use the same scanner for our Doc Imaging.
    Currently we scan our delivery copies (75-100 images a day) for our business. We also automatically capture, at the time of printing, an image of the original sales ticket as well as invoices to customers. We store all tickets, invoices, and delivery scans in MySQL databases and use custom-written PHP to search and display data across our intranet. It’s fast, and we tie all three of those documents together along with an image of any associated purchase orders. We have it set up so it is all automatic (except for putting the delivery copies into the scanner). All on Open Source Software! (We also use MSACCESS to put data into MySQL tables automatically.)

  • Karl O. Pinc

    Having to run around and update proprietary drivers whenever I wanted to upgrade my system drove me nuts. Now I stick with the FOSS drivers and painlessly upgrade from the internet in one step whenever I desire, and am much happier.

    There’s a reason they say “Use binary only drivers, hate life.”

  • Bullwinkle

    I am interested in how you would edit or input text/data into these documents or forms and pdf’s? Proprietary apps such as Paperport have had this capability for quite some time, and Novell/Suse has shown some promise in the doc management field with DjVu. Still much work needs to be done.

    Simple storage/management/retrieval isn’t the real problem here…but it’s certainly a start.

  • Richard Cooke

    I have noticed that this printer has an ethernet port.

    I had great trouble getting my broadband modem up and running, and changed to an ethernet modem, which worked ‘just like that’. Would not the same be true of this printer?

    As a relative linux newbie I find your instructions daunting (though good and detailed), and I run mandrake at the moment (about to try SUSE), which would require some changes.

  • Richard,

    USB Broadband modems have been difficult (if not impossible) to configure with Linux and as you said, they do work fine if connected via ethernet.

    This printer does have an ethernet connection, but as I understand it that is solely for the printing function. The scanning function cannot be used in that way regardless of operating system.

    I am happy that you are trying Linux. The learning curve is more than something like Mac OS but after you are comfortable in the OS it is difficult to imagine switching back.

    If you have any problems or need any help let me know and I’ll do my best to assist you.

  • Richard Cooke

    Thank you,

    I have just received said printer. Unfortunately it did not work straight off using the ethernet cable – not sure why. But I will install the rpms tonight and hopefully …! Thanks for your encouragement.

  • Richard,

    I have never used the printer via ethernet, only USB. I hope that it works as well for you as it has for me (the scanning feature that is).

  • Rob Word

    I have the all-in-one unit discussed in this thread. The Ethernet port can be used for scanning. From my understanding, Brother uses saned for network scanning. If you can configure sane on your computer, you should be able to access the scanner over the network.

  • G

    thanks for the great script, i’ve been using it quite a bit lately

    one suggestion… you can substantially reduce the size of pdf files you are creating by tweaking the convert line in the script to read:

    convert $file -compress LZW $file.pdf

  • Thanks for the script, I wanted a bunch more functionality for use with my HP officejet.. (works great in linux over ethernet)

    I didn’t really want the PDF part, so I dropped it. I suppose I could re-add the PDF mode.

    The script is here.

    New features:
    * scans to PNG files
    * has modes for BW, Gray, and Color.
    * has a series of options for paper sizes
    * prompts for dates for filenames
    * supports multi-page documents

    Features I haven’t done yet: (but want to)
    * checking to make sure it won’t over-write files
    * inline OCR
    * PDF mode
    * multiple multi-page documents

    My HP doesn’t seem to auto-detect how many pages are in the feeder, so this version requires you know how many pages are in the hopper ahead of time.

  • Kevin

    Great article, but does the Brother do double sided scanning?

    I am contemplating a similar setup but am leaning towards a Scansnap S300. The only trouble is the Linux drivers are poor, so I would probably connect it using usb over ethernet to a windows box until driver support improved.