Open Source: Cheap Open Source Document Imaging


Like many companies and individuals, I still have to deal with paper. Even though it's becoming an increasingly digital world, paper still reigns supreme in some areas. I have looked at document imaging systems in the past when I worked at another company, and found them to be quite expensive. Below is my open source solution to this common problem, and it only cost me hardware.

Step 1: Find a good scanner

Sometimes it's hard to find good hardware that is supported in Linux. Many major manufacturers now offer Linux drivers, and Brother happens to be one of them. They have a line of multi-function machines that work very well in Linux with their supplied driver (at least in my experience). The scanner I use is the MFC5440CN. I use this particular model because it has a 35-page automatic document feeder (ADF). That may not seem like much capacity, but if you have a longer document you can add pages to the ADF while it's scanning. The scanner is also inexpensive; Amazon has it listed for $123.49 with free shipping. A USB cable is not included with the scanner, so you will have to buy one separately if you don't already have one.

Step 2: Install drivers & software

This part is incredibly simple compared to some other devices I've tried to use, partly because of the simple package management in Debian Linux and partly because Brother supplies a good open source driver. You can simply navigate to their page that contains Linux drivers and download the appropriate one for your distribution. I used the brscan2 Debian installer package.

After the driver is downloaded, it can be installed by typing (as root) "dpkg -i brscan2-0.0.2-1.i386.deb" (without quotes). The only other software required for this is SANE, ImageMagick, and pdftk. In Debian Linux these are easily installed by running "apt-get install sane imagemagick pdftk" (again, without quotes).
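
Before moving on, it can help to confirm that the command-line tools are actually on the PATH. The following is only an illustrative sketch, not part of the original setup. It assumes the packages above provide the commands scanimage (SANE), convert (ImageMagick), and pdftk, and the function name missing_tools is made up for this example.

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper name, not part of any package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names: scanimage (SANE), convert (ImageMagick), pdftk.
missing=$(missing_tools scanimage convert pdftk)
if [ -n "$missing" ]; then
    echo "Not installed yet:" $missing
else
    echo "All scanning tools found."
fi
```

If anything is reported as missing, re-running the apt-get command above is the first thing to try.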

Now that the driver and software are installed, some changes to /etc/fstab may be necessary, depending on your system. If you already have a line for /proc/bus/usb in your /etc/fstab then you will need to modify it with your favorite text editor (VIM) to read:

(users with 2.4.x kernels use usbdevfs instead of usbfs)
none /proc/bus/usb usbfs auto,devmode=0666 0 0
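
To see whether an entry for /proc/bus/usb already exists before editing, something like the following sketch could be used. The function name usbfs_entry is hypothetical; it only reads the file it is given (by default /etc/fstab) and never changes it.

```shell
# Print any non-comment fstab entry whose mount point (field 2) is /proc/bus/usb.
# usbfs_entry is a hypothetical helper; it reads the file and never edits it.
usbfs_entry() {
    fstab=${1:-/etc/fstab}
    awk '$1 !~ /^#/ && $2 == "/proc/bus/usb"' "$fstab"
}

# Example: usbfs_entry              (checks /etc/fstab)
#          usbfs_entry ./fstab.test (checks a copy)
```

No output means there is no such line to modify.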

Article comments

  • 1 - James Eglin

    Feb 01, 2006 at 12:26 pm

    Interesting approach. At what point are these files named so they can later be retrieved?

  • 2 - Adam Drake

    Feb 01, 2006 at 12:58 pm

    James:

    The file name is supplied as an argument to the script. For example, if you were scanning notes from a class or something you would load them in the scanner and type "./scan.sh classnotes"

    That would create a file called classnotes.pdf in your current directory.

    I hope that answered your question, let me know if it didn't.
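
For readers who want a feel for how a script like this can turn one argument into a finished PDF, here is a minimal sketch. It is not the author's scan.sh. The function name scan_to_pdf, the page-NNNN.pnm naming, and the temporary work directory are all assumptions for illustration, and some scanner backends may need extra device-specific scanimage options to select the document feeder.

```shell
# Hypothetical sketch of a scan-to-PDF helper; NOT the author's scan.sh.
# Assumes scanimage (SANE), convert (ImageMagick) and pdftk are installed.
scan_to_pdf() {
    name=$1
    if [ -z "$name" ]; then
        echo "usage: scan_to_pdf NAME" >&2
        return 2
    fi
    workdir=$(mktemp -d) || return 1

    # Batch-scan until the feeder is empty. scanimage typically exits
    # non-zero when the ADF runs out, so ignore its status and look for
    # output files instead.
    scanimage --batch="$workdir/page-%04d.pnm" --format=pnm || true
    set -- "$workdir"/page-*.pnm
    if [ ! -e "$1" ]; then
        echo "no pages were scanned" >&2
        rm -rf "$workdir"
        return 1
    fi

    # Convert each page to a one-page PDF, then join them with pdftk.
    status=0
    for page in "$@"; do
        convert "$page" "$page.pdf" || status=1
    done
    if [ "$status" -eq 0 ]; then
        pdftk "$workdir"/page-*.pnm.pdf cat output "$name.pdf" || status=1
    fi
    rm -rf "$workdir"
    return $status
}

# Usage, mirroring the example in the comment above: scan_to_pdf classnotes
```

The zero-padded page numbers keep the shell's alphabetical glob order identical to the scan order.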

  • 3 - James Eglin

    Feb 01, 2006 at 3:59 pm

    Adam:

    Thank you. Don't computers need a file name to be unique?

  • 4 - Frank Russo

    Feb 01, 2006 at 4:16 pm

    I realize that this may be out of the article's scope, but what do you use for document storage, organization, and retrieval? I mean, any scanner with a SANE driver and an auto sheet feeder can do what you have described above, but where is the FOSS replacement suite for Paperport/Omnipage/etc.? I have been looking for one for quite a while now. Currently, I am using krusader to stay organized.

    Thanx Much,
    Frank Russo

  • 5 - Adam Drake

    Feb 02, 2006 at 9:12 am

    James:
    Filenames must be unique. I don't have any checks in the script to confirm that a file with the same name doesn't already exist but that could be easily added. I didn't need it for my purposes although I may expand this script later, at which time it will be added.

    Frank:
    I don't use anything special. If the document is to be network-accessible then it will be put on the network drive. Otherwise, I just move it wherever I want it to reside. I don't have any experience with either of the two commercial packages that you mentioned, so I can't really comment on them directly.
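
The missing overwrite check mentioned in this reply could look something like the sketch below. It is one possible approach, not what the author later added: instead of refusing to run, it picks the first unused name. The helper name free_pdf_name and the -1, -2 suffix scheme are assumptions for illustration.

```shell
# Print NAME.pdf if it is free, otherwise NAME-1.pdf, NAME-2.pdf, ...
# free_pdf_name is a hypothetical helper, not part of the article's script.
free_pdf_name() {
    base=$1
    candidate="$base.pdf"
    n=1
    while [ -e "$candidate" ]; do
        candidate="$base-$n.pdf"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Example: out=$(free_pdf_name classnotes)
```

This is fine for one person scanning at a desk; it does not protect against two scans racing for the same name at the same moment.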

  • 6 - Brian C

    Feb 02, 2006 at 11:29 am

    I do something similar, but instead of converting each file to pdf then putting them together, ImageMagick can do that for you in one step.

    convert -adjoin image-* $1.pdf

    There is no need for pdftk either. Of course, this will fail if you have too many files for the command line to handle, but that's another issue. I've done documents of about 10 pages this way.
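
One thing to watch with a glob like image-* is page order: the shell expands it alphabetically, so image-10 comes before image-2 unless the numbers are zero-padded. The sketch below lists the pages in numeric order instead. The helper name pages_in_order and the PREFIX-NUMBER.EXT naming scheme are assumptions, since the thread does not show how the scanned files are named.

```shell
# Print files named PREFIX-<number>.<ext> in numeric page order, one per line.
# pages_in_order is a hypothetical helper; the naming scheme is an assumption.
pages_in_order() {
    prefix=$1
    for f in "$prefix"-*; do
        [ -e "$f" ] || continue          # glob matched nothing
        n=${f#"$prefix"-}                # strip "PREFIX-"
        n=${n%%.*}                       # strip extension, leaving the number
        printf '%s\t%s\n' "$n" "$f"
    done | sort -n | cut -f2-
}

# Example, assuming no spaces in the file names:
#   convert -adjoin $(pages_in_order image) "$1.pdf"
```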

  • 7 - Tom R

    Feb 02, 2006 at 1:55 pm

    We use the same scanner for our Doc Imaging.
    Currently we scan our delivery copies (75-100 images a day) for our business. At the time of printing, we also automatically capture an image of the original sales ticket as well as invoices to customers. We store all tickets, invoices, and delivery scans in MySQL databases and use custom-written PHP to search and display data across our intranet. It's fast, and we tie all 3 of those documents together as well as an image of any associated Purchase Orders. We have it set up so it is all automatic (except for putting the delivery copies into the scanner). All on Open Source Software! (we also use MSACCESS to put data into MySQL tables automatically)

  • 8 - Karl O. Pinc

    Feb 02, 2006 at 7:24 pm

    Having to run around and update proprietary drivers whenever I wanted to upgrade my system drove me nuts. Now I stick with the FOSS drivers and painlessly upgrade from the internet in one step whenever I desire, and am much happier.

    There's a reason they say "Use binary only drivers, hate life."

  • 9 - Bullwinkle

    Mar 09, 2006 at 11:01 am

    I am interested in how you would edit or input text/data into these documents, forms, and PDFs. Proprietary apps such as Paperport have had this capability for quite some time, and Novell/Suse has shown some promise in the doc management field with DjVu. Still, much work needs to be done.

    Simple storage/management/retrieval isn't the real problem here...but it's certainly a start.

  • 10 - Richard Cooke

    May 05, 2006 at 10:54 am

    I have noticed that this printer has an ethernet port.

    I had great trouble getting my broadband modem up and running, and changed to an ethernet modem, which worked 'just like that'. Would not the same be true of this printer?

    As a relative Linux newbie I find your instructions daunting (though good and detailed), and I run Mandrake at the moment (about to try SUSE), which would require some changes.

  • 11 - Adam Drake

    May 05, 2006 at 12:50 pm

    Richard,

    USB Broadband modems have been difficult (if not impossible) to configure with Linux and as you said, they do work fine if connected via ethernet.

    This printer does have an ethernet connection, but as I understand it that is solely for the printing function. The scanning function cannot be used in that way regardless of operating system.

    I am happy that you are trying Linux. The learning curve is steeper than that of something like Mac OS, but after you are comfortable in the OS it is difficult to imagine switching back.

    If you have any problems or need any help let me know and I'll do my best to assist you.

  • 12 - Richard Cooke

    May 10, 2006 at 8:20 am

    Thank you,

    I have just received said printer. Unfortunately it did not work straight off using the ethernet cable - not sure why. But I will install the rpms tonight and hopefully ...! Thanks for your encouragement.

  • 13 - Adam Drake

    May 10, 2006 at 9:37 pm

    Richard,

    I have never used the printer via ethernet, only USB. I hope that it works as well for you as it has for me (the scanning feature that is).

  • 14 - Rob Word

    Dec 14, 2006 at 1:37 pm

    I have the all-in-one unit discussed in this thread. The Ethernet port can be used for scanning. From my understanding, Brother uses saned for network scanning. If you can configure sane on your computer, you should be able to access the scanner over the network.

  • 15 - G

    Jan 05, 2007 at 5:56 pm

    thanks for the great script, i've been using it quite a bit lately

    one suggestion... you can substantially reduce the size of pdf files you are creating by tweaking the convert line in the script to read:

    convert $file -compress LZW $file.pdf

  • 16 - SuperQ

    Sep 02, 2007 at 4:41 pm

    Thanks for the script, I wanted a bunch more functionality for use with my HP officejet.. (works great in linux over ethernet)

    I didn't really want the PDF part, so I dropped it.. I suppose I could re-add the PDF mode.

    The script is here.

    New features:
    * scans to PNG files
    * has modes for BW, Gray, and Color.
    * has a series of options for paper sizes
    * prompts for dates for filenames
    * supports multi-page documents

    Features I haven't done yet: (but want to)
    * checking to make sure it won't over-write files
    * inline OCR
    * PDF mode
    * multiple multi-page documents

    My HP doesn't seem to auto-detect how many pages are in the feeder, so this version requires you know how many pages are in the hopper ahead of time.

  • 17 - Kevin

    Jun 12, 2008 at 1:22 am

    Great article, but does the Brother do double sided scanning?

    I am contemplating a similar setup but am leaning towards a Scansnap S300. The only trouble is the Linux drivers are poor, so I would probably connect it using USB over ethernet to a Windows box until driver support improved.
