My scanning & upload workflow

The short version of this post is simply:

  • Cut the spine off the magazine
  • Scan it in at 600DPI as an uncompressed TIFF
  • Convert those TIFFs into a bunch of JPGs inside a PDF
  • Put all the TIFFs into a tar.gz archive
  • Upload the PDF and the archive to the Internet Archive
  • Back up the original scans

But that’s no use to you if you’ve never scanned in a magazine or book before and want to get started, is it? Let me break down each step and explain the methods that have worked for me.

Cut

This is the guillotine I use to slice the spines off stuff I scan. It was about $130 off eBay and isn’t the best constructed unit going around (every moving bit rubs against the frame shedding paint & metal) but it can slice through a few hundred pages at once. Often it fails to cut the last two or three sheets so I do that by hand with a sharp knife.

After cutting approx 25,000 pages, I had to get the blade sharpened as it was struggling to make a clean cut. A local saw/tool sharpening place did a great job, getting it sharper than when I bought it, for only $25 and in only an hour. I’d love to get something like the IDEAL 4350, but it costs around $5,000 new. A bit too spicy for a simple hobby.

Scan

Not all scanners are suitable for document archival and mass scanning. Here’s what I’d be looking for if I had to buy a new scanner today:

  • 600dpi optical scan resolution
  • Duplex capability
  • 50-sheet document feeder
  • Driver-based image enhancement (e.g. text sharpening, de-screening, bleed-through prevention)

Basically any new document scanner from Epson, Canon, Brother, Fujitsu or HP will be fine. I’ve burned a lot of money on second-hand scanners that looked cheap, but I ran into so many problems with driver support that I should have just plonked down the cash for a brand new unit and saved myself a lot of time.

NAPS2 in action.

NAPS2 is what I use to suck in images from the Canon scanners, and Epson Scan 2 works well with the Epson scanner. I don’t do much to the images besides deskewing and cropping anything the scanner’s automatic features fucked up. Here’s a bunch of screenshots of the settings I tend to use in both the Epson and Canon drivers. I prefer NAPS2’s deskewing feature to the Canon driver’s.

For post-processing of books I use ScanTailor Advanced, a fork of ScanTailor. Here’s a video explaining how to use it.

Upload

I save the scanned images directly to my Linux fileserver, where I run the following bash command on each directory of scanned images:

mogrify -verbose -format jpg -quality 75 -strip *.t*f &&
img2pdf -v -o "/videos/scans/upload/${PWD##*/}.pdf" *.jpg &&
tar -cvf "/videos/scans/upload/${PWD##*/}-original-scan-tiffs.tar.gz" --use-compress-prog=pigz *.t*f &&
tar -cvf "/videos/scans/archives/${PWD##*/}.tar.bz2" --use-compress-prog=pbzip2 *.t*f &&
ia upload "${PWD##*/}" "/videos/scans/upload/${PWD##*/}.pdf" "/videos/scans/upload/${PWD##*/}-original-scan-tiffs.tar.gz" \
  --retries 10 \
  --metadata="mediatype:texts" \
  --metadata="title:${PWD##*/}" \
  --metadata="description:uncompressed TIFF 600dpi scans available in tar.gz file" \
  --metadata="subject:magazine, electronics, science, australia" \
  --metadata="language:eng" &&
rm -f "/videos/scans/upload/${PWD##*/}.pdf" &&
rm -f "/videos/scans/upload/${PWD##*/}-original-scan-tiffs.tar.gz" &&
rm -rf "$PWD" && cd /scans

What a mongrel bastard of a shell command! Basically it takes the directory of TIFF files and:

  1. Converts them into JPEGs at quality 75, stripping any embedded metadata.
  2. Uses img2pdf to add the JPEGs into a PDF without re-compressing the images.
  3. Archives the original TIFFs with gzip (via pigz) for the Internet Archive.
  4. Archives the original TIFFs with bzip2 (25-30% smaller than zip) for storage on my server.
  5. Uploads the PDF and tar.gz file to the Internet Archive using the directory name as the unique identifier, along with some basic metadata.
  6. Deletes the PDF, tar.gz, TIFFs and JPEGs.
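
The ${PWD##*/} that shows up all through the command is bash parameter expansion: it strips everything up to and including the last slash from the current directory's path, leaving just the directory name. Here is a minimal sketch of how the file names fall out of that. The scan_names function and the example directory are made up for illustration and are not part of the command itself:

```shell
#!/usr/bin/env bash
# Illustration only: scan_names is a hypothetical helper. The output paths
# mirror the ones used in the command above.

scan_names() {
  local dir="$1"
  # Remove the longest prefix matching "*/", leaving the last path component.
  local id="${dir##*/}"
  printf '%s\n' \
    "$id" \
    "/videos/scans/upload/${id}.pdf" \
    "/videos/scans/upload/${id}-original-scan-tiffs.tar.gz" \
    "/videos/scans/archives/${id}.tar.bz2"
}

# Example with a made-up directory name:
scan_names "/scans/example-magazine-1985-06"
```

Since the directory name doubles as both the Internet Archive identifier and the title, it pays to name the directory properly before scanning into it.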

The reason I do it this way instead of just uploading a PDF to the Internet Archive deserves its own post that I’ll write up eventually. I go through this meandering process for a good reason, trust me!

An uninteresting screenshot of what it looks like when a magazine has finished being uploaded to the Internet Archive via my bastard command.

Because I’m a Linux hack (despite being paid in the past to babysit Linux servers for a living) who is bad with shell scripts, I use byobu to open up multiple shell sessions that’ll convert and upload each magazine in the background.
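
For anyone who would rather not juggle sessions, a rough alternative is to loop over every directory of scans and run the same steps in each one, one after another. This is only a sketch and not the byobu approach described above. The for_each_scan_dir name is hypothetical, and process_scan.sh is a placeholder for a script holding the convert-and-upload command:

```shell
#!/usr/bin/env bash
# Sketch: run a command inside each subdirectory of a scans root, one at a time.
# for_each_scan_dir is a hypothetical helper name.

for_each_scan_dir() {
  local root="$1"; shift
  local dir failed=0
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue   # the glob matched nothing
    # Run in a subshell so the cd does not leak into the caller.
    if ! ( cd "$dir" && "$@" ); then
      echo "failed: $dir" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage (process_scan.sh is a placeholder, see above):
#   for_each_scan_dir /scans /path/to/process_scan.sh
```

A failure in one directory is reported and the loop carries on with the next, so one bad scan does not hold up the rest.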

Backup

At the end of all this, I’m left with a multi-gigabyte bzip2 file of the original scans. I don’t keep the PDFs because it’s great to view the magazines on the Internet Archive whenever I want, and if I ever did need to re-create the PDFs, I can do so easily from the original TIFF files in the archive.
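
Re-creating a PDF from one of those archives is really just the first two steps of the upload command, run against the extracted files. Here is a minimal sketch, assuming GNU tar with bzip2 support, ImageMagick's mogrify and img2pdf are all installed. The rebuild_pdf name and the example paths are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: rebuild a PDF from an archived set of TIFF scans.
# rebuild_pdf is a hypothetical helper. It assumes the TIFFs sit at the top
# level of the archive, which is how the upload command creates it.

rebuild_pdf() {
  local archive="$1" out_pdf="$2"
  if [ ! -f "$archive" ] || [ -z "$out_pdf" ]; then
    echo "usage: rebuild_pdf ARCHIVE.tar.bz2 OUTPUT.pdf" >&2
    return 2
  fi

  local work status=0
  work="$(mktemp -d)" || return 1

  {
    # tar works out the compression by itself when reading an archive.
    tar -xf "$archive" -C "$work" &&
    ( cd "$work" &&
      mogrify -format jpg -quality 75 -strip *.t*f &&
      img2pdf -o "$work/out.pdf" *.jpg ) &&
    mv "$work/out.pdf" "$out_pdf"
  } || status=$?

  rm -rf "$work"   # clean up the extracted TIFFs and JPEGs either way
  return "$status"
}

# Usage with made-up paths:
#   rebuild_pdf /videos/scans/archives/example-magazine.tar.bz2 ./example-magazine.pdf
```

Extracting into a temporary directory keeps the multi-gigabyte pile of TIFFs out of the way and makes cleanup a single rm.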

Right now my backups of the bzip2 files consist of a copy on my local file server and a second copy uploaded to one of the spare OneDrive accounts that come with my Office 365 subscription. I’ve got 4x 1TB accounts there where I can stash stuff for “free”, as I’m already paying for Office so it’s no extra cost for me.

Long term, I need to look for a better solution as I’ve already got over 1.2TB of archived magazines and I’ve just started! Something like Amazon Glacier or Backblaze B2 sounds good in theory, but when I end up with 4TB or more of data in the cloud, those monthly costs add up. Finding something cost-effective (tapes stored at my parents’ place maybe?) will be an interesting project.
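
To get a feel for how those monthly costs add up, the arithmetic is simple enough to script. The price in the example is a placeholder assumption, not a quote from Amazon or Backblaze, so check current pricing before trusting the numbers. The yearly_cost_cents name is made up too:

```shell
#!/usr/bin/env bash
# Sketch: yearly storage cost from a size in TB and a monthly price per TB.
# Prices are passed in cents to stay within bash integer arithmetic.
# Storage only: retrieval and download fees are not included.

yearly_cost_cents() {
  local size_tb="$1" cents_per_tb_month="$2"
  echo $(( size_tb * cents_per_tb_month * 12 ))
}

# Example: 4 TB at an ASSUMED 600 cents ($6.00) per TB per month.
cents="$(yearly_cost_cents 4 600)"
printf '$%d.%02d per year\n' $(( cents / 100 )) $(( cents % 100 ))
# prints: $288.00 per year
```

Plugging in the real per-TB price for each service makes it easy to compare them against the one-off cost of something like a tape drive.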
