Getting Documents Back From JPEG Scans

We’re all looking for documentation, books, and papers. Sometimes we’re lucky and find the pristine PDF, rendered fresh from a word processor or maybe LaTeX. Sometimes we’re not so lucky, and the only thing we can find is a collection of JPEG images with high compression ratios.

Scans of text are not always easy to clean up. Even when they’re well done to begin with, they may be compressed with JPEG at a (too) high compression ratio, leading to conspicuous artifacts. These artifacts must be cleaned up before printing the pages or binding them together into a PDF.

Take for example the classic book by Abramowitz and Stegun, a (dated) reference everyone should (nonetheless) have handy. Since it was produced by the US government, it is in the public domain (as interpreted by some) and therefore available for fair use. Say you find a copy here. You set up a quick wget script to grab all available pages.
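Something like the following would do; the base URL, page-naming scheme, and page count here are placeholders, not those of the actual archive linked above.

# Hypothetical download loop; base_url, the file names, and last_page
# are placeholders, not the actual archive's.
base_url="http://example.org/abramowitz-stegun"
last_page=1000        # placeholder page count
for i in $(seq 1 $last_page)
do
    wget -q -O "page_${i}.jpg" "${base_url}/page_${i}.jpg"
done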

The first things you notice are that the pages aren’t of very high resolution and that they exhibit conspicuous JPEG artifacts (as shown in the first picture above). How does JPEG create these “mosquitoes”?

JPEG (and I’ll be brief, as that’s not really the topic of this post) decomposes the image into frequency bands using the discrete cosine transform, or DCT. The DCT is a transform that compacts energy very well; in other words, it packs a lot of information about the shape of the signal into its low-frequency components. This, in turn, means that the high-frequency components are associated with finer detail. If moderate compression is applied, these high-frequency components are reduced (partly destroyed), but the reconstructed signal doesn’t show much degradation because most of the signal’s information is packed into the low-frequency components. However, if you crank up the compression, even the low-frequency components get degraded and the reconstruction is very bad. Somewhere between very light and very heavy compression, all kinds of artifacts appear. One very typical artifact is “ringing”, which looks like flies buzzing around edges, hence its name: “mosquito noise”.
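A quick way to see this progression for yourself is to recompress a clean text image at increasingly aggressive quality settings with ImageMagick; the file names here are just examples.

# Recompress a clean page at decreasing JPEG quality settings and watch
# the mosquitoes appear around the character edges. clean_page.png is a
# stand-in for any sharp black-on-white text image.
for q in 90 50 20 5
do
    convert clean_page.png -quality $q recompressed_q${q}.jpg
done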

Text images comprise only edges (the boundary between text and paper) and are therefore very sensitive to moderate or high JPEG compression. The result is that such an image is unsuitable for print, and not very pleasant to read on screen either.

The solution? Cleaning up the images!

The ideal solution would be to use software specially conceived for this task, but I don’t know of any. The GREYCstoration plug-in for GIMP doesn’t work well on text images, but it does a great job on natural images, and I use it on occasion.

The thing we want to get rid of, the mosquitoes, are in this case rather pale gray compared to the text. The fact that the mosquitoes are pale suggests thresholding, but if we do it at the original resolution of the image, we also lose the smoothness of the individual characters. If we instead upscale the image, say 5×, with a smart interpolation algorithm, we can then threshold and keep most of the character edges precise. Then we downscale back to the original resolution, which restores anti-aliasing shades of gray.
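Boiled down to its essentials, the idea fits in a single ImageMagick command; the full command I actually used appears further down, and the file names here are only illustrative.

# Minimal sketch: upscale 5x with a smooth filter, threshold to pure
# black and white, then downscale back to the original size.
convert scan.jpg \
    -filter blackman \
    -resize 500% \
    -threshold 55% \
    -resize 20% \
    cleaned.png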

Simple as it is, this method is very effective; compare:

Page 261 (before)

Page 261 (after)

and:

Page 297 (before)

Page 297 (after)

So the idea is to scale the image up, approximate smooth curves/character edges with interpolation, convert to b/w with a cromulent threshold, and scale back down (or not: for printing, a higher resolution may be preferable). (Also, you probably noticed the extra border around the ‘after’ images. That’s because the original images aren’t all exactly the same size, which may cause problems when printing or assembling them into a PDF.)

A piece of the image, scaled up, showing artifacts

Same image, thresholded (hard edges)

Scaled down, produced from the thresholded image

*
* *

I used ImageMagick to do the image processing. It’s clunky and unfriendly, but it is currently one of the most powerful tools around. Originally only capable of converting between image formats, ImageMagick eventually grew into a quite capable image processing package.

# For each page: optionally rotate, upscale 5x with a Blackman filter,
# blur a little to smooth out both the mosquitoes and the interpolation,
# threshold to black and white, then pad (centered, on white) to a
# common canvas so every page ends up with the same dimensions.
convert $page \
    ${rotate[@]} \
    -filter blackman \
    -resize 500\% \
    -blur 6.0 \
    -threshold 55\% \
    -background white \
    -gravity center \
    -extent ${max_width}x${max_height} \
    ${extra_options[@]} \
    -quality 10 \
    ${new_name}
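The variables $page, rotate, extra_options, max_width, max_height, and new_name come from the surrounding script (linked at the end of the post). As a rough guess at how the common canvas size could be computed, something like the loop below would work, keeping in mind that since -extent runs after the 500% resize, the maxima have to be measured in upscaled pixels; this is my sketch, not the actual script.

# Sketch of how max_width and max_height might be obtained; the variable
# names mirror the command above, but the real script may differ.
max_width=0
max_height=0
for page in page_*.jpg
do
    read w h < <(identify -format "%w %h" "$page")
    (( w*5 > max_width  )) && max_width=$((w*5))
    (( h*5 > max_height )) && max_height=$((h*5))
done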

*
* *

Another minor annoyance is that some of the pages from the archive are numbered using Roman numerals, which don’t play well with computers, especially if we want a script to reassemble the pages in the right order. We must convert the Roman numerals into normal numbers (Arabic numerals) before we can do anything with them.

The pages are all processed by a Bash script, and the to_arabic function is as follows:

to_arabic()
{
    last=' '
    n=$1
    v=0
    has_trailing=0
    while [ ${#n} -gt 0 ]
    do
        # consume the number one character at a time
        d=${n:0:1}
        n=${n:1}
        case $d in
            I) ((v+=1))
               ;;

            V) ((v+=5))
               # subtractive notation: IV is 4, not 6
               [ "$last" == "I" ] && ((v-=2))
               ;;

            X) ((v+=10))
               # subtractive notation: IX is 9, not 11
               [ "$last" == "I" ] && ((v-=2))
               ;;

            *) # for trailing a,b,c,...,j: append the letter's rank
               # (a=1, b=2, ...) as an extra digit
               v=${v}$(($(printf "%d" "'$d")-96))
               has_trailing=1
               ;;
        esac

        last=$d
    done

    # cross-cutting concern here, but that's the best place for it:
    # pages without a trailing letter get a 0 appended, so that
    # III (30) sorts before IIIa (31)
    [ $has_trailing == 0 ] && v=${v}0
    echo $v
}

(The Bash function doesn’t deal with L, C, or anything larger, but it could easily be extended to do so. Also, some pages are numbered with trailing letters, such as IIIa, and the function deals with them too, which introduces a cross-cutting concern: what exactly do you do with IIIa? 3.1? 3a?)
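To see what the scheme yields, here are a few values as computed by the function above; the appended digit is what keeps numeric sorting aligned with page order:

to_arabic III    # prints 30
to_arabic IIIa   # prints 31   (the trailing 'a' becomes a 1)
to_arabic IIIb   # prints 32
to_arabic IV     # prints 40
to_arabic IX     # prints 90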

Roman numerals are probably one of the stupidest numerical representations ever. I often joke that the fall of the Roman Empire was due to their number system. Ever tried to carry out a long multiplication or division in Roman numerals? That’d be enough to send you into a murderous, conquering rage.

*
* *

This technique is, of course, suitable only for images that are moderately damaged by compression but otherwise rather clean to begin with. The scans must be free of darker regions (such as the crease in the middle of the book) and other artifacts such as discoloration for this method to be helpful. Otherwise, you’d need extensive preprocessing to get a decent cleaned-up version.
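What that preprocessing looks like depends on the scan, but as one possible example (not something the script above does), a contrast stretch before thresholding can push light-gray shading from a crease or discoloration toward white; the 20%/80% levels are arbitrary and would need tuning per scan.

# Possible preprocessing step: convert to grayscale and stretch the
# contrast so faint shading is flattened to white before thresholding.
convert scan.jpg -colorspace Gray -level 20%,80% preprocessed.png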

*
* *

The complete script can be found here.
