Convert multiple images to searchable PDF (OCR) with free command line tools

2020-04-26

In this example, we'll show how to convert multiple PNG images into a multi-page searchable PDF file. We'll use the following command line tools:

ImageMagick for converting PNGs into multi-page TIFF and PDF files.
Tesseract OCR, an open source OCR engine.
PDFtk Free for overlay joining of PDF files.

Assume we have multiple PNG files, named so that they sort in page order:

image01.png
image02.png
image03.png
image04.png
image05.png
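
Page order in the final PDF follows the order in which the shell expands *.png, so it can help to preview that order first. The following is a minimal sketch; list_pages is a hypothetical helper name, not part of any of the tools above.

```shell
# Hypothetical helper: print PNG file names in the order the shell expands *.png
list_pages() {
  dir="${1:-.}"
  for f in "$dir"/*.png; do
    # Skip the unexpanded pattern when the directory holds no PNG files
    [ -e "$f" ] || continue
    printf '%s\n' "${f##*/}"
  done
}
```

Running list_pages in the image directory prints one file name per line. Zero-padded numbers matter here: without them, image10.png would sort before image2.png.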

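Convert the images to images.tiff and images.pdf. The files may need rotating (-rotate) or trimming (-shave). In the example below, -rotate 270 turns each page by 270 degrees and -shave 0x410 removes 410 pixels from the top and bottom edges; adjust or omit these options to suit your own images.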

magick convert -rotate 270 -shave 0x410 *.png images.tiff
magick convert images.tiff images.pdf

Here, we are creating both images.tiff and images.pdf. We need both because Tesseract reads the TIFF as its OCR input, while images.pdf supplies the visible page images for the final PDF. Both conversions should be lossless, but it is worth opening the output and checking it for compression artifacts.
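
As a quick sanity check before running OCR, the page count of images.tiff can be compared with the number of source PNGs. This is a sketch only: check_tiff_pages is a hypothetical helper, and it assumes ImageMagick 7 (the magick command) is on the PATH.

```shell
# Hypothetical helper: confirm a multi-page TIFF has the expected number of pages.
# Assumes ImageMagick 7 (the `magick` command) is installed.
check_tiff_pages() {
  tiff="$1"
  expected="$2"
  if ! command -v magick >/dev/null 2>&1; then
    echo "magick not found" >&2
    return 2
  fi
  # `magick identify` prints one line per page (frame) of the TIFF
  pages=$(magick identify "$tiff" 2>/dev/null | wc -l | tr -d ' ')
  if [ "$pages" -eq "$expected" ]; then
    echo "OK: $pages pages"
  else
    echo "Mismatch: expected $expected pages, found $pages" >&2
    return 1
  fi
}
```

For the five example files, the call would be check_tiff_pages images.tiff 5. To inspect how the pages were stored, magick identify -verbose images.tiff lists a Compression field for each page.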

Run Tesseract OCR on images.tiff using the English language data to produce a separate text-only PDF, text.pdf:

tesseract images.tiff text -l eng -c textonly_pdf=1 pdf
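
The -l eng option only works if the English language data is installed. A minimal sketch of a check, using the hypothetical helper name has_tesseract_lang and the --list-langs option of the tesseract command:

```shell
# Hypothetical helper: succeed only if Tesseract reports the given language code.
has_tesseract_lang() {
  lang="$1"
  command -v tesseract >/dev/null 2>&1 || return 2
  # --list-langs prints a header line, then one language code per line;
  # some versions write this to stderr, so both streams are searched.
  tesseract --list-langs 2>&1 | grep -qx "$lang"
}
```

For example, has_tesseract_lang eng || echo "install the eng language data first" gives a warning before the OCR step is attempted.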

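Then combine the two PDF files, images.pdf and text.pdf, one on top of the other using PDFtk, generating a new file full.pdf. The multibackground operation places each page of images.pdf behind the corresponding page of text.pdf: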

pdftk text.pdf multibackground images.pdf output full.pdf

The file full.pdf should now be a PDF file with searchable text.
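
The four commands above can be collected into one small script. The following is a sketch rather than a finished tool: the run and make_searchable_pdf names and the DRY_RUN variable are hypothetical, and the rotate and shave values are simply the ones used in this example.

```shell
# Hypothetical wrapper: with DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Run the four steps in order and stop at the first one that fails.
make_searchable_pdf() {
  run magick convert -rotate 270 -shave 0x410 *.png images.tiff &&
  run magick convert images.tiff images.pdf &&
  run tesseract images.tiff text -l eng -c textonly_pdf=1 pdf &&
  run pdftk text.pdf multibackground images.pdf output full.pdf
}
```

Setting DRY_RUN=1 before calling make_searchable_pdf in the image directory shows the exact commands that would run, which is a cheap way to confirm the file order before starting a long OCR job.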