Convert multiple images to searchable PDF (OCR) with free command line tools
In this example, we'll show how to convert multiple PNG images to a multi-page searchable PDF file. We'll use the following command line tools:
- ImageMagick for converting PNGs into multi-page TIFF and PDF files.
- Tesseract OCR, an open source OCR engine.
- PDFtk Free for overlay joining of PDF files.
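Before starting, it can help to confirm that all three tools are on the PATH. The following is a minimal sketch: the function name missing_tools is made up for this example, and the executable names magick, tesseract and pdftk are assumed to match your installation (older ImageMagick 6 installs ship convert instead of magick).

```shell
# Print the name of every command that is not found on the PATH.
# missing_tools is a hypothetical helper name, not part of any of the tools.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Prints nothing when everything is installed.
missing_tools magick tesseract pdftk
```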
Assume we have multiple PNG files, named so that they sort in page order:
image01.png
image02.png
image03.png
image04.png
image05.png
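The shell expands *.png in sorted order, which is why the zero-padded numbers matter: a file named image10.png would sort before image2.png. A small sketch to preview the page order before converting follows; list_pages is a hypothetical helper name.

```shell
# Print the PNG files in a directory in the order the shell glob expands them,
# which is the page order a later "*.png" argument will produce.
# list_pages is a hypothetical helper name.
list_pages() {
  dir=${1:-.}
  for f in "$dir"/*.png; do
    # Skip the unexpanded pattern when there are no PNG files at all.
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done
}

list_pages .
```

If the listing is out of order, renaming the files with zero-padded numbers is usually the simplest fix.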
Convert the images to images.tiff and images.pdf. The files may need rotating (-rotate) or trimming (-shave).
magick convert -rotate 270 -shave 0x410 *.png images.tiff
magick convert images.tiff images.pdf
Here, we are creating both images.tiff and images.pdf. We need both because Tesseract reads the TIFF file as its OCR input, while images.pdf supplies the visible page images in the final file. The conversion should be lossless, but it is worth opening the output and checking it for compression artifacts.
Read the images.tiff file in English and produce a separate text-only PDF, text.pdf, using Tesseract OCR:
tesseract images.tiff text -l eng -c textonly_pdf=1 pdf
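The -l eng option assumes that the English language data is installed. Tesseract can print its installed languages with --list-langs; the sketch below wraps that in a simple check. has_lang is a hypothetical helper name.

```shell
# Succeed if the given Tesseract language code is installed.
# has_lang is a hypothetical helper name; --list-langs is a Tesseract option.
# Some Tesseract versions print the list to stderr, hence 2>&1.
has_lang() {
  tesseract --list-langs 2>&1 | grep -qxF "$1"
}
```

For example, `has_lang eng || echo "English data missing"` prints a warning when the eng data is not found. For other languages, pass the matching code to both has_lang and -l.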
Then combine both PDF files, images.pdf and text.pdf, one on top of the other using PDFtk, generating a new file full.pdf:
pdftk text.pdf multibackground images.pdf output full.pdf
The file full.pdf should now be a PDF file with searchable text.
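Putting the steps together, the whole process could be wrapped in one shell function. This is only a sketch: make_searchable_pdf and its optional output-name argument are hypothetical, the tools are assumed to be installed under the names magick, tesseract and pdftk, and the -rotate and -shave options are left out because they depend on your scans.

```shell
# Build a searchable PDF from the PNG files in the current directory.
# make_searchable_pdf is a hypothetical wrapper around the commands above.
make_searchable_pdf() {
  out=${1:-full.pdf}

  # 1. Multi-page TIFF for OCR, and an image-only PDF for display.
  magick convert *.png images.tiff || return 1
  magick convert images.tiff images.pdf || return 1

  # 2. Text-only PDF with the recognized text (writes text.pdf).
  tesseract images.tiff text -l eng -c textonly_pdf=1 pdf || return 1

  # 3. Place each page of images.pdf behind the matching page of text.pdf.
  pdftk text.pdf multibackground images.pdf output "$out" || return 1
}
```

Calling `make_searchable_pdf` with no argument writes full.pdf; each step stops the function early if its tool reports an error, so a failed OCR run does not produce a half-finished result.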