Performing Optical Character Recognition (OCR) on PDFs

Scanned PDFs usually contain only page images, so their text cannot be searched or selected. OCR adds a text layer that makes the text searchable and selectable. ocrmypdf is a command-line tool that does this.

ocrmypdf -l eng -l deu --optimize 3 input.pdf output.pdf

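To recognize additional languages, install the matching Tesseract language packages from your package manager, for example tesseract-langpack-eng and tesseract-langpack-deu (Fedora naming; Debian and Ubuntu call them tesseract-ocr-eng and tesseract-ocr-deu). Pass -l once per language on the command line, as in -l eng -l deu. By default ocrmypdf applies only lossless optimization. With --optimize 3 the JPEG images are also recompressed lossily, which often makes scanned PDFs noticeably smaller at some cost in image quality.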
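
The steps above can be wrapped in a small script when several scans need processing. The following is only a sketch: the function names has_lang and ocr_all are hypothetical helpers, not part of ocrmypdf, and the languages and optimization level are simply copied from the command above. It assumes that tesseract --list-langs prints one language code per line, which may differ between Tesseract versions.

```shell
# Hypothetical helper: succeed if Tesseract reports the given language code.
# Assumes "tesseract --list-langs" lists one code per line; older versions
# may print the list on stderr, hence the 2>&1.
has_lang() {
    tesseract --list-langs 2>&1 | grep -Fqx -- "$1"
}

# Hypothetical helper: OCR every PDF in SRC_DIR and write the results,
# under the same file names, to OUT_DIR.
# Usage: ocr_all SRC_DIR OUT_DIR
ocr_all() {
    src=$1
    out=$2
    mkdir -p "$out" || return 1
    for pdf in "$src"/*.pdf; do
        # If the directory has no PDFs the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        # Same options as the single-file command; adjust languages as needed.
        ocrmypdf -l eng -l deu --optimize 3 "$pdf" "$out/$(basename "$pdf")" \
            || echo "failed: $pdf" >&2
    done
}
```

A possible use is to check for a language first, then convert a whole directory: has_lang deu && ocr_all scans searchable. Writing to a separate output directory keeps the original scans untouched in case the lossy compression turns out too aggressive.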

The documentation for ocrmypdf is online: https://ocrmypdf.readthedocs.io/