Performing Optical Character Recognition (OCR) on PDFs
Scanned PDFs are often not searchable. OCR makes their text searchable and selectable. ocrmypdf is a command-line tool that performs this task.
For additional language support, install the matching packages, such as tesseract-langpack-eng and tesseract-langpack-deu (package names vary by distribution), from your package manager. Use -l eng -l deu on the command line to specify the languages. The default optimization is lossless. With --optimize 3, the JPEG images are compressed as well, which is often useful for scanned PDFs.
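As a minimal sketch, the options above can be combined in a small shell helper. The function name ocr_scan and the file names in the usage comment are hypothetical placeholders; only the ocrmypdf options come from the text above, and the example assumes the English and German language packs are already installed.

```shell
# Hypothetical helper: run OCR on a scanned PDF with English and German
# recognition and the most aggressive optimization level.
ocr_scan() {
    # $1: input PDF (the scan), $2: output PDF (searchable copy)
    ocrmypdf -l eng -l deu --optimize 3 "$1" "$2"
}

# Usage (placeholder file names):
#   ocr_scan scan.pdf scan_ocr.pdf
```

Leaving out --optimize 3 keeps the default lossless optimization, which may be preferable when image quality matters more than file size.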
The documentation for ocrmypdf is online: https://ocrmypdf.readthedocs.io/