OCRMYPDF(1) | User Commands | OCRMYPDF(1) |
ocrmypdf - add an OCR text layer to PDF files
usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI]
Generates a searchable PDF or PDF/A from a regular PDF.
OCRmyPDF rasterizes each page of the input PDF, optionally corrects page rotation and performs image processing, runs the Tesseract OCR engine on the image, and then creates a PDF from the OCR information.
OCRmyPDF attempts to keep the output file at about the same size as the input. If a file contains losslessly compressed images, the output file will be losslessly compressed as well.
PDF is a page description format that attempts to preserve a layout exactly. A PDF can contain vector objects (such as text or lines) and raster objects (images). A page might have multiple images. OCRmyPDF is prepared to deal with the wide variety of PDFs that exist in the wild.
When a PDF page contains text, OCRmyPDF assumes that the page has already been OCRed or is a "born digital" page that should not be OCRed. The default behavior is to exit in this case without producing a file. You can use the option --skip-text to ignore pages with text, or --force-ocr to rasterize all objects on the page and produce an image-only PDF as output.
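The three behaviors described above can be sketched as a small shell wrapper. The function name ocr_existing_text and its mode keywords (skip, force) are hypothetical and used only for illustration; the only OCRmyPDF options used are --skip-text and --force-ocr, as named in the text.

```shell
# Hypothetical helper: ocr_existing_text MODE INPUT OUTPUT
#   MODE=skip   leave pages that already contain text alone (--skip-text)
#   MODE=force  rasterize every page and OCR it anyway (--force-ocr)
#   otherwise   default behavior: ocrmypdf exits without producing a file
#               if a page already contains text
ocr_existing_text() {
    mode=$1
    input=$2
    output=$3
    case "$mode" in
        skip)  ocrmypdf --skip-text "$input" "$output" ;;
        force) ocrmypdf --force-ocr "$input" "$output" ;;
        *)     ocrmypdf "$input" "$output" ;;
    esac
}
```

For instance, ocr_existing_text skip mixed.pdf mixed_ocr.pdf would OCR only the pages of a partly scanned document that carry no text yet (the file names are placeholders).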
If you are concerned about long-term archiving of PDFs, use the default option --output-type pdfa, which converts the PDF to a standardized PDF/A-2b. This converts images to the sRGB colorspace and removes some features from the PDF, such as JavaScript or forms. If you want to minimize the number of changes made to your PDF, use --output-type pdf.
If OCRmyPDF is given an image file as input, it will attempt to convert the image to a PDF before processing. For more control over the conversion of images to PDF, use img2pdf or other image-to-PDF software.
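As a rough illustration of the choice between the two output types, the sketch below wraps the --output-type option. The helper name ocr_output and its first argument (archive) are assumptions made for this example, not part of OCRmyPDF.

```shell
# Hypothetical helper: ocr_output PURPOSE INPUT OUTPUT
#   PURPOSE=archive  produce PDF/A for long-term archiving (the default type)
#   anything else    produce a regular PDF with fewer changes to the input
ocr_output() {
    if [ "$1" = archive ]; then
        type=pdfa
    else
        type=pdf
    fi
    ocrmypdf --output-type "$type" "$2" "$3"
}
```

Calling ocr_output archive scan.pdf scan_archive.pdf is equivalent to running ocrmypdf with --output-type pdfa on the same (placeholder) file names.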
For example, this command uses img2pdf to convert all .png files beginning with the 'page' prefix to a PDF, fitting each image on A4-sized paper, and sending the result to OCRmyPDF through a pipe.
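The command itself does not appear in this copy of the page. The sketch below shows one way such a pipeline could look, assuming img2pdf's --pagesize option and OCRmyPDF's use of "-" to read its input from standard input. The function name and the output file name are placeholders chosen for this example.

```shell
# Hypothetical helper: images_to_searchable_pdf OUTPUT IMAGE...
# Converts the given images to an A4-sized PDF with img2pdf and pipes the
# result into ocrmypdf, which reads the PDF from stdin ("-").
images_to_searchable_pdf() {
    output=$1
    shift
    img2pdf --pagesize A4 "$@" | ocrmypdf - "$output"
}

# Example call (not executed here):
#   images_to_searchable_pdf myfile.pdf page*.png
```

Piping avoids writing an intermediate PDF to disk; if the intermediate file is wanted for inspection, the two commands can be run separately instead.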
after installing the ocrmypdf-doc package.
January 2019 | ocrmypdf 8.0.0+dfsg |