pdfsandwich - A generator for sandwich OCR pdfs from
scanned pdf files
pdfsandwich [options] inputfile.pdf
pdfsandwich generates "sandwich" OCR pdf files:
pdf files which contain only images (no text) are processed by
optical character recognition (OCR), and the recognized text is added to
each page invisibly "behind" the images. Note that pdfsandwich needs
the following programs: unpaper, convert, gs, hocr2pdf (for tesseract <
3.03), and tesseract. As tesseract >= 3.03 can write pdf files, hocr2pdf
is only needed for older versions of tesseract. Please visit
http://www.tobias-elze.de/pdfsandwich.
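As a quick way to see which of the programs listed above are present, the following is a minimal sketch that looks each name up on PATH. The helper name missing_programs and the two list names are illustrative assumptions, not part of pdfsandwich; the sketch also assumes the default binary names, which the options below allow you to override.

```python
import shutil

# Program names as listed in the description above. hocr2pdf is only
# needed for tesseract < 3.03, so it is reported separately.
ALWAYS_NEEDED = ["unpaper", "convert", "gs", "tesseract"]
ONLY_FOR_OLD_TESSERACT = ["hocr2pdf"]


def missing_programs(names, which=shutil.which):
    """Return the names that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if which(name) is None]


for name in missing_programs(ALWAYS_NEEDED):
    print(f"not found on PATH: {name}")
for name in missing_programs(ONLY_FOR_OLD_TESSERACT):
    print(f"not found on PATH: {name} (only matters for tesseract < 3.03)")
```

The script prints nothing when every program is found.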
- -convert
- -convert filename : name of convert binary (default: convert)
- -coo
- -coo options : additional convert options; make sure
to quote, e.g. -coo "-normalize -black-threshold
75%"; call convert --help or man convert for all convert
options
- -debug
- keep all temporary files in /tmp (for debugging)
- -enforcehocr2pdf
- use hocr2pdf even if tesseract >= 3.03
- -first_page
- -first_page number : number of page to start OCR from (default:
1)
- -gray
- use grayscale for images (default: black and white)
- -grayfilter
- enable unpaper's gray filter; further options can be set by
-unpo
- -gs
- -gs filename : name of gs binary (default: gs); optional, only
required for resizing
- -hocr2pdf
- -hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf);
ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is
set
- -hoo
- -hoo options : additional hocr2pdf options; make sure
to quote
- -identify
- -identify filename : name of identify binary (default:
identify)
- -last_page
- -last_page number : number of page up to which to process OCR
(default: number of pages in inputfile)
- -lang
- -lang language : language of the text; option to tesseract
(default: eng), e.g. eng, deu, deu-frak, fra, rus, swe, spa, ita, ...; see
option -list_langs. Multiple languages may be specified, separated
by plus characters.
- -layout
- -layout { single | double | none } : layout of the scanned pages;
requires unpaper. single: one page per sheet; double: two pages per sheet;
none: no auto-layout (default)
- -list_langs
- list currently available languages and exit; in case of custom binaries of
tesseract, place this after the -tesseract option
- -maxpixels
- -maxpixels NUM : maximal number of pixels allowed for input file; if
(resolution/72)^2 * width * height > maxpixels, then scale page of input
file down prior to OCR so that page size in pixels corresponds to
maxpixels; default: 17415167 (A3 @ 300 dpi)
- -noimage
- do not place the image over the text (requires hocr2pdf; ignored without
-enforcehocr2pdf option)
- -nopreproc
- do not preprocess with unpaper
- -nthreads
- -nthreads number : number of parallel threads (default: guessed
number of CPUs; if guessing fails: 1)
- -o
- -o filename : output file; default: inputfile_ocr.pdf (if extension
is different from .pdf, original extension is kept)
- -omp_thread_limit
- -omp_thread_limit number : number of threads tesseract may use for
each page (default: 1)
- -pagesize
- -pagesize { original | NUMxNUM } : set page size of output pdf
(requires ghostscript). original: same as input file (default); NUMxNUM:
width x height in pixels (e.g. for A4: -pagesize 595x842)
- -pdfinfo
- -pdfinfo filename : name of pdfinfo binary (default: pdfinfo)
- -pdfunite
- -pdfunite filename : name of pdfunite binary (default:
pdfunite)
- -resolution
- -resolution NUM : resolution (dpi) used for OCR (default: 300)
- -rgb
- use RGB color space for images (default: black and white); use with care:
causes problems with some color spaces
- -sloppy_text
- sloppily place text, group words, do not draw single glyphs; ignored for
tesseract >= 3.03 unless option -enforcehocr2pdf is set
- -tesseract
- -tesseract filename : name of tesseract binary (default:
tesseract)
- -tesso
- -tesso options : additional tesseract options; make
sure to quote
- -unpaper
- -unpaper filename : name of unpaper binary (default: unpaper)
- -unpo
- -unpo options : additional unpaper options; make sure
to quote
- -quiet
- suppress output
- -verbose
- produce more output
- -version
- print version and quit
- -help
- Display this list of options
- --help
- Display this list of options
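The -maxpixels test above can be worked through with a short sketch. It assumes that width and height are given in PDF points (1/72 inch), which the division by 72 suggests, and that both dimensions are reduced by the same factor when a page is scaled down. The function names and the A2 sample page are illustrative assumptions, not taken from pdfsandwich itself.

```python
import math

DEFAULT_MAXPIXELS = 17415167  # default quoted for -maxpixels (A3 @ 300 dpi)


def pixel_count(width_pt, height_pt, resolution=300):
    """(resolution/72)^2 * width * height, with sizes assumed to be in points."""
    return (resolution / 72) ** 2 * width_pt * height_pt


def linear_scale(width_pt, height_pt, resolution=300,
                 maxpixels=DEFAULT_MAXPIXELS):
    """Factor for width and height so that the page fits into maxpixels.

    Assumption: both dimensions shrink by the same factor, so the pixel
    count shrinks by its square. Returns 1.0 when no scaling is needed.
    """
    pixels = pixel_count(width_pt, height_pt, resolution)
    if pixels <= maxpixels:
        return 1.0
    return math.sqrt(maxpixels / pixels)


# A4 (595 x 842 points) at 300 dpi stays below the default limit.
print(linear_scale(595, 842))
# A2 (about 1191 x 1684 points) at 300 dpi exceeds it and is scaled down.
print(round(linear_scale(1191, 1684), 3))
```

Under these assumptions an A4 page is left untouched, while an A2 page is reduced until its pixel count matches the limit.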
Via Tesseract, numerous language packages are available; follow the
link http://code.google.com/p/tesseract-ocr/downloads/list for a complete
list. Here is an incomplete selection of supported languages and their
abbreviations:
ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan),
ces (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese),
chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu (German), ell
(Greek), eng (English), enm (Middle English), epo (Esperanto), est (Estonian),
fin (Finnish), fra (French), frm (Middle French), glg (Galician), heb (Hebrew),
hin (Hindi), hrv (Croatian), hun (Hungarian), ind (Indonesian), ita
(Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian),
nld (Dutch), nor (Norwegian), pol (Polish), por (Portuguese), ron
(Romanian), rus (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian),
spa (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl
(Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese)
Multiple languages may be specified, separated by plus characters.
Note that the respective tesseract language package needs to be installed on
your system to be usable by pdfsandwich. Option -list_langs
lists the languages which are available on your system.
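Combining several languages for -lang can be sketched as follows. The sketch only assembles an argument list and executes nothing; the function names and the input file name scan.pdf are placeholders chosen for illustration.

```python
import shlex


def lang_argument(languages):
    """Join tesseract language codes with '+', e.g. ['deu', 'eng'] -> 'deu+eng'."""
    if not languages:
        raise ValueError("at least one language code is required")
    return "+".join(languages)


def build_command(input_pdf, languages, binary="pdfsandwich"):
    """Return the argument list for one pdfsandwich run (nothing is executed)."""
    return [binary, "-lang", lang_argument(languages), input_pdf]


cmd = build_command("scan.pdf", ["deu", "eng"])  # scan.pdf is a placeholder
print(shlex.join(cmd))  # prints: pdfsandwich -lang deu+eng scan.pdf
```

Once pdfsandwich and the matching tesseract language packages are installed, the list can be passed to subprocess.run(cmd, check=True).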
Sources and packages as well as comprehensive help can be found at
http://www.tobias-elze.de/pdfsandwich.
Tobias Elze <sourceforge@tobias-elze.de>