hocr2djvused - hOCR to djvused script converter
hocr2djvused [option...] [hocr-file...]
hocr2djvused reads one or more hOCR[1] files (as produced
by OCRopus[2], Cuneiform[3], or Tesseract[4]) and
converts them to a djvused script.
Unless a filename is explicitly provided on the command line, hOCR
is read from the standard input.
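The invocation described above can be sketched from Python. This is a minimal illustration, not part of the tool: the helper names build_hocr2djvused_argv and write_djvused_script are hypothetical, and the sketch assumes that hocr2djvused is on PATH and writes the generated djvused script to standard output.

```python
import subprocess

# Hypothetical helper: builds an argument vector using only the
# --details option and file arguments documented on this page.
def build_hocr2djvused_argv(hocr_files, details="words"):
    if details not in ("lines", "words", "chars"):
        raise ValueError("details must be 'lines', 'words' or 'chars'")
    return ["hocr2djvused", "--details=" + details, *hocr_files]

# Hypothetical helper: runs the converter and stores its output.
# Assumption: the djvused script is written to standard output.
def write_djvused_script(hocr_files, script_path, details="words"):
    argv = build_hocr2djvused_argv(hocr_files, details)
    with open(script_path, "wb") as out:
        subprocess.run(argv, stdout=out, check=True)
```

A script saved this way would typically be applied to a DjVu file with djvused itself, for example `djvused -f script.txt -s document.djvu`; the file names here are placeholders.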
-t lines, --details=lines
Record location of every line. Don't record locations of
particular words or characters.
-t words, --details=words
Record location of every line and every word. Don't
record locations of particular characters.
This is the default.
-t chars, --details=chars
Record location of every line, every word and every
character.
--word-segmentation=simple
Consider each non-empty sequence of non-whitespace
characters a single word.
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
Use the Unicode Text Segmentation[5] algorithm to
break lines into words.
This option breaks the assumption of some DjVu tools that words are
separated by spaces, and is therefore not recommended.
--rotation=n
Assume that DjVu pages are rotated by n
degrees.
--page-size=widthxheight
Specify that the page size is
width pixels ×
height pixels.
This option is required for hOCR generated by Cuneiform (< 0.8)
and superfluous otherwise.
--html5
Use an HTML5 parser[6], which is more robust but
slower than the default parser.
--fix-utf8
Attempt to fix UTF-8 encoding issues and eliminate
unwanted control characters.
This option might be needed for hOCR generated by Cuneiform[7] or
Tesseract[8].
--version
Output version information and exit.
-h, --help
Display help and exit.
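The rule behind --word-segmentation=simple can be illustrated with a short Python sketch. This is not the tool's own code: simple_words is a hypothetical name, and Python's definition of whitespace may differ in detail from the one hocr2djvused uses.

```python
# Hypothetical illustration of the "simple" rule: every non-empty run
# of non-whitespace characters is treated as one word.
def simple_words(line):
    # str.split() with no arguments splits on runs of whitespace
    # and never returns empty strings.
    return line.split()

print(simple_words("Hello, world!  It's 3.14."))
# prints ['Hello,', 'world!', "It's", '3.14.']
```

Punctuation stays attached to the neighbouring word, which is why this segmentation is described as linguistically incorrect, while the resulting words remain separated by spaces as some DjVu tools expect.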
Please report bugs at:
https://github.com/jwilk/ocrodjvu/issues
1. hOCR
   https://docs.google.com/View?docid=dfxcv4vc_67g844kf
2. OCRopus
   https://code.google.com/p/ocropus/
3. Cuneiform
   https://launchpad.net/cuneiform-linux
4. Tesseract
   https://github.com/tesseract-ocr/tesseract
5. Unicode Text Segmentation
   https://unicode.org/reports/tr29/
6. HTML5 parser
   https://html.spec.whatwg.org/multipage/syntax.html#parsing
7. https://bugs.launchpad.net/cuneiform-linux/+bug/585418
8. https://groups.google.com/d/topic/tesseract-issues/NlYJA3GNDMI