HOCR2DJVUSED(1) hocr2djvused manual HOCR2DJVUSED(1)

hocr2djvused - hOCR to djvused script converter

hocr2djvused [option...] [hocr-file...]

hocr2djvused reads one or more hOCR[1] files (as produced by OCRopus[2], Cuneiform[3], or Tesseract[4]) and converts them to a djvused script.

Unless a filename is explicitly provided on the command line, hOCR is read from the standard input.
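The core of such a conversion can be illustrated with a minimal Python sketch. It is not hocr2djvused's actual code. It assumes that an hOCR element carries its box as `bbox x0 y0 x1 y1` in the title attribute, measured from the top-left corner, and that djvused text zones are measured from the bottom-left corner, so the vertical axis has to be flipped. The function names are hypothetical, and the string escaping is simplified.

```python
import re

def hocr_bbox_to_djvu(title, page_height):
    """Convert an hOCR 'bbox x0 y0 x1 y1' (top-left origin, assumed)
    into (xmin, ymin, xmax, ymax) with a bottom-left origin."""
    match = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)', title)
    if match is None:
        raise ValueError('no bbox found in title: %r' % title)
    x0, y0, x1, y1 = map(int, match.groups())
    # Flip the vertical axis: the hOCR bottom edge becomes the DjVu ymin.
    return (x0, page_height - y1, x1, page_height - y0)

def word_zone(title, text, page_height):
    """Format one word as a djvused-style text zone (simplified escaping)."""
    xmin, ymin, xmax, ymax = hocr_bbox_to_djvu(title, page_height)
    escaped = text.replace('\\', '\\\\').replace('"', '\\"')
    return '(word %d %d %d %d "%s")' % (xmin, ymin, xmax, ymax, escaped)

print(word_zone('bbox 100 200 300 250', 'Hello', 1000))
```

The flip needs the page height, which is one plausible reason the page dimensions matter when the hOCR input does not state them.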

-t lines, --details=lines

Record location of every line. Don't record locations of particular words or characters.

-t words, --details=words

Record location of every line and every word. Don't record locations of particular characters.

This is the default.

-t chars, --details=chars

Record location of every line, every word and every character.
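The effect of the three detail levels can be sketched in Python. This is an illustration only, assuming the nested zone syntax used by djvused text scripts, where a zone holds either a string or child zones. The tuple layout and the names `text_of` and `render` are hypothetical, and string escaping is omitted.

```python
# Each zone is (kind, (xmin, ymin, xmax, ymax), content), where content is
# either a string or a list of child zones. Hypothetical layout.
LEVELS = {'lines': 'line', 'words': 'word', 'chars': 'char'}

def text_of(zone):
    """Collect the text of a zone, joining the words of a line with spaces."""
    kind, _bbox, content = zone
    if isinstance(content, str):
        return content
    separator = ' ' if kind == 'line' else ''
    return separator.join(text_of(child) for child in content)

def render(zone, details='words'):
    """Render a zone, stopping the nesting at the requested detail level."""
    kind, (x0, y0, x1, y1), content = zone
    head = '(%s %d %d %d %d' % (kind, x0, y0, x1, y1)
    if kind == LEVELS[details] or isinstance(content, str):
        # Deepest level wanted: emit the text, drop finer-grained boxes.
        return '%s "%s")' % (head, text_of(zone))
    inner = ' '.join(render(child, details) for child in content)
    return '%s %s)' % (head, inner)

line = ('line', (0, 0, 30, 10), [
    ('word', (0, 0, 10, 10), [('char', (0, 0, 5, 10), 'H'),
                              ('char', (5, 0, 10, 10), 'i')]),
    ('word', (20, 0, 30, 10), [('char', (20, 0, 30, 10), '!')]),
])
for level in ('lines', 'words', 'chars'):
    print(render(line, level))
```

With `lines`, only the line box and its full text survive. With `chars`, every character keeps its own box, which makes the resulting script considerably larger.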

--word-segmentation=simple

Consider each non-empty sequence of non-whitespace characters a single word.

This is the default, despite being linguistically incorrect.
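The simple rule corresponds to plain whitespace splitting. The following Python sketch is an illustration of that rule rather than the tool's own code; `simple_words` is a hypothetical name.

```python
def simple_words(line):
    """Split a line into maximal runs of non-whitespace characters."""
    # str.split() with no argument splits on any run of whitespace
    # and never returns empty strings.
    return line.split()

print(simple_words("Hello, world!  It's 5 p.m."))
```

Punctuation stays attached to the neighbouring word ("Hello," and "world!"), which is why the rule is linguistically incorrect, yet each resulting word is still delimited by spaces.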

--word-segmentation=uax29

Use the Unicode Text Segmentation[5] algorithm to break lines into words.

This option breaks the assumption made by some DjVu tools that words are separated by spaces, and is therefore not recommended.

--rotation=n

Assume that DjVu pages are rotated by n degrees.

--page-size=widthxheight

Specify that the page size is width pixels × height pixels.

This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.

--html5

Use an HTML5 parser[6], which is more robust but slower than the default parser.

--fix-utf8

Attempt to fix UTF-8 encoding issues and eliminate unwanted control characters.

This option might be needed for hOCR generated by Cuneiform[7] or Tesseract[8].
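The kind of cleanup involved can be sketched in Python. This is an assumption about what such a repair might look like, not a description of what --fix-utf8 actually does: invalid byte sequences are replaced with U+FFFD and control characters other than tab, line feed and carriage return are dropped. The name `clean_hocr_bytes` is hypothetical.

```python
import re

# C0 control characters except tab, line feed and carriage return, plus DEL.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def clean_hocr_bytes(raw):
    """Decode bytes as UTF-8, tolerating errors, and strip control characters."""
    # Invalid byte sequences become U+FFFD instead of raising an exception.
    text = raw.decode('utf-8', errors='replace')
    return _CONTROL_CHARS.sub('', text)

print(clean_hocr_bytes(b'caf\xc3\xa9\x00 ok\xff'))
```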

--version

Output version information and exit.

-h, --help

Display help and exit.

Please report bugs at: https://github.com/jwilk/ocrodjvu/issues

djvu(1), ocrodjvu(1), djvu2hocr(1), djvused(1)

1. hOCR
   https://docs.google.com/View?docid=dfxcv4vc_67g844kf
2. OCRopus
   https://code.google.com/p/ocropus/
3. Cuneiform
   https://launchpad.net/cuneiform-linux
4. Tesseract
   https://github.com/tesseract-ocr/tesseract
5. Unicode Text Segmentation
   https://unicode.org/reports/tr29/
6. HTML5 parser
   https://html.spec.whatwg.org/multipage/syntax.html#parsing
7. https://bugs.launchpad.net/cuneiform-linux/+bug/585418
8. https://groups.google.com/d/topic/tesseract-issues/NlYJA3GNDMI

2018-07-12 hocr2djvused 0.10.4