PDF2TXT(1) | PDFMiner Manual | PDF2TXT(1) |
pdf2txt - extracts text contents of PDF files
pdf2txt [option...] file...
pdf2txt extracts the text contents of a PDF file. It extracts all text that is rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) for each text portion. You need to provide a password for a protected PDF document whose access is restricted. You cannot extract any text from a PDF document that does not have extraction permission.
-o file
-p pageno[,pageno,...]
-c codec
-t type
text
html
xml
tag
-D writing-mode
lr-tb
tb-rl
auto
-M char-margin, -L line-margin, -W word-margin
Each value is specified not as an absolute length, but as a ratio relative to the size of the characters in question. The default values are char-margin = 1.0, line-margin = 0.3, and word-margin = 0.2, respectively.
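The relative nature of these margins can be illustrated with a small Python sketch. It is not PDFMiner's layout code: the function names and the char_size parameter are hypothetical, and it assumes that "size" means a single number per character (for example its width or height in points). Only the default ratios are taken from this manual page.

```python
# Illustrative only: these helpers are hypothetical and are not part of PDFMiner.

# Default ratios as stated in this manual page.
DEFAULT_MARGINS = {
    "char-margin": 1.0,
    "line-margin": 0.3,
    "word-margin": 0.2,
}


def margin_to_distance(margin_ratio, char_size):
    """Convert a relative margin into an absolute distance.

    char_size is an assumed single-number measure of a character's size,
    in the same unit as the returned distance.
    """
    return margin_ratio * char_size


def absolute_margins(char_size, margins=DEFAULT_MARGINS):
    """Return the absolute distance implied by each margin for one character size."""
    return {
        name: margin_to_distance(ratio, char_size)
        for name, ratio in margins.items()
    }


if __name__ == "__main__":
    # The same ratios yield larger absolute distances for larger text.
    for size in (10.0, 20.0):
        print(size, absolute_margins(size))
```

Under these assumptions, doubling the character size doubles every absolute distance, which is why a single set of ratios can be applied to both body text and headings.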
-n
-A
-V
-s scale
-m n
-P password
-d
Extract text as an HTML file whose filename is output.html:
$ pdf2txt -o output.html samples/naacl06-shinyama.pdf
Extract a Japanese HTML file in vertical writing:
$ pdf2txt -c euc-jp -D tb-rl -o output.html samples/jo.pdf
Extract text from an encrypted PDF file:
$ pdf2txt -P mypassword -o output.txt secret.pdf
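When pdf2txt is driven from a script, the command lines above can be assembled programmatically. The following Python sketch builds an argument list using only options listed on this page (-o, -p, -c, -t, -P). The helper name build_pdf2txt_args and its parameters are hypothetical, and it assumes the pdf2txt executable is available on the PATH under that name.

```python
# Illustrative only: build_pdf2txt_args is a hypothetical helper, not part of PDFMiner.


def build_pdf2txt_args(pdf_path, output=None, pages=None, codec=None,
                       output_type=None, password=None):
    """Build a pdf2txt argument list from options documented on this page."""
    args = ["pdf2txt"]
    if output is not None:
        args += ["-o", output]
    if pages:
        # -p takes a comma-separated list: pageno[,pageno,...]
        args += ["-p", ",".join(str(p) for p in pages)]
    if codec is not None:
        args += ["-c", codec]
    if output_type is not None:
        # One of the types listed under -t: text, html, xml, tag
        args += ["-t", output_type]
    if password is not None:
        args += ["-P", password]
    args.append(pdf_path)
    return args


if __name__ == "__main__":
    # Mirrors the encrypted-file example, with options in a different order.
    print(build_pdf2txt_args("secret.pdf", output="output.txt",
                             password="mypassword"))
```

The resulting list can be passed to subprocess.run(args, check=True) to run the command. Passing a list rather than a single shell string avoids quoting problems with passwords or file names that contain spaces.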
Jakub Wilk <jwilk@debian.org>
Yusuke Shinyama <yusuke@cs.nyu.edu>
01/12/2019 | pdf2txt |