html2pdbtxt(1) | General Commands Manual | html2pdbtxt(1) |
html2pdbtxt - HTML to Doc Text converter for Palm Pilots
html2pdbtxt [ -bchars ] [ -ttitle ] [ -uURL ] file.html [ file.txt ]
html2pdbtxt -v
html2pdbtxt converts HTML to text suitable for conversion to a Doc(4) file via txt2pdbdoc(1). If no text filename is given, the generated text is sent to standard output.
The following HTML tags (and their corresponding end tags) are recognized: ADDRESS, A NAME, BLOCKQUOTE, BR, CENTER, DIV, DL, DT, H1, H2, H3, H4, H5, H6, OL, OPTION, PRE, P, SELECT, SCRIPT, STYLE, TABLE, TITLE, UL. In all cases, the most ``reasonable'' thing is done given the constraints of the Doc(4) format, which is essentially plain text. ALT attributes (typically found in IMG tags) have their text extracted and placed between brackets [like this]. All other HTML tags are stripped.
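The ALT-text and tag-stripping behavior described above can be sketched in Python. This is only an illustration using the standard html.parser module, not the converter's actual implementation; the names AltTextStripper and strip_tags are hypothetical, and the special handling of the recognized block-level tags is deliberately left out.

```python
from html.parser import HTMLParser


class AltTextStripper(HTMLParser):
    """Hypothetical helper: keep text, bracket ALT text, drop all tags."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Any non-empty ALT attribute is emitted between brackets.
        for name, value in attrs:
            if name == "alt" and value:
                self.out.append("[" + value + "]")

    def handle_data(self, data):
        # Text between tags is passed through unchanged.
        self.out.append(data)


def strip_tags(html_text):
    parser = AltTextStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(strip_tags('<b>See</b> <img src="x.gif" alt="like this"> here'))
```

The sketch treats every tag the same way; a real converter would also have to decide how to lay out headings, lists, and tables as plain text.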
Both character and numeric (decimal and hexadecimal) HTML entity references are converted to their byte values in the ISO 8859-1 (Latin 1) character set so that they display properly on the Pilot. For example, ``r&eacute;sum&eacute;'' becomes ``résumé'', with accented letter 'e's.
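As a rough illustration of the entity conversion, the following Python sketch decodes entity references and encodes the result as ISO 8859-1 bytes. The function name entities_to_latin1 is hypothetical, and replacing characters that have no Latin 1 value with '?' is an assumption of this sketch, not documented behavior of html2pdbtxt.

```python
import html


def entities_to_latin1(text):
    """Decode HTML entity references, then encode as ISO 8859-1 bytes."""
    decoded = html.unescape(text)
    # Assumption: characters outside Latin 1 are replaced with '?'.
    return decoded.encode("latin-1", errors="replace")


# Named, decimal, and hexadecimal references for the same letter.
print(entities_to_latin1("r&eacute;sum&#233; &#xE9;"))
```

All three reference forms for the accented 'e' yield the same single byte, 0xE9.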
Unless specified with the -t option, the HTML file is scanned for <TITLE> ... </TITLE> tags and, if found, the title is extracted and put on line 1 of the generated file.
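The title lookup can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the tool's own code: the names TitleFinder and find_title are hypothetical, the override parameter stands in for the -t option, and collapsing whitespace inside the title is a choice made for this sketch.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Hypothetical helper: collect the text of the first TITLE element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> and <title> both match.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def find_title(html_text, override=None):
    # An explicit title (as with -t) takes precedence over the document's.
    if override is not None:
        return override
    finder = TitleFinder()
    finder.feed(html_text)
    finder.close()
    # Assumption: runs of whitespace are collapsed so the title fits on one line.
    title = " ".join("".join(finder.parts).split())
    return title or None


print(find_title("<html><head><TITLE>Alice in\n Wonderland</TITLE></head></html>"))
```

Returning None when no title is found leaves it to the caller to decide what, if anything, goes on line 1.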
Bookmarks are placed into the generated file wherever <A NAME="..."> tags are found in the HTML file.
To convert an HTML file to Doc:
html2pdbtxt -u http://www.wonderland.org/ alice.html alice.txt
txt2pdbdoc "`head -1 alice.txt`" alice.txt alice.pdb
pdbtxt2html(1), txt2pdbdoc(1), doc(4), pdb(4)
International Organization for Standardization. ``ISO 8859-1: Information Processing -- 8-bit single-byte coded graphic character sets -- Part 1: Latin alphabet No. 1.'' 1987.
World Wide Web Consortium. ``Character entity references in HTML 4.0.'' HTML 4.0 Specification, http://www.w3.org/
Paul J. Lucas <pauljlucas@mac.com>
January 21, 2005 | html2pdbtxt |