HXPIPE(1) | HTML-XML-utils | HXPIPE(1) |
hxpipe - convert XML file to a format easier to parse with Perl or AWK
hxpipe [ -l ] [ -- ] [ file-or-URL ]
hxpipe parses an HTML or XML file and outputs a line-oriented representation of it that is well suited to further processing with AWK or similar tools. The format is similar to the ESIS (Element Structure Information Set) that is output by nsgmls/onsgmls.
The reverse operation, converting back to mark-up, is performed by the hxunpipe program.
The output format is as follows:
*comment
I.e., a comment is written as a single line starting with "*" followed by the text of the comment. Line feeds, carriage returns and tabs in the text are written as "\n", "\r" and "\t", respectively. Text that looks like a numerical character entity is written with the "&" replaced by "\" (for example, "&#233;" is written as "\#233;"). The line ends with a line feed.
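A consumer of this format usually has to undo these escapes. The following is a minimal Python sketch of that step, covering only the four escapes named above. The function name `unescape` is hypothetical, and any other backslash sequence is passed through unchanged because the text above does not describe how it is written.

```python
import re

# Escapes named in the description above; any other backslash is left alone.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "#": "&#"}


def unescape(text: str) -> str:
    """Undo the line feed, carriage return, tab and "\\#" escapes in one pass."""
    return re.sub(r"\\([nrt#])", lambda m: _ESCAPES[m.group(1)], text)


if __name__ == "__main__":
    print(unescape(r"first line\nsecond\tcolumn"))
    print(unescape(r"caf\#233;"))
```

A single regular-expression pass is used instead of chained string replacements so that each escape is decoded exactly once.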
?processing instruction
I.e., a single line starting with a "?" followed by the text of the processing instruction. The text is escaped as for comments (see above).
!root "-//foo//DTD bar//EN" http://example.org/dtd
!root "-//foo//DTD bar//EN"
!root "" http://example.org/dtd
!root ""
These forms correspond, respectively, to a DOCTYPE with (1) both a public and a system identifier, (2) only a public identifier, (3) only a system identifier, or (4) neither of the two. I.e., a single line starting with a "!" and the name of the root element, followed by a space and a possibly empty quoted string, followed optionally by a space and arbitrary text. Note the quotes for the public identifier and the absence of quotes for the system identifier.
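The four DOCTYPE forms can be told apart with one pattern. The sketch below is an illustration, not part of hxpipe: the names `Doctype` and `parse_doctype` are hypothetical, and it assumes that the public identifier never contains a double quote.

```python
import re
from typing import NamedTuple, Optional


class Doctype(NamedTuple):
    root: str
    public_id: str  # empty string when absent
    system_id: str  # empty string when absent


# "!" + root name, a space, a quoted (possibly empty) public identifier,
# then optionally a space and the unquoted system identifier.
_DOCTYPE = re.compile(r'^!(\S*) "([^"]*)"(?: (.*))?$')


def parse_doctype(line: str) -> Optional[Doctype]:
    """Return the parts of a "!" line, or None if the line is not a DOCTYPE."""
    m = _DOCTYPE.match(line.rstrip("\n"))
    if m is None:
        return None
    return Doctype(m.group(1), m.group(2), m.group(3) or "")


if __name__ == "__main__":
    print(parse_doctype('!root "-//foo//DTD bar//EN" http://example.org/dtd'))
    print(parse_doctype('!root ""'))
```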
Aatt1 CDATA value1
Aatt2 CDATA value2
(elt
I.e., a start tag is written as zero or more lines for the attributes, followed by one line for the element type. Each line for an attribute starts with "A" followed by the name of the attribute, a space, the literal string "CDATA", another space, and the attribute value. The text of the attribute value is escaped as for comments (see above). The line for the element type starts with "(" followed by the element type.
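Because the attribute value may itself contain spaces, an attribute line should be split at the first two spaces only. A minimal sketch, with the hypothetical name `parse_attribute`, which leaves the value in its escaped form:

```python
from typing import Tuple


def parse_attribute(line: str) -> Tuple[str, str, str]:
    """Split an "A" line into (name, type, value); the value stays escaped."""
    if not line.startswith("A"):
        raise ValueError(f"not an attribute line: {line!r}")
    # Split at most twice so that spaces inside the value are preserved.
    parts = line[1:].rstrip("\n").split(" ", 2)
    name = parts[0]
    kind = parts[1] if len(parts) > 1 else ""
    value = parts[2] if len(parts) > 2 else ""
    return name, kind, value


if __name__ == "__main__":
    print(parse_attribute("Atitle CDATA A title with spaces"))
```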
)elt
I.e., an end tag is written as a line starting with ")" followed by the element type.
Aatt1 CDATA val1
Aatt2 CDATA val2
|empty
I.e., an empty element is written as zero or more lines for attributes, followed by one line starting with "|" followed by the element type.
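Since attribute lines precede the element line they belong to, a reader has to buffer them until the "(" or "|" line arrives. The following sketch shows one way to do that. The name `elements` and the tuple layout are hypothetical, attribute values are left escaped, and all other line types are ignored.

```python
from typing import Dict, Iterable, Iterator, Tuple

Element = Tuple[str, str, Dict[str, str]]  # (kind, element type, attributes)


def elements(lines: Iterable[str]) -> Iterator[Element]:
    """Yield ("start" | "empty", element type, attributes) for each element."""
    pending: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("A"):
            # "Aname CDATA value": buffer until the element line is seen.
            name, _, rest = line[1:].partition(" ")
            _kind, _, value = rest.partition(" ")
            pending[name] = value
        elif line.startswith("("):
            yield ("start", line[1:], pending)
            pending = {}
        elif line.startswith("|"):
            yield ("empty", line[1:], pending)
            pending = {}
        # Comments, text, end tags, etc. are skipped in this sketch.


if __name__ == "__main__":
    sample = ["Aatt1 CDATA val1", "Aatt2 CDATA val2", "|empty"]
    for event in elements(sample):
        print(event)
```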
-text
I.e., text is written as a single line starting with a "-". The text is escaped as for comments (see above).
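A common use of the "-" lines is to pull the plain text out of a document. The sketch below, with the hypothetical name `text_content`, concatenates all text lines and decodes the "\n", "\r" and "\t" escapes. Escaped numerical character entities ("\#...") are left as they are.

```python
import re
from typing import Iterable

_ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}


def text_content(lines: Iterable[str]) -> str:
    """Join the text of all "-" lines and decode whitespace escapes."""
    chunks = [line.rstrip("\n")[1:] for line in lines if line.startswith("-")]
    return re.sub(r"\\([nrt])", lambda m: _ESCAPES[m.group(1)], "".join(chunks))


if __name__ == "__main__":
    sample = ["(p", "-Hello,\\n", "(em", "-world", ")em", ")p"]
    print(text_content(sample))
```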
L12
where "12" is replaced with the line number in the source where the next output came from.
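One way to use the "L" lines is to attach the most recent line number to each line that follows. This sketch (hypothetical name `with_line_numbers`) makes an assumption that is not stated above: a line number stays in effect until the next "L" line replaces it.

```python
from typing import Iterable, Iterator, Optional, Tuple


def with_line_numbers(lines: Iterable[str]) -> Iterator[Tuple[Optional[int], str]]:
    """Pair every non-"L" line with the last source line number seen."""
    current: Optional[int] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("L") and line[1:].isdigit():
            current = int(line[1:])  # remember it, but do not emit the "L" line
        else:
            yield current, line


if __name__ == "__main__":
    for number, line in with_line_numbers(["L12", "(p", "L13", "-text"]):
        print(number, line)
```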
hxpipe does not normalize the input and does not add missing tags. It is thus possible that there are unequal numbers of "(" and ")" lines. If it is important that every start tag is matched by an end tag, pipe the input through hxnormalize -x first.
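A script that depends on balanced tags can check for the problem itself before processing. The following sketch (hypothetical name `unbalanced`) uses a stack to report start tags that were never closed and end tags that do not match the innermost open element. It only detects the imbalance and does not repair it.

```python
from typing import Iterable, List, Tuple


def unbalanced(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (unclosed start tags, unmatched end tags) for hxpipe output."""
    stack: List[str] = []
    stray: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("("):
            stack.append(line[1:])
        elif line.startswith(")"):
            if stack and stack[-1] == line[1:]:
                stack.pop()
            else:
                stray.append(line[1:])  # no matching innermost start tag
    return stack, stray


if __name__ == "__main__":
    print(unbalanced(["(p", "-a", "(b", "-c", ")b"]))
```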
The following options are supported:
The following operand is supported:
The following exit values are returned:
To use a proxy to retrieve remote files, set the environment variables http_proxy and ftp_proxy. E.g., http_proxy="http://localhost:8080/"
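When hxpipe is started from a script, the proxy variables can be set for that one process only. The sketch below assumes that hxpipe is installed and on the PATH. The helper names `proxy_env` and `run_hxpipe` are hypothetical, and the proxy address is the example value from above.

```python
import os
import shutil
import subprocess
from typing import Dict, Mapping, Optional


def proxy_env(http_proxy: str, ftp_proxy: Optional[str] = None,
              base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment (default: the current one) and add proxy settings."""
    env = dict(os.environ if base is None else base)
    env["http_proxy"] = http_proxy
    if ftp_proxy is not None:
        env["ftp_proxy"] = ftp_proxy
    return env


def run_hxpipe(source: str, env: Mapping[str, str]) -> subprocess.CompletedProcess:
    """Run hxpipe on a file or URL with the given environment."""
    exe = shutil.which("hxpipe")
    if exe is None:
        raise FileNotFoundError("hxpipe is not on PATH")
    return subprocess.run([exe, source], env=dict(env),
                          capture_output=True, text=True)


if __name__ == "__main__":
    env = proxy_env("http://localhost:8080/")
    print(env["http_proxy"])
```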
The error recovery for incorrect HTML is primitive. hxpipe can currently only retrieve remote files over HTTP. It doesn't handle password-protected files, nor files whose content depends on HTTP "cookies."
10 Jul 2011 | 7.x |