TAGSOUP(1) | User Commands | TAGSOUP(1) |
tagsoup - convert nasty, ugly HTML to clean XHTML
java -jar /usr/share/java/tagsoup.jar [ options ] [ files ]
Rectify arbitrary HTML into clean XHTML, using a tailored description of HTML. The output will be well-formed XML, but not necessarily valid XHTML.
TagSoup is a parser and reformatter for nasty, ugly HTML. Its normal processing mode is to accept HTML files on the command line, or from the standard input if none are given, and output them as clean XML to the standard output. The encoding is assumed to be the platform-local encoding on input, and is always UTF-8 on output.
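The default mode described above (HTML in on standard input, clean XML out on standard output) can be driven from a script. What follows is a minimal Python sketch, not part of TagSoup. The helper names tagsoup_command and clean_html are hypothetical, and the jar path is the one shown in the synopsis, which may differ on a given system.

```python
import subprocess

# Jar location taken from the synopsis; adjust for the local installation.
TAGSOUP_JAR = "/usr/share/java/tagsoup.jar"


def tagsoup_command(files=(), jar=TAGSOUP_JAR):
    """Build the argument list for a TagSoup run in its default mode.

    With no files, TagSoup reads HTML from standard input.
    """
    return ["java", "-jar", jar, *files]


def clean_html(html_bytes, jar=TAGSOUP_JAR):
    """Pipe HTML through TagSoup and return the XML written to stdout.

    The input bytes must already be in the platform-local encoding,
    because TagSoup does not guess encodings; the output is UTF-8.
    """
    result = subprocess.run(
        tagsoup_command(jar=jar),
        input=html_bytes,
        stdout=subprocess.PIPE,
        check=True,  # raise if java exits with a non-zero status
    )
    return result.stdout.decode("utf-8")
```

Calling clean_html(b"<li>item") would need a Java runtime and the jar to be installed; tagsoup_command on its own only assembles the argument list and runs nothing.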
When the --files option is given, each input file is processed into an output file of the corresponding name, with the extension changed to xhtml. If the extension is already xhtml, it is changed to xhtml_.
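The renaming rule can be sketched as a small Python function. This illustrates the rule as stated here and is not TagSoup's own code. The name output_name is hypothetical, and the handling of a file with no extension at all is an assumption, since the text above does not cover that case.

```python
import os.path


def output_name(src):
    """Return the output file name that the --files rule implies for src."""
    root, ext = os.path.splitext(src)
    if ext == ".xhtml":
        # Already xhtml: the extension becomes xhtml_.
        return src + "_"
    if ext == "":
        # Assumption: a name without an extension simply gains .xhtml.
        return src + ".xhtml"
    # Any other extension is replaced by xhtml.
    return root + ".xhtml"
```

Under this sketch, index.html maps to index.xhtml and index.xhtml maps to index.xhtml_, so an input file is never overwritten by its own output.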
TagSoup will repair, by whatever means necessary, violations of XML well-formedness. In particular, it will fix up malformed attribute names and supply missing attribute-value quotation marks. More significantly, it supplies end-tags where HTML allows them to be omitted, and sometimes where it doesn't. It will even supply start-tags where necessary; for example, if a document begins with a <li> tag, TagSoup will automatically prefix it with <html><body><ul>.
TagSoup can be fooled by missing close quotes after attribute values, and by incorrect character encodings (it does not contain an encoding guesser).
TagSoup doesn't understand namespace declarations, which are not properly part of HTML. Instead, any element or attribute name that begins with a prefix such as foo: will be put into the artificial namespace urn:x-prefix:foo.
For the same reasons, namespace-qualified attributes like xml:space can't be returned as default values, though an explicit attribute in the xml namespace will be returned with the proper namespace URI.
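The prefix-to-namespace mapping described above can be sketched in Python. This is an illustration of the stated rule, not TagSoup's implementation. The name artificial_namespace is hypothetical, and returning None for an unprefixed name is a choice made for this sketch only; it says nothing about which namespace TagSoup assigns to unprefixed names.

```python
# The namespace URI reserved for the xml prefix by the XML Namespaces spec.
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"


def artificial_namespace(qname):
    """Map a prefixed name to the namespace URI described in the text.

    Returns None for names without a prefix (not covered by this sketch).
    """
    prefix, colon, _local = qname.partition(":")
    if not colon:
        return None
    if prefix == "xml":
        # Explicit xml: attributes keep the proper namespace URI.
        return XML_NAMESPACE
    # Any other prefix goes into an artificial urn:x-prefix: namespace.
    return "urn:x-prefix:" + prefix
```

For example, artificial_namespace("foo:bar") yields urn:x-prefix:foo, while artificial_namespace("xml:space") yields the standard XML namespace URI.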
John Cowan <cowan@ccil.org>
Copyright © 2002-2008 John Cowan
TagSoup is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
January 2008 | TagSoup 1.2.1 |