htmlstrip - Strip HTML markup code
htmlstrip [-o outputfile] [-O level] [-b blocksize] [-v] [inputfile]
HTMLstrip reads inputfile or "stdin" and strips the HTML markup it
contains. Use this program to shrink and compact your HTML files in a safe
way.
There are three disjoint types of content which HTMLstrip recognizes
while parsing:
- HTML Tag (tag)
- This is just a single HTML tag, i.e. a string beginning with an opening
angle bracket directly followed by an identifier, optionally followed by
attributes and ending with a closing angle bracket.
- Preformatted (pre)
- This is any content enclosed in one of the following container tags:
1. <nostrip>
2. <pre>
3. <xmp>
The non-HTML-3.2-conforming "<nostrip>" tag is special here: It acts like
"<pre>" as a protection container for HTMLstrip, but the tag itself is
stripped from the output. Use it as a pseudo-block that preserves its body
during HTMLstrip processing without leaving any trace in the result.
- Plain Text (txt)
- This is anything not falling into one of the two other categories, i.e. any
content both outside of preformatted areas and outside of HTML tags.
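The three content types above can be illustrated with a small Python
sketch. This is not HTMLstrip's actual parser; the function name classify
and the regular expression are hypothetical, and the sketch assumes that
closing tags such as "</p>" also count as tags and that container tags are
not nested.

```python
import re

# Hypothetical illustration only, not HTMLstrip's real implementation.
# First alternative: a preformatted container (<nostrip>, <pre>, <xmp>)
# together with its body. Second alternative: a single HTML tag.
_SEGMENT = re.compile(
    r"(?P<pre><(?P<name>nostrip|pre|xmp)\b[^>]*>.*?</(?P=name)\s*>)"
    r"|(?P<tag></?[A-Za-z][^>]*>)",
    re.IGNORECASE | re.DOTALL,
)


def classify(html):
    """Split html into (kind, text) pairs with kind in 'tag', 'pre', 'txt'."""
    segments = []
    pos = 0
    for match in _SEGMENT.finditer(html):
        if match.start() > pos:
            # Everything between two matches is plain text.
            segments.append(("txt", html[pos:match.start()]))
        kind = "pre" if match.group("pre") is not None else "tag"
        segments.append((kind, match.group(0)))
        pos = match.end()
    if pos < len(html):
        segments.append(("txt", html[pos:]))
    return segments
```

Because every character ends up in exactly one segment, joining the text
parts of the result reproduces the input unchanged.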
The amount of stripping can be controlled by an optimization level,
specified via option -O (see below). Higher levels also include all
of the lower levels. The following stripping is done on each level:
- Level 0:
- No real stripping, just removing the sharp/comment-lines ("#...")
[txt,tag]. Such lines are a standard feature of WML, so this is always done.
- Level 1:
- Minimal stripping: Same as level 0 plus stripping of blank and empty lines
[txt].
- Level 2:
- Good stripping: Same as level 1 plus compression of multiple whitespaces
(more than one in sequence) to single whitespaces [txt,tag] and stripping
of trailing whitespaces at the end of a line [txt,tag,pre].
This level is the default because it provides good optimization while
the HTML markup is not destroyed and remains human readable.
- Level 3:
- Best stripping: Same as level 2 plus stripping of leading whitespaces on a
line [txt]. This level can still be recommended when you want to make sure
that the HTML markup is not destroyed in any case, but the resulting code
is a little bit ugly because of the removed whitespaces.
- Level 4:
- Expert stripping: Same as level 3 plus stripping of HTML comment lines
("<!-- ... -->") and crunching of HTML tag ends [tag]. BE CAREFUL HERE:
Comment lines are widely used to hide Java or JavaScript code from browsers
which are not capable of ignoring such code. When using this optimization
level, make sure all your JavaScript code is hidden correctly by adding
HTMLstrip's "<nostrip>" tags around the comment delimiters.
- Level 5:
- Crazy stripping: Same as level 4 plus wrapping lines around to fit in an
80 column view window. This saves some newlines, but it leads to really
unreadable markup code and opens the door to a lot of problems when
this code is used to lay out the page in a browser. Use with care. This
is only experimental!
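How the levels build on each other can be sketched for plain text (txt) at
levels 0 through 3. The following Python is an illustration under stated
assumptions, not HTMLstrip's code: the function name strip_text is
hypothetical, a sharp line is assumed to be any line whose first character
is "#", and whitespace is assumed to mean spaces and tabs. Tags,
preformatted areas and levels 4 and 5 are deliberately left out.

```python
import re


def strip_text(text, level=2):
    """Apply the cumulative stripping levels 0-3 to plain text (sketch)."""
    lines = text.split("\n")

    # Level 0: always remove sharp/comment lines (assumed: "#" in column 1).
    lines = [line for line in lines if not line.startswith("#")]

    if level >= 1:
        # Level 1: drop blank and empty lines.
        lines = [line for line in lines if line.strip() != ""]

    if level >= 2:
        # Level 2: squeeze runs of whitespace to one space and remove
        # trailing whitespace.
        lines = [re.sub(r"[ \t]{2,}", " ", line).rstrip() for line in lines]

    if level >= 3:
        # Level 3: remove leading whitespace as well.
        lines = [line.lstrip() for line in lines]

    return "\n".join(lines)
```

Each level is an extra filter applied on top of the previous ones, which
mirrors the statement that higher levels include all of the lower levels.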
Additionally the following global strippings are done:
- "^\n":
- A leading newline is always stripped.
- "<suck>":
- The "<suck>" tag just absorbs itself and all whitespaces around it. This
is like the backslash for line-continuation, but it is done in Pass 8, i.e.
really at the end. Use this inside HTML tag definitions to absorb
whitespaces, for instance around %body when used inside "<table>"
structures, which at some points are newline-sensitive in Netscape
Navigator.
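The two global strippings can be approximated in a few lines of Python.
This is only a sketch: the function names are hypothetical, "whitespaces
around it" is assumed to include newlines, and the real tool performs the
"<suck>" step as one of several passes rather than as a standalone call.

```python
import re


def strip_leading_newline(html):
    """Remove a single newline at the very start of the document."""
    return html[1:] if html.startswith("\n") else html


def absorb_suck(html):
    """Remove every <suck> tag together with the whitespace around it."""
    return re.sub(r"\s*<suck>\s*", "", html, flags=re.IGNORECASE)
```

For example, absorb_suck("<td>\n  <suck>\n  text\n<suck>\n</td>") yields
"<td>text</td>", so the table cell no longer contains any newlines.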
- -o outputfile
- This redirects the output to outputfile. The output is sent to "stdout"
if no such option is specified or if outputfile is "-".
- -O level
- This sets the optimization/stripping level, i.e. how much HTMLstrip should
compress the contents.
- -b blocksize
- For efficiency reasons, input is divided into blocks of 16384 chars. If
you have some performance problems, you may try to change this value. Any
value between 1024 and 32766 is allowed. With a value of 0, input is not
divided into blocks.
- -v
- This sets verbose mode, in which some processing information is printed
to the console.
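The value rules for -b can be made concrete with a short Python sketch. It
is not taken from HTMLstrip: the names check_blocksize and split_blocks are
hypothetical, the range is assumed to be inclusive at both ends, and the
split shown here is a naive cut at fixed offsets, whereas the real program
may choose its block boundaries differently.

```python
DEFAULT_BLOCKSIZE = 16384


def check_blocksize(value):
    """Accept 0 (no blocking) or a value from 1024 to 32766 (assumed inclusive)."""
    if value == 0 or 1024 <= value <= 32766:
        return value
    raise ValueError("blocksize must be 0 or between 1024 and 32766")


def split_blocks(text, blocksize=DEFAULT_BLOCKSIZE):
    """Cut text into consecutive blocks of at most blocksize characters."""
    blocksize = check_blocksize(blocksize)
    if blocksize == 0:
        # A value of 0 means the input is processed as one piece.
        return [text]
    return [text[i:i + blocksize] for i in range(0, len(text), blocksize)]
```

With a 5000-character input and a block size of 2048, this produces blocks
of 2048, 2048 and 904 characters.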
Ralf S. Engelschall
rse@engelschall.com
www.engelschall.com
Denis Barbier
barbier@engelschall.com