HTML::StripScripts(3pm) | User Contributed Perl Documentation | HTML::StripScripts(3pm) |
HTML::StripScripts - Strip scripting constructs out of HTML
    use HTML::StripScripts;

    my $hss = HTML::StripScripts->new({ Context => 'Inline' });

    $hss->input_start_document;

    $hss->input_start('<i>');
    $hss->input_text('hello, world!');
    $hss->input_end('</i>');

    $hss->input_end_document;

    print $hss->filtered_document;
This module strips scripting constructs out of HTML, leaving as much non-scripting markup in place as possible. This allows web applications to display HTML originating from an untrusted source without introducing XSS (cross site scripting) vulnerabilities.
You will probably use HTML::StripScripts::Parser rather than using this module directly.
The process is based on whitelists of tags, attributes and attribute values. This approach is the most secure against disguised scripting constructs hidden in malicious HTML documents.
As well as removing scripting constructs, this module ensures that there is a matching end for each start tag, and that the tags are properly nested.
Previously, in order to customise the output, you needed to subclass "HTML::StripScripts" and override methods. Now, most customisation can be done through the "Rules" option provided to "new()". (See examples/declaration/ and examples/tags/ for cases where subclassing is necessary.)
The HTML document must be parsed into start tags, end tags and text before it can be filtered by this module. Use either HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you want to input an unparsed HTML document.
See examples/direct/ for an example of how to feed tokens directly to HTML::StripScripts.
    $s = HTML::StripScripts->new({
        Context        => 'Document|Flow|Inline|NoTags',
        BanList        => [qw( br img )] | {br => '1', img => '1'},
        BanAllBut      => [qw(p div span)],
        AllowSrc       => 0|1,
        AllowHref      => 0|1,
        AllowRelURL    => 0|1,
        AllowMailto    => 0|1,
        EscapeFiltered => 0|1,
        Rules          => { See below for details },
    });
If present, the "Context" value must be one of:
The default "Context" value is "Flow".
For example, in a guestbook application where "HR" tags are used to separate posts, you may wish to prevent posts from including "HR" tags, even though "HR" is not an XSS risk.
For instance:
    <br>  -->  &lt;br&gt;
The focus is safety-first, so it is applied after all of the previous validation. This means that all malicious data should already have been cleared.
Rules can be specified for tags and for attributes. Any tag or attribute not explicitly listed will be handled by the default "*" rules.
The following is a synopsis of all of the options that you can use to configure rules. Below, an example is broken into sections and explained.
    Rules => {

        tag => 0 | 1 | sub { tag_callback }
               | {
                   attr     => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                   '*'      => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                   required => [qw(attrname attrname)],
                   tag      => sub { tag_callback }
                 },

        '*' => 0 | 1 | sub { tag_callback }
               | {
                   attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                   '*'  => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                   tag  => sub { tag_callback }
                 }

    }
EXAMPLE:
    Rules => {

        ##########################
        ##### EXPLICIT RULES #####
        ##########################

        ## Allow <br> tags, reject <img> tags
        br          => 1,
        img         => 0,

        ## Send all <div> tags to a sub
        div         => sub { tag_callback },

        ## Allow <blockquote> tags, and allow the 'cite' attribute
        ## All other attributes are handled by the default '*'
        blockquote  => {
            cite    => 1,
        },

        ## Allow <a> tags, and
        a           => {

            ## Allow the 'title' attribute
            title     => 1,

            ## Allow the 'href' attribute if it matches the regex
            href    =>   '^http://yourdomain.com'
       OR   href    =>   qr{^http://yourdomain.com},

            ## 'style' attributes are handled by a sub
            style     => sub { attr_callback },

            ## All other attributes are rejected
            '*'       => 0,

            ## Additionally, the <a> tag should be handled by this sub
            tag       => sub { tag_callback},

            ## If the <a> tag doesn't have these attributes, filter the tag
            required  => [qw(href title)],

        },

        ##########################
        ##### DEFAULT RULES  #####
        ##########################

        ## The default '*' rule - accepts all the same options as above.
        ## If a tag or attribute is not mentioned above, then the default
        ## rule is applied:

        ## Reject all tags
        '*'         => 0,

        ## Allow all tags and all attributes
        '*'         => 1,

        ## Send all tags to the sub
        '*'         => sub { tag_callback },

        ## Allow all tags, reject all attributes
        '*'         => { '*'  => 0 },

        ## Allow all tags, and
        '*' => {

            ## Allow the 'title' attribute
            title   => 1,

            ## Allow the 'href' attribute if it matches the regex
            href    =>   '^http://yourdomain.com'
       OR   href    =>   qr{^http://yourdomain.com},

            ## 'style' attributes are handled by a sub
            style   => sub { attr_callback },

            ## All other attributes are rejected
            '*'     => 0,

            ## Additionally, all tags should be handled by this sub
            tag     => sub { tag_callback},

        },
    }
    sub tag_callback {
        my ($filter,$element) = (@_);

        $element = {
            tag     => 'tag',
            content => 'inner_html',
            attr    => {
                attr_name => 'attr_value',
            }
        };
        return 0 | 1;
    }
A tag callback accepts two parameters: the $filter object and the $element. It should return 0 to completely ignore the tag and its content (which includes any nested HTML tags), or 1 to accept and output the tag.
The $element is a hash ref containing the keys:
If, for instance, you wanted to replace "<b>" tags with "<span>" tags, you could do this:
    sub b_callback {
        my ($filter,$element) = @_;
        $element->{tag} = 'span';
        $element->{attr}{style} = 'font-weight:bold';
        return 1;
    }
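A tag callback can also drop a tag and its content by returning 0. The following is a minimal sketch, not taken from the module's examples: the name div_callback and the "advert" class are assumptions chosen for illustration, and the element keys used are the ones shown in the tag_callback synopsis above.

```perl
use strict;
use warnings;

# Hypothetical tag callback: reject any <div> whose class attribute
# contains the word "advert" (an assumed class name), accept all others.
sub div_callback {
    my ( $filter, $element ) = @_;

    # Fall back to an empty hash so a missing 'attr' key is not autovivified.
    my $attr  = $element->{attr} || {};
    my $class = $attr->{class};

    # Returning 0 drops the tag together with its nested content.
    return 0 if defined $class && $class =~ /\badvert\b/;

    # Returning 1 accepts and outputs the tag.
    return 1;
}
```

Such a sub would be referenced from the rules in the same way as the synopsis shows, for example div => \&div_callback.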
    sub attr_callback {
        my ( $filter, $tag, $attr_name, $attr_val ) = @_;
        return undef | '' | 'value';
    }
Attribute callbacks accept four parameters: the $filter object, the $tag name, the $attr_name and the $attr_val.
A callback should return either "undef" to reject the attribute, or the value to be used. An empty string keeps the attribute, but without a value.
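As an illustration of the return values described above, here is a sketch of an attribute callback that is stricter than a plain regex rule. The name href_callback and the host example.com are assumptions, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical attribute callback: keep an href only when it points at
# example.com (an assumed host); return undef to reject anything else.
sub href_callback {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;

    return undef unless defined $attr_val;

    # The host must be followed by "/" or the end of the string, so that
    # look-alike hosts such as example.com.evil.test do not match.
    return $attr_val =~ m{^https?://example\.com(?:/|\z)}i
        ? $attr_val
        : undef;
}
```

It could be attached to a tag rule in the form shown in the synopsis, for example a => { href => \&href_callback, '*' => 0 }.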
BanAllBut => [qw(p div span)]
The logic works as follows:
    * If BanAllBut exists, then ban everything but the tags in the list
    * Add to the ban list any elements in BanList
    * Any tags mentioned explicitly in Rules (eg a => 0, br => 1) are added or removed from the BanList
    * A default rule of { '*' => 0 } would ban all tags except those mentioned in Rules
    * A default rule of { '*' => 1 } would allow all tags except those disallowed in the ban list, or by explicit rules
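The precedence described in this list can be sketched as a small standalone function. This is only an interpretation of the list, not the module's implementation: tag_is_banned is a hypothetical helper, and it takes the same option names that are passed to new().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::StripScripts): returns 1 if the
# tag would be banned under the given options, 0 otherwise.
sub tag_is_banned {
    my ( $tag, $opt ) = @_;
    my $rules = $opt->{Rules} || {};

    # An explicit rule for the tag overrides the ban lists: a false scalar
    # bans the tag; a true scalar, a sub or a hashref allows it.
    if ( exists $rules->{$tag} ) {
        my $rule = $rules->{$tag};
        return ( !ref($rule) && !$rule ) ? 1 : 0;
    }

    # A default rule of '*' => 0 bans every tag without an explicit rule.
    if ( exists $rules->{'*'} ) {
        my $default = $rules->{'*'};
        return 1 if !ref($default) && !$default;
    }

    # BanList may be an arrayref of tag names or a hashref keyed by tag name.
    my $ban_list = $opt->{BanList} || [];
    my %banned =
        ref $ban_list eq 'HASH'
        ? %$ban_list
        : map { $_ => 1 } @$ban_list;
    return 1 if $banned{$tag};

    # BanAllBut bans everything that is not in its list.
    if ( my $keep = $opt->{BanAllBut} ) {
        my %kept = map { $_ => 1 } @$keep;
        return 1 unless $kept{$tag};
    }

    return 0;
}
```

For instance, with BanAllBut => [qw(p div span)] and Rules => { br => 1 }, the explicit rule lets "br" through even though it is missing from the BanAllBut list.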
This class provides the following methods:
The only reason for subclassing this module now is to add to the list of accepted tags, attributes and styles (See "WHITELIST INITIALIZATION METHODS"). Everything else can be achieved with "Rules".
The "HTML::StripScripts" class is subclassable. Filter objects are plain hashes and "HTML::StripScripts" reserves only hash keys that start with "_hss". The filter configuration can be set up by invoking the hss_init() method, which takes the same arguments as new().
The filter outputs a stream of start tags, end tags, text, comments, declarations and processing instructions, via the following "output_*" methods. Subclasses may override these to intercept the filter output.
The default implementations of the "output_*" methods pass the text on to the output() method. The default implementation of the output() method appends the text to a string, which can be fetched with the filtered_document() method once processing is complete.
If the output() method or the individual "output_*" methods are overridden in a subclass, then filtered_document() will not work in that subclass.
When the filter encounters something in the input document which it cannot transform into an acceptable construct, it invokes one of the following "reject_*" methods to put something in the output document to take the place of the unacceptable construct.
The TEXT parameter is the full text of the unacceptable construct.
The default implementations of these methods output an HTML comment containing the text "filtered". If "EscapeFiltered" is set to true, then the rejected text is HTML escaped instead.
Subclasses may override these methods, but should exercise caution. The TEXT parameter is unfiltered input and may contain malicious constructs.
The filter refers to various whitelists to determine which constructs are acceptable. To modify these whitelists, subclasses can override the following methods.
Each method is called once at object initialization time, and must return a reference to a nested data structure. These references are installed into the object, and used whenever the filter needs to refer to a whitelist.
The default implementations of these methods can be invoked as class methods.
See examples/tags/ and examples/declaration/ for examples of how to override these methods.
It is a hash, and the keys are context names, such as "Flow" and "Inline".
The values in the hash are hashrefs. The keys in these subhashes are lowercase tag names, and the values are context names, specifying the context that the tag provides to any other tags nested within it.
The special context "EMPTY" as a value in a subhash indicates that nothing can be nested within that tag.
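The shape of this structure can be sketched as follows. The entries below are illustrative assumptions, not the module's actual whitelist, and nested_context is a hypothetical helper showing how such a structure could be consulted.

```perl
use strict;
use warnings;

# Illustrative data only: context name => { lowercase tag => context that
# the tag provides to its children }. 'EMPTY' means nothing may be nested.
my %context = (
    Flow   => { p => 'Inline', i => 'Inline', br => 'EMPTY' },
    Inline => { i => 'Inline', br => 'EMPTY' },
);

# Hypothetical lookup: returns the context provided by $tag when it appears
# in $context, or undef if the tag is not allowed there.
sub nested_context {
    my ( $whitelist, $context, $tag ) = @_;
    my $allowed = $whitelist->{$context} or return undef;
    return $allowed->{ lc $tag };
}
```

With this sample data, a "p" tag is allowed in "Flow" and gives its children the "Inline" context, while "p" inside "Inline" has no entry and would not be allowed.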
It is a hash, and the keys are lowercase tag names.
The values in the hash are hashrefs. The keys in these subhashes are lowercase attribute names, and the values are attribute value class names, which are short strings describing the type of values that the attribute can take, such as "color" or "number".
The filter calls the attribute value validation subs with the following parameters:
The validation sub can return undef to indicate that the attribute should be removed from the tag, or it can return the new value for the attribute, in canonical form.
<b>hello <i>world</b> !</i>
Into:
<b>hello <i>world</i></b><i> !</i>
because both "b" and "i" appear as keys in the "DeInter" whitelist.
These methods transform attribute values and non-tag text from the input document into canonical form (see "CANONICAL FORM"), and transform text in canonical form into a suitable form for the output document.
The default implementation unescapes all entities that map to "US-ASCII" characters other than ampersand, and replaces any ampersands that don't form part of valid entities with "&amp;".
The default behavior is the same as that of "text_to_canonical_form()", plus it converts any CR, LF or TAB characters to spaces.
The default implementation simply replaces all ampersands with "&amp;", since that corresponds with the way most browsers treat entities in unquoted values.
The default implementation runs anything that doesn't look like a valid entity through the escape_html_metachars() method.
The default implementation converts CR, LF and TAB characters to a single space, and runs anything that doesn't look like a valid entity through the escape_html_metachars() method.
The default implementation allows only absolute "http" and "https" URLs, permits port numbers and query strings, and imposes reasonable length limits.
It does not URI escape the query string, and it does not guarantee properly formatted URIs; it just tries to give safe URIs. You can always use an attribute callback (see "Attribute Callbacks") to provide stricter handling.
This uses a lightweight regex and does not guarantee that email addresses are properly formatted. You can always use an attribute callback (see "Attribute Callbacks") to provide stricter handling.
The default implementation behaves as validate_href_attribute().
As well as the output, reject, init and cdata methods listed above, it might make sense for subclasses to override the following methods:
The default implementation does no filtering.
The default implementation escapes a minimal set of metacharacters for security against XSS vulnerabilities. The set of characters to escape is a compromise between the need for security and the need to ensure that the filter will work for documents in as many different character sets as possible.
Subclasses which make strong assumptions about the document character set will be able to escape much more aggressively.
The default implementation strips out only NULL characters, in order to avoid scrambling text for as many different character sets as possible.
Subclasses which make some sort of assumption about the character set in use will be able to have a much wider definition of a nonprintable character, and hence a more secure strip_nonprintable() implementation.
References to the following subs appear in the "AttVal" whitelist returned by the init_attval_whitelist() method.
Many of the methods described above deal with text from the input document, encoded in what I call "canonical form", defined as follows:
All characters other than ampersands represent themselves. Literal ampersands are encoded as "&amp;". Non "US-ASCII" characters may appear as literals in whatever character set is in use, or they may appear as named or numeric HTML entities such as "&aelig;", "&#31337;" and "&#xFF;". Unknown named entities such as "&foo;" may appear.
The idea is to be able to reduce input text to a minimal form, without making too many assumptions about the character set in use.
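The ampersand handling in canonical form can be illustrated with a simplified sketch. The function to_canonical_sketch is hypothetical and covers only one part of the conversion: it encodes ampersands that do not begin something shaped like an entity, and leaves entity-shaped sequences (including unknown names) alone. It does not unescape entities that map to "US-ASCII" characters.

```perl
use strict;
use warnings;

# Hypothetical, simplified conversion: encode every ampersand that is not
# followed by a decimal, hexadecimal or named entity body and a semicolon.
sub to_canonical_sketch {
    my ($text) = @_;
    $text =~ s/&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);)/&amp;/g;
    return $text;
}
```

Under this sketch, "fish & chips" becomes "fish &amp; chips", while "&foo;" is left as it is because it has the shape of a named entity.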
The following methods are internal to this class, and should not be invoked from elsewhere. Subclasses should not use or override these methods.
Returns undef if no filters are specified, in which case the attribute filter code has very little performance impact. If any rules are specified, then every tag and attribute is checked.
Checks for:
    - a named attribute rule in a named tag
    - a default * attribute rule in a named tag
    - a named attribute rule in the default * rules
    - a default * attribute rule in the default * rules
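The lookup order in this list can be sketched as a standalone function. This is an illustration of the precedence only, not the module's internal code; find_attr_rule is a hypothetical name, and the sketch ignores the special "tag" and "required" keys that a tag rule may also contain.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first matching attribute rule, trying
# the named tag before the default '*' tag, and within each of those the
# named attribute before the default '*' attribute.
sub find_attr_rule {
    my ( $rules, $tag, $attr ) = @_;

    for my $tag_key ( $tag, '*' ) {
        my $tag_rule = $rules->{$tag_key};

        # Only hashref tag rules can carry attribute rules.
        next unless ref $tag_rule eq 'HASH';

        for my $attr_key ( $attr, '*' ) {
            return $tag_rule->{$attr_key} if exists $tag_rule->{$attr_key};
        }
    }

    # No rule applies to this tag and attribute.
    return undef;
}
```

Note that a '*' attribute rule inside a named tag takes precedence over a named attribute rule in the default '*' rules, as the second and third items of the list state.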
Returns 1 if an allowed context is reached, or 0 if there's no reasonable way to get to an allowed context and the tag should just be rejected.
Such applications may benefit from using the more lightweight HTML::Scrubber::StripScripts module instead.
By default, filtered HTML may not be valid strict XHTML; for instance, empty required attributes may be output. However, with "Rules", it should be possible to force the HTML to validate.
HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex
Original author Nick Cleaton <nick@cleaton.net>
New code added and module maintained by Clinton Gormley <clint@traveljury.com>
Copyright (C) 2003 Nick Cleaton. All Rights Reserved.
Copyright (C) 2007 Clinton Gormley. All Rights Reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
2023-01-24 | perl v5.36.0 |