HTML::Defang - Cleans HTML as well as CSS of scripting and other
executable contents, and neutralises XSS attacks.
use HTML::Defang;

my $InputHtml = "<html><body></body></html>";

my $Defang = HTML::Defang->new(
    context             => $Self,
    fix_mismatched_tags => 1,
    tags_to_callback    => [ qw(br embed img) ],
    tags_callback       => \&DefangTagsCallback,
    url_callback        => \&DefangUrlCallback,
    css_callback        => \&DefangCssCallback,
    attribs_to_callback => [ qw(border src) ],
    attribs_callback    => \&DefangAttribsCallback,
    content_callback    => \&DefangContentCallback,
);

my $SanitizedHtml = $Defang->defang($InputHtml);
# Callback for custom handling of specific HTML tags
sub DefangTagsCallback {
    my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;
    # Explicitly defang this tag, even though it is safe
    return DEFANG_ALWAYS if $lcTag eq 'br';
    # Explicitly whitelist this tag, even though it is unsafe
    return DEFANG_NONE if $lcTag eq 'embed';
    # I am not sure what to do with this tag, so process as HTML::Defang normally would
    return DEFANG_DEFAULT if $lcTag eq 'img';
}
# Callback for custom handling of URLs in HTML attributes as well as style tag/attribute declarations
sub DefangUrlCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;
    # Explicitly allow this URL in tag attributes or stylesheets
    return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;
    # Explicitly defang this URL in tag attributes or stylesheets
    return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;
}
# Callback for custom handling of style tags/attributes
sub DefangCssCallback {
    my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
    my $i = 0;
    foreach (@$Selectors) {
        my $SelectorRule = $$SelectorRules[$i];
        foreach my $KeyValueRules (@$SelectorRule) {
            foreach my $KeyValueRule (@$KeyValueRules) {
                my ($Key, $Value) = @$KeyValueRule;
                # Comment out any '!important' directive
                $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ '!important';
                # Comment out any 'position: fixed;' declaration
                $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ 'position' && $Value =~ 'fixed';
            }
        }
        $i++;
    }
}
# Callback for custom handling of HTML tag attributes
sub DefangAttribsCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;
    # Change all 'border' attribute values to zero.
    $$AttrValR = '0' if $lcAttrKey eq 'border';
    # Defang all 'src' attributes
    return DEFANG_ALWAYS if $lcAttrKey eq 'src';
    return DEFANG_NONE;
}
# Callback for all content between tags (except <style>, <script>, etc.)
sub DefangContentCallback {
    my ($Self, $Defang, $ContentR) = @_;
    $$ContentR =~ s/remove this content//;
}
This module accepts an input HTML and/or CSS string and removes
any executable code including scripting, embedded objects, applets, etc.,
and neutralises any XSS attacks. A whitelist based approach is used which
means only HTML known to be safe is allowed through.
HTML::Defang uses a custom HTML tag parser. The parser has been
designed and tested to work with nasty real-world HTML and to
emulate as closely as possible what browsers actually do with strange-looking
constructs. The test suite has been built from examples from a range of
sources such as http://ha.ckers.org/xss.html and
http://imfo.ru/csstest/css_hacks/import.php to ensure that as many
XSS attack scenarios as possible have been dealt with.
HTML::Defang can make callbacks to client code when it encounters
the following:
- When a specified tag is parsed
- When a specified attribute is parsed
- When a URL is parsed as part of an HTML attribute, or CSS property
value.
- When style data is parsed, as part of an HTML style attribute, or as part
of an HTML <style> tag.
The callbacks include details about the current tag/attribute that
is being parsed, and also give a scalar reference to the input HTML.
Querying pos() on the input HTML should indicate how far the module has
parsed. This gives the client code flexibility in working with
HTML::Defang.
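As a minimal sketch of the position tracking mentioned above, the following plain Perl shows how pos() and \G interact on a string reached through a scalar reference. It does not use HTML::Defang; the sample string and the regexes are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input; inside a callback this would be reached through $HtmlR.
my $Html  = '<b>bold</b> trailing text';
my $HtmlR = \$Html;

# Consume the opening tag, as a parser might. The /g flag records the end of
# the match in pos(); /c keeps pos() intact when a later match fails.
$$HtmlR =~ m/\G<b>/gc or die "expected <b>";
my $AfterOpenTag = pos($$HtmlR);

# Client code can resume from the same offset by anchoring with \G.
my $Text      = $$HtmlR =~ m/\G([^<]*)/gc ? $1 : '';
my $AfterText = pos($$HtmlR);

# A failed match with /c leaves pos() where it was.
my $Matched          = $$HtmlR =~ m/\G<i>/gc;
my $AfterFailedMatch = pos($$HtmlR);

print "text='$Text' offsets=$AfterOpenTag,$AfterText,$AfterFailedMatch\n";
# prints: text='bold' offsets=3,7,7
```

Matching in scalar context matters here: in list context, m//g would keep matching until it fails rather than stopping after one step.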
HTML::Defang can defang whole tags, any attribute in a tag, any
URL that appears as an attribute or style property, or any CSS declaration in
a declaration block in a style rule. This makes it possible to block only the
most specific unwanted element in the content (for example, just an
offending attribute instead of the whole tag), while retaining any safe
HTML/CSS.
- HTML::Defang->new(%Options)
- Constructs a new HTML::Defang object. The following options are
supported:
- Options
- tags_to_callback
- Array reference of tags for which a callback should be made. If a tag in
this array is parsed, the subroutine tags_callback() is
invoked.
- attribs_to_callback
- Array reference of tag attributes for which a callback should be made. If
an attribute in this array is parsed, the subroutine
attribs_callback() is invoked.
- tags_callback
- Subroutine reference to be invoked when a tag listed in @$tags_to_callback
is parsed.
- attribs_callback
- Subroutine reference to be invoked when an attribute listed in
@$attribs_to_callback is parsed.
- url_callback
- Subroutine reference to be invoked when a URL is detected in an HTML tag
attribute or a CSS property.
- css_callback
- Subroutine reference to be invoked when CSS data is found either as the
contents of a 'style' attribute in an HTML tag, or as the contents of a
<style> HTML tag.
- content_callback
- Subroutine reference to be invoked when standard content between HTML tags
is found.
- fix_mismatched_tags
- This property, if set, fixes mismatched tags in the HTML input. By
default, tags present in the default
%mismatched_tags_to_fix hash are fixed. This set
of tags can be overridden by passing in an array reference
$mismatched_tags_to_fix to the constructor. Any
opened tags in the set are automatically closed if no corresponding
closing tag is found. If an unbalanced closing tag is found, it is
commented out.
- mismatched_tags_to_fix
- Array reference of tags for which the code would check for matching
opening and closing tags. See the property
$fix_mismatched_tags.
- context
- You can pass an arbitrary scalar as a 'context' value that's then passed
as the first parameter to all callback functions. Most commonly this is
something like '$Self'.
- allow_double_defang
- If this is true, then tag names and attribute names which already begin
with the defang string ("defang_" by default) will have an
additional copy of the defang string prepended if they are flagged to be
defanged by the return value of a callback, or if the tag or attribute
name is unknown.
The default is to assume that tag names and attribute names
beginning with the defang string are already made safe, and need no
further modification, even if they are flagged to be defanged by the
return value of a callback. Any tag or attribute modifications made
directly by a callback are still performed.
- delete_defang_content
- Normally defanged tags are turned into comments and prefixed by defang_,
and defanged styles are surrounded by /* ... */. If this is set to true,
then defanged content is deleted instead.
- Debug
- If set, prints debugging output.
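The allow_double_defang rule described above can be pictured with a small stand-alone sketch. The helper name defang_name and its arguments are hypothetical and are not part of the HTML::Defang API. Only the "defang_" prefix and the described behaviour come from the text.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Defang): returns the name a tag or
# attribute would carry after being flagged for defanging.
sub defang_name {
    my ($Name, $AllowDoubleDefang, $DefangString) = @_;
    $DefangString = 'defang_' unless defined $DefangString;

    # Default behaviour: a name that already starts with the defang string
    # is assumed to be safe and is left alone.
    return $Name
        if !$AllowDoubleDefang && index($Name, $DefangString) == 0;

    return $DefangString . $Name;
}

print defang_name('script', 0), "\n";          # defang_script
print defang_name('defang_script', 0), "\n";   # defang_script
print defang_name('defang_script', 1), "\n";   # defang_defang_script
```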
- HTML::Defang->new_bodyonly(%Options)
- Constructs a new HTML::Defang object that has the following implicit
options
Basically this is an easy way to remove all HTML boilerplate
content and return only the HTML body content.
- COMMON PARAMETERS
- A number of the callbacks share the same parameters. These common
parameters are documented here. Certain variables may have specific
meanings in certain callbacks, so be sure to check the documentation for
that method first before referring to this section.
- $context
- You can pass an arbitrary scalar as a 'context' value that's then passed
as the first parameter to all callback functions. Most commonly this is
something like '$Self'.
- $Defang
- Current HTML::Defang instance
- $OpenAngle
- The opening angle bracket (<) of the current tag.
- $lcTag
- Lower case version of the HTML tag that is currently being parsed.
- $IsEndTag
- Has the value '/' if the current tag is a closing tag.
- $AttributeHash
- A reference to a hash containing the attributes of the current tag and
their values. Each value is a scalar reference to the value, rather than
just a scalar value. You can add attributes (remember to make it a scalar
ref, e.g. $AttributeHash->{"newattr"} =
\"newval"), delete attributes, or modify attribute values in
this hash, and any changes you make will be incorporated into the output
HTML stream.
The attribute values will have any entity references decoded
before being passed to you, and any unsafe values will be re-encoded back
into the HTML stream.
So for instance, the tag:
<div title="&lt;&quot;Hi there &lt;">
Will have the attribute hash:
{ title => \q[<"Hi there <] }
And will be turned back into the HTML on output:
<div title="&lt;&quot;Hi there &lt;">
- $CloseAngle
- Anything after the end of the last attribute, including the closing
angle bracket (>).
- $HtmlR
- A scalar reference to the input HTML. The input HTML is parsed using
m/\G$SomeRegex/c constructs, so clients can use m/\G$SomeRegex/c to
continue processing the input from where HTML::Defang left off. One can
also use the pos() function to determine where HTML::Defang left off.
This, combined with the add_to_output() method, should give the client
reasonable flexibility in processing the input.
- $OutR
- A scalar reference to the processed output HTML so far.
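The scalar-reference convention described for $AttributeHash can be sketched in plain Perl as follows. The hash built here is only a stand-in for what a callback receives, and the attribute names and values are made up.

```perl
use strict;
use warnings;

# Stand-in for the hash reference a callback receives: every value is a
# reference to a scalar, as described for $AttributeHash.
my ($Title, $Border) = (q[<"Hi there <], '3');
my $AttributeHash = { title => \$Title, border => \$Border };

# Modify a value in place by assigning through the reference.
${ $AttributeHash->{border} } = '0';

# Add an attribute: the value must be a scalar reference.
$AttributeHash->{newattr} = \"newval";

# Delete an attribute.
delete $AttributeHash->{title};

# Read values by dereferencing.
for my $Key (sort keys %$AttributeHash) {
    print "$Key=${ $AttributeHash->{$Key} }\n";
}
# prints:
# border=0
# newattr=newval
```

Because $AttributeHash is itself a reference, elements are reached with the arrow form ($AttributeHash->{...}).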
- tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR)
- If $Defang->{tags_callback} exists, and
HTML::Defang has parsed a tag present in
$Defang->{tags_to_callback}, the above callback
is made to the client code. The return value of this method determines
whether the tag is defanged or not. More details below.
- attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal, $HtmlR, $OutR)
- If $Defang->{attribs_callback} exists, and
HTML::Defang has parsed an attribute present in
$Defang->{attribs_to_callback}, the above
callback is made to the client code. The return value of this method
determines whether the attribute is defanged or not. More details
below.
- Method parameters
- $lcAttrKey
- Lower case version of the HTML attribute that is currently being
parsed.
- $AttrVal
- Reference to the HTML attribute value that is currently being parsed.
See $AttributeHash for details of
decoding.
- Return values
- DEFANG_NONE
- The current attribute will not be defanged.
- DEFANG_ALWAYS
- The current attribute will be defanged.
- DEFANG_DEFAULT
- The current attribute will be processed normally by HTML::Defang as if
there was no callback method specified.
- url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal, $AttributeHash, $HtmlR, $OutR)
- If $Defang->{url_callback} exists, and
HTML::Defang has parsed a URL, the above callback is made to the client
code. The return value of this method determines whether the attribute
containing the URL is defanged or not. URL callbacks can be made from
<style> tags as well as style attributes, in which case the particular
style declaration will be commented out. More details below.
- Method parameters
- $lcAttrKey
- Lower case version of the HTML attribute that is currently being parsed.
However if this callback is made as a result of parsing a URL in a style
attribute, $lcAttrKey will be set to the string
style, or will be set to undef if this callback is made as a
result of parsing a URL inside a style tag.
- $AttrVal
- Reference to the URL value that is currently being parsed.
- $AttributeHash
- A reference to a hash containing the attributes of the current tag and
their values. Each value is a scalar reference to the value, rather than
just a scalar value. You can add attributes (remember to make it a scalar
ref, e.g. $AttributeHash->{"newattr"} =
\"newval"), delete attributes, or modify attribute values in
this hash, and any changes you make will be incorporated into the output
HTML stream. It is set to undef if the callback is made for a
URL in a <style> tag or style attribute.
- Return values
- DEFANG_NONE
- The current URL will not be defanged.
- DEFANG_ALWAYS
- The current URL will be defanged.
- DEFANG_DEFAULT
- The current URL will be processed normally by HTML::Defang as if there was
no callback method specified.
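A substring test on the whole URL can also be satisfied by text in the path or query string. As a minimal sketch, a URL decision could compare only the host portion. The helper classify_url, the host lists, and the numeric values of the three constants are stand-ins chosen so the example runs without HTML::Defang, and the regex is not a complete URL parser.

```perl
use strict;
use warnings;

# Stand-in values so the sketch runs without HTML::Defang; real code would
# use the constants provided by the module instead.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical host lists for illustration.
my %AllowedHost = map { $_ => 1 } qw(safesite.com www.safesite.com);
my %BlockedHost = map { $_ => 1 } qw(evilsite.com);

# Hypothetical helper: takes a reference to the URL, like $AttrVal.
sub classify_url {
    my ($UrlR) = @_;

    # Capture the host of an http(s) URL. The host must be followed by an
    # optional numeric port and then a path, query, fragment or the end of
    # the string, so "user:pass@host" forms do not match.
    my ($Host) = $$UrlR =~ m{\A\s*https?://([^/?#:@\s]+)(?::\d+)?(?:[/?#]|\z)}i;
    return DEFANG_DEFAULT unless defined $Host;

    $Host = lc $Host;
    return DEFANG_NONE   if $AllowedHost{$Host};
    return DEFANG_ALWAYS if $BlockedHost{$Host};
    return DEFANG_DEFAULT;
}

print classify_url(\'http://safesite.com/page'), "\n";             # 0
print classify_url(\'http://evilsite.com/?ref=safesite.com'), "\n"; # 1
```

Anything the sketch does not recognise falls through to DEFANG_DEFAULT, leaving the decision to the normal processing.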
- css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag, $IsAttr, $OutR)
- If $Defang->{css_callback} exists, and
HTML::Defang has parsed a <style> tag or style attribute, the above
callback is made to the client code. The return value of this method
determines whether a particular declaration in the style rules is defanged
or not. More details below.
- Method parameters
- $Selectors
- Reference to an array containing the selectors in a style tag or
attribute.
- $SelectorRules
- Reference to an array containing the style declaration blocks of all
selectors in a style tag or attribute. Consider the below CSS:
a { b:c; d:e}
j { k:l; m:n}
The declaration blocks will get parsed into the following data
structure:
[
  [
    [ "b", "c", DEFANG_DEFAULT ],
    [ "d", "e", DEFANG_DEFAULT ]
  ],
  [
    [ "k", "l", DEFANG_DEFAULT ],
    [ "m", "n", DEFANG_DEFAULT ]
  ]
]
So, generally each property:value pair in a declaration is
parsed into an array of the form
["property", "value", X]
where X can be DEFANG_NONE, DEFANG_ALWAYS or DEFANG_DEFAULT,
with DEFANG_DEFAULT as the initial value. A client can change this value
to tell HTML::Defang how to treat the property:value pair.
DEFANG_NONE - Do not defang
DEFANG_ALWAYS - Defang the property:value pair
DEFANG_DEFAULT - Process this as if there is no callback
specified
- $IsAttr
- True if the currently processed item is a style attribute. False if the
currently processed item is a style tag.
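As a minimal sketch, the following walks the $SelectorRules layout exactly as it is documented above for the example CSS and flags individual declarations. The constant values are stand-ins so that the code runs without HTML::Defang, and the matching conditions are arbitrary. The shape passed by a given version of the module is worth confirming (for example with Data::Dumper) before relying on it.

```perl
use strict;
use warnings;

# Stand-in values so the sketch runs without HTML::Defang.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# The example CSS "a { b:c; d:e} j { k:l; m:n}" in the documented layout:
# one array of [property, value, flag] triples per selector.
my $Selectors     = [ 'a', 'j' ];
my $SelectorRules = [
    [ [ 'b', 'c', DEFANG_DEFAULT ], [ 'd', 'e', DEFANG_DEFAULT ] ],
    [ [ 'k', 'l', DEFANG_DEFAULT ], [ 'm', 'n', DEFANG_DEFAULT ] ],
];

# Flag every declaration whose property is "d" or whose value is "l".
for my $i (0 .. $#$Selectors) {
    for my $Declaration (@{ $SelectorRules->[$i] }) {
        my ($Property, $Value) = @$Declaration;
        $Declaration->[2] = DEFANG_ALWAYS
            if $Property eq 'd' || $Value eq 'l';
    }
}

# Show the resulting flag for each declaration.
for my $i (0 .. $#$Selectors) {
    print "$Selectors->[$i]: $_->[0]:$_->[1] => $_->[2]\n"
        for @{ $SelectorRules->[$i] };
}
```

The flag is changed in place in the third element of each triple, which is how the text above says a client communicates its decision.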
- PUBLIC METHODS
- defang($InputHtml, \%Opts)
- Cleans up $InputHtml of any executable code
including scripting, embedded objects, applets, etc., and defangs any XSS
attacks.
Returns the cleaned HTML. If fix_mismatched_tags is set, any unbalanced
tags that appear in @$mismatched_tags_to_fix are
automatically commented out or closed.
- add_to_output($String)
- Appends $String to the output after the current
parsed tag ends. Can be used by client code in callback methods to add
HTML text to the processed output. If the HTML text needs to be defanged,
client code can safely call HTML::Defang->defang() recursively
from within the callback.
- Method parameters
- $String
- The string that is added after the current parsed tag ends.
- INTERNAL METHODS
- Generally these methods never need to be called by users of the class,
because they'll be called internally as the appropriate tags are
encountered, but they may be useful for some users in some cases.
- defang_script_tag($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag, $TagTrail, $Attributes, $CloseAngle)
- This method is invoked when a <script> tag is parsed. Defangs the
<script> opening tag, and any closing tag. Any scripting content is
also commented out, so browsers don't display them.
Returns 1 to indicate that the <script> tag must be
defanged.
- Method parameters
- $OutR
- A reference to the processed output HTML before the tag that is currently
being parsed.
- $HtmlR
- A scalar reference to the input HTML.
- $TagOps
- Indicates what operation should be done on a tag. Can be undefined,
integer or code reference. Undefined indicates an unknown tag to
HTML::Defang, 1 indicates a known safe tag, 0 indicates a known unsafe
tag, and a code reference indicates a subroutine that should be called to
parse the current tag. For example, <style> and <script> tags
are parsed by dedicated subroutines.
- $OpenAngle
- The opening angle bracket (<) of the current tag.
- $IsEndTag
- Has the value '/' if the current tag is a closing tag.
- $Tag
- The HTML tag that is currently being parsed.
- $TagTrail
- Any space after the tag, but before attributes.
- $Attributes
- A reference to an array of the attributes and their values, including any
surrounding spaces. Each element of the array is added by a 'push' call like
the one below.
push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];
- $CloseAngle
- Anything after the end of the last attribute, including the closing
angle bracket (>).
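As a small illustration of the $Attributes layout described above, the sketch below builds two entries by hand and joins them back into attribute text. The tag, attribute names and values are made up, and the sketch assumes all seven fields of each entry are defined.

```perl
use strict;
use warnings;

# Hypothetical parsed attributes for: <img src="a.png" alt = 'x' >
# Field order: name, space before '=', '=' plus following space,
# quote, value, quote, trailing space.
my $Attributes = [
    [ 'src', '',  '=',  '"', 'a.png', '"', ' ' ],
    [ 'alt', ' ', '= ', "'", 'x',     "'", ' ' ],
];

# Because each entry keeps its surrounding whitespace and quotes, joining
# the seven fields reproduces the original attribute text.
my $AttributeText = join '', map { join '', @$_ } @$Attributes;

print "<img $AttributeText>\n";   # <img src="a.png" alt = 'x' >
```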
- defang_style_text($Content, $lcTag, $IsAttr, $AttributeHash, $HtmlR, $OutR)
- Defangs some raw CSS data and returns the defanged content.
- Method parameters
- $Content
- The input style string that is defanged.
- $IsAttr
- True if $Content is from a style attribute, false if it is
from a <style> block.
- cleanup_style($StyleString)
- Helper function to clean up CSS data. This function directly operates on
the input string without taking a copy.
- defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr, $AttributeHash, $HtmlR, $OutR)
- Defangs style data.
- Method parameters
- $SelectorsIn
- An array reference to the selectors in the style tag/attribute
contents.
- $StyleRules
- An array reference to the declaration blocks in the style tag/attribute
contents.
- $lcTag
- Lower case version of the HTML tag that is currently being parsed.
- $IsAttr
- Whether we are currently parsing a style attribute or style tag.
$IsAttr will be true if we are currently parsing a
style attribute.
- $HtmlR
- A scalar reference to the input HTML.
- $OutR
- A scalar reference to the processed output so far.
- defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag, $TagTrail, $Attributes, $CloseAngle)
- Defangs attributes, defangs tags, does tag, attrib, css and url
callbacks.
- Method parameters
- For a description of the method parameters, see the documentation of
the defang_script_tag() method.
- cleanup_attribute($AttributeString)
- Helper function to clean up attributes.
<http://mailtools.anomy.net/>,
<http://htmlcleaner.sourceforge.net/>, HTML::StripScripts,
HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber
Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to
Rob Mueller <cpan@robm.fastmail.fm> for initial code, guidance and
support and bug fixes.
Copyright (C) 2003-2013 by FastMail Pty Ltd
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.