This command creates DOM trees in memory. In the
usual case a string containing XML information is parsed and converted
into a DOM tree. Other possible parse inputs are HTML or JSON. The
method indicates a specific subcommand.
dom parse $xml doc
$doc documentElement root
parses the XML in the variable xml, creates the DOM tree in
memory, makes a reference to the document object, visible in Tcl as a
document object command, and assigns this new object name to the variable
doc. When doc gets freed, the DOM tree and the associated Tcl command objects
(document and all node objects) are freed automatically.
set document [dom parse $xml]
set root [$document documentElement]
parses the XML in the variable xml, creates the DOM tree in
memory, makes a reference to the document object, visible in Tcl as a
document object command, and returns this new object name, which is then
stored in document. To free the underlying DOM tree and the
associated Tcl object commands (document + nodes + fragment nodes) the
document object command has to be explicitly deleted by:
$document delete
or
rename $document ""
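The two calling conventions described above can be compared side by side. The following is a minimal sketch, assuming the tdom package is installed; the sample XML and the helper proc countItems are made up for illustration.

```tcl
package require tdom

# Illustrative input, not taken from the original text.
set xml {<order id="42"><item>tea</item><item>milk</item></order>}

# Form 1: pass a variable name. The tree is tied to the variable "doc",
# so it is freed when the variable goes away (here: when the proc returns).
proc countItems {xml} {
    dom parse $xml doc
    $doc documentElement root
    return [llength [$root selectNodes item]]
}
set n [countItems $xml]

# Form 2: keep the returned command name. Cleanup is then manual.
set document [dom parse $xml]
set root [$document documentElement]
set name [$root nodeName]
set id [$root getAttribute id]
$document delete
```

The first form suits short-lived trees inside a proc; the second suits trees that must outlive the scope that created them.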
The valid options are:
- -simple
- If -simple is specified, a simple but fast parser is used (it does
not fully conform to the XML recommendation). That should double parsing
and DOM generation speed. The encoding of the data is not transformed
inside the parser. The simple parser does not respect any encoding
information in the XML declaration. It skips over the internal DTD subset
and ignores any information in it. Therefore it doesn't include defaulted
attribute values into the tree, even if the corresponding attribute
declaration is in the internal subset. It also doesn't expand internal or
external entity references other than the predefined entities and
character references.
- -html
- If -html is specified, a fast HTML parser is used, which tries to
parse even badly formed HTML into a DOM tree. If the HTML document given
to parse does not have a single root element (as was legal up to HTML
4.01) and the -forest option is not used, then an html node will be
inserted as document element, with the top level elements of the HTML
input data as children.
- -html5
- This option is only available if tDOM was built with --enable-html5. Try
the featureinfo method if you need to know whether this feature is built
in. If -html5 is specified, the gumbo lib HTML5 parser
(https://github.com/google/gumbo-parser) is used to build the DOM tree.
This is, as far as it goes, XML namespace-aware. Since many users probably
don't want this and it only adds a burden without benefit in many use
cases, -html5 can be combined with -ignorexmlns, in which
case all nodes and attributes in the DOM tree are not in an XML namespace.
All tag and attribute names in the DOM tree will be lower case, even for
foreign elements not in the xhtml, svg or mathml namespace. The DOM tree
may include nodes that the parser inserted because they are implied by
the context (such as <head>, <tbody>, etc.).
- -json
- If -json is specified, the data is expected to be a valid
JSON string (according to RFC 7159). The command returns an ordinary DOM
document with the nesting tokens inside the JSON data translated into tree
hierarchy. If a JSON array value is itself an array or object then
container element nodes named (in a default build) arraycontainer or
objectcontainer, respectively, are inserted into the tree. The JSON
serialization of this document (with the domDoc method asJSON) is
the same JSON information as the data, preserving JSON datatypes,
allowing non-unique member names of objects while preserving their order
and the full range of JSON string values. JSON datatype handling is done
with an additional property that "sticks" to the doc and tree
nodes. This property isn't contained in an XML serialization of the
document. If you need to store the JSON data represented by a document,
store the JSON serialization and parse it back from there. Apart from this
JSON type information the returned doc command or handle is an ordinary
DOM doc, which may be investigated or modified with the full range of the
doc and node methods. Please note that the element node names and the text
node values within the tree may be outside of what the appropriate XML
productions allow.
- -jsonroot
<document element name>
- If given, makes the given element name the document element of the
resulting doc. The parsed content of the JSON string will be the children
of this document element node.
- -jsonmaxnesting
integer
- This option only has effect if used together with the -json option.
The current implementation uses a recursive descent JSON parser. In order
to avoid using excess stack space, any JSON input that has more than a
certain number of nesting levels is considered invalid. The default maximum
nesting is 2000. The option -jsonmaxnesting allows the user to adjust
that.
- --
- The option -- marks the end of options. While respected in general,
this option is only needed in case of parsing JSON data, which may start
with a "-".
- -keepEmpties
- If -keepEmpties is specified then text nodes which contain only
whitespace will be part of the resulting DOM tree. By default
(-keepEmpties not given) those empty text nodes are removed at
parsing time.
- -keepCDATA
- If -keepCDATA is specified then CDATA sections aren't added to the
tree as text nodes (and, if necessary, combined with sibling text nodes
into one text node), as they are without this option, but are added as
CDATA_SECTION_NODEs to the tree. Please note that the resulting tree isn't
prepared for XPath selects or to be the source or the stylesheet of an
XSLT transformation. If not combined with -keepEmpties, only CDATA
sections that are not whitespace-only will be added to the resulting DOM
tree.
- -channel
<channel-ID>
- If -channel <channel-ID> is specified, the input to be parsed
is read from the specified channel. The encoding setting of the channel
(via fconfigure -encoding) is respected, i.e. the data read from the
channel is converted to UTF-8 according to the encoding settings before
it is parsed.
- -baseurl
<baseURI>
- If -baseurl <baseURI> is specified, the baseURI is used as
the base URI of the document. External entity references in the document
are resolved relative to this base URI. This base URI is also stored
within the DOM tree.
- -feedbackAfter
<#bytes>
- If -feedbackAfter <#bytes> is specified, the Tcl command
given by -feedbackcmd is evaluated at the first element start
within the document (or an external entity) after #bytes have been read
since the start of the document or external entity, or since the last
such call. For backward compatibility, if no -feedbackcmd is given but
there is a Tcl proc named ::dom::domParseFeedback, this proc is used as
-feedbackcmd. If there isn't such a proc and -feedbackAfter is used, it
is an error not to also use -feedbackcmd. If the called script raises an
error, parsing is aborted and the dom parse call returns an error,
with the script's error message as error message. If the called script
returns -code break, parsing is aborted and the dom parse call
returns the empty string.
- -feedbackcmd
<script>
- If -feedbackcmd <script> is specified, the script
script is evaluated at the first element start within the document
(or an external entity) after the number of bytes given by the
-feedbackAfter option have been read since the start of the document
or external entity, or since the last such call. If -feedbackAfter
isn't given, using this option doesn't have any effect. If the called
script raises an error, parsing is aborted and the dom parse call
returns an error, with the script's error message as error message. If
the called script returns -code break, parsing is aborted and the
dom parse call returns the empty string.
- -externalentitycommand
<script>
- If -externalentitycommand <script> is specified, the
specified Tcl script is called to resolve any external entities of the
document. The actual evaluated command consists of this script followed by
three arguments: the base URI, the system identifier of the entity and the
public identifier of the entity. The base URI and the public identifier
may be the empty list. The script has to return a Tcl list consisting of
three elements. The first element of this list signals how the external
entity is returned to the processor. Currently the two allowed types are
"string" and "channel". The second element of the list
has to be the (absolute) base URI of the external entity to be parsed. The
third element of the list is the data: either the data already read from
the external entity as a string, in the case of type "string", or
the name of a Tcl channel, in the case of type "channel". Note
that if the script returns a Tcl channel, it will not be closed by the
processor. It must be closed separately if it is no longer needed.
- -useForeignDTD
<boolean>
- If <boolean> is true and the document does not have an external
subset, the parser will call the -externalentitycommand script with empty
values for the systemId and publicID arguments. Please note that if the
document also doesn't have an internal subset, the
-startdoctypedeclcommand and -enddoctypedeclcommand scripts, if set, are
not called.
- -paramentityparsing
<always|never|notstandalone>
- The -paramentityparsing option controls whether the parser tries to
resolve the external entities (including the external DTD subset) of the
document while building the DOM tree. -paramentityparsing requires
an argument, which must be either "always", "never",
or "notstandalone". The value "always" means that the
parser tries to resolve (recursively) all external entities of the XML
source. This is the default in case -paramentityparsing is omitted.
The value "never" means that only the given XML source is parsed
and no external entity (including the external subset) will be resolved
and parsed. The value "notstandalone" means that all external
entities will be resolved and parsed, with the exception of documents
which explicitly state standalone="yes" in their XML
declaration.
- -forest
- If this option is given, there is no need for a single root; any sequence
of well-formed, balanced subtrees will be parsed into a DOM tree. This
works for the expat DOM builder, the simple XML parser enabled with
-simple and the simple HTML parser enabled with -html. If
used together with -json or -html5 this option is
ignored.
- -ignorexmlns
- It is recommended that you only use this option with the -html5
option. If this option is given, no node within the created DOM tree will
be internally marked as placed into an XML namespace, even if there is a
default namespace in scope for un-prefixed elements or even if the element
has a defined namespace prefix. One consequence is that XPath node
expressions on such a DOM tree don't work as may be expected. Prefixed
element nodes can't be selected naively and element nodes without prefix
will be seen by XPath expressions as if they are not in any namespace (no
matter whether they should in fact be in a default namespace). If you need
to inject prefixed node names into an XPath expression use the '%' syntax
described in the documentation of the domNode command method
selectNodes.
- -billionLaughsAttackProtectionMaximumAmplification
<float>
- This option together with
-billionLaughsAttackProtectionActivationThreshold gives control over the
parser limits that protect against billion laughs attacks
(https://en.wikipedia.org/wiki/Billion_laughs_attack). This option expects
a float >= 1.0 as argument. You should never need to use this option,
because the default value (100.0) should work for any real data. If you
ever need to increase this value for a non-attack payload, please report
it.
- -billionLaughsAttackProtectionActivationThreshold
<long>
- This option together with
-billionLaughsAttackProtectionMaximumAmplification gives control over the
parser limits that protect against billion laughs attacks
(https://en.wikipedia.org/wiki/Billion_laughs_attack). This option expects
a positive integer as argument. You should never need to use this option,
because the default value (8388608) should work for any real data. If you
ever need to increase this value for a non-attack payload, please report
it.
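Several of the parse options listed above can be tried in isolation. The following is a minimal sketch, assuming the tdom package is installed and built with default settings; all input data, the base URL and the resolver proc resolveFromString are hypothetical and only serve as illustration.

```tcl
package require tdom

# -json: JSON members become nodes of an ordinary DOM document.
set json {{"name":"tea","price":3.5,"tags":["hot","green"]}}
set jdoc [dom parse -json $json]
set teaName [$jdoc selectNodes {string(/name)}]
set roundTrip [$jdoc asJSON]
$jdoc delete

# "--" ends option processing, so JSON data starting with "-" is safe.
set ndoc [dom parse -json -- -17]
set negative [$ndoc asJSON]
$ndoc delete

# -keepEmpties: whitespace-only text nodes are kept in the tree.
set xml "<a>\n  <b/>\n</a>"
set d1 [dom parse $xml]
set withoutEmpties [llength [[$d1 documentElement] childNodes]]
set d2 [dom parse -keepEmpties $xml]
set withEmpties [llength [[$d2 documentElement] childNodes]]
$d1 delete
$d2 delete

# -keepCDATA: CDATA sections stay CDATA_SECTION_NODEs.
set cdoc [dom parse -keepCDATA {<a><![CDATA[x < y]]></a>}]
set cdataType [[[$cdoc documentElement] firstChild] nodeType]
$cdoc delete

# -externalentitycommand: a hypothetical resolver that answers every
# request with a fixed string instead of reading from disk or network.
proc resolveFromString {base systemId publicId} {
    return [list string $base {<!ENTITY greeting "hello">}]
}
set edoc [dom parse -baseurl file:///tmp/doc.xml \
              -externalentitycommand resolveFromString \
              {<!DOCTYPE r SYSTEM "r.dtd"><r>&greeting;</r>}]
set entityText [[$edoc documentElement] text]
$edoc delete
```

A resolver that returns the type "channel" instead would have to close the channel itself after parsing, as noted in the description of -externalentitycommand.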
If a command created with createNodeCmd is invoked inside a script given
as argument to the domNode method appendFromScript or
insertBeforeFromScript, it creates a new node and appends this node at the
end of the child list of the invoking element node. If the option
-returnNodeCmd was given, the command returns the created node as a Tcl
command. If this option was omitted, the command returns nothing. Each
command always creates the same type of node. Which type of node is
created by the command is determined by the first argument to
createNodeCmd. The syntax of the created
command depends on the type of the node it creates.
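The interplay of createNodeCmd, appendFromScript and -returnNodeCmd can be sketched as follows. This is a minimal example, assuming the tdom package is installed; the command names para, section and t and the document element article are chosen freely for illustration.

```tcl
package require tdom

# Create the node commands once; they can be reused for any document.
dom createNodeCmd elementNode para
dom createNodeCmd -returnNodeCmd elementNode section
dom createNodeCmd textNode t

set doc [dom createDocument article]
set root [$doc documentElement]

$root appendFromScript {
    # para returns nothing; the node is simply appended to $root.
    para { t "first" }
    # section was created with -returnNodeCmd, so the new node comes
    # back as a Tcl command and can be modified afterwards.
    set sec [section { para { t "nested" } }]
    $sec setAttribute id s1
}

set result [$root asXML -indent none]
```

Because the script is ordinary Tcl, loops and conditionals can be mixed freely with the node commands.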
If the command type to create is elementNode, the created
command will create an element node when called. Without the -tagName
option the tag name of the created node is commandName without Tcl
namespace qualifiers. If the -tagName option was given, the
created elements will have this tag name. If the
-jsonType option was given, the created element nodes will have
the given JSON type. If the -namespace option is given, the created
element node will be XML namespaced and in the namespace given by the
option. The element name will be taken literally as given either by the
command name or by the -tagName option, if that was given. An appropriate
XML namespace declaration will be automatically added, to bind the prefix
(if the element name has one) or the default namespace (if the element
name hasn't a prefix) to the namespace if such a binding isn't in scope.
The syntax of the created command is:
elementNodeCmd ?attributeName attributeValue ...? ?script?
elementNodeCmd ?-attributeName attributeValue ...? ?script?
elementNodeCmd name_value_list script
The command syntax allows three different ways to specify the
attributes of the resulting element. These could be specified with
attributeName attributeValue argument pairs, in an
"option style" way with -attributeName attributeValue
argument pairs (the '-' character is only syntactical sugar and will be
stripped off) or as a Tcl list with elements interpreted as attribute name
and the corresponding attribute value. The attribute name elements in the
list may have a leading '-' character, which will be stripped off.
Every elementNodeCmd accepts an optional Tcl script as its last
argument. This script is evaluated as a recursive appendFromScript
script with the node created by the elementNodeCmd as parent of all
nodes created by the script.
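The three attribute syntaxes and the nested script argument can be seen next to each other in a short sketch. It assumes the tdom package is installed; the command names a, listItem and t, the tag name li and the attribute values are illustrative only.

```tcl
package require tdom

dom createNodeCmd elementNode a
# -tagName decouples the Tcl command name from the element name.
dom createNodeCmd -tagName li elementNode listItem
dom createNodeCmd textNode t

set doc [dom createDocument ul]
set root [$doc documentElement]

$root appendFromScript {
    # 1. plain name/value pairs, followed by the script
    listItem { a href /one { t "pairs" } }
    # 2. option style; the leading "-" is stripped from the name
    listItem { a -href /two { t "option style" } }
    # 3. one Tcl list holding all names and values, then the script
    listItem { a {href /three class ext} { t "list" } }
}

set itemCount [llength [$root selectNodes li]]
set secondHref [$root selectNodes {string(li[2]/a/@href)}]
set thirdClass [$root selectNodes {string(li[3]/a/@class)}]
```

The list form is convenient when the attributes are already available as a Tcl list or dict.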
If the first argument of the method is textNode, the
command will create a text node. If the -jsonType option was given
then the created text node will have that JSON type. The syntax of the
created command is:
textNodeCmd ?-disableOutputEscaping? data
If the optional flag -disableOutputEscaping is given, the
escaping of the ampersand character (&) and the left angle bracket
(<) inside the data is disabled. You should use this flag carefully.
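The effect of -disableOutputEscaping shows up only when the tree is serialized. A minimal sketch, assuming the tdom package is installed; the command names p and t and the sample data are illustrative.

```tcl
package require tdom

dom createNodeCmd elementNode p
dom createNodeCmd textNode t

set doc [dom createDocument div]
set root [$doc documentElement]

$root appendFromScript {
    # Default: "&" and "<" are escaped on output.
    p { t "a < b & c" }
    # With the flag the data is written out unchanged, so it is the
    # caller's job to make sure it is well-formed markup.
    p { t -disableOutputEscaping "<br/>" }
}

set result [$root asXML -indent none]
```

In the tree itself both nodes are plain text nodes; the second one does not become a br element, which is one reason to use the flag sparingly.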
If the first argument of the method is commentNode or
cdataNode, the command will create a comment node or CDATA section
node. The syntax of the created command is:
nodeCmd data
If the first argument of the method is piNode, the command
will create a processing instruction node. The syntax of the created command
is:
piNodeCmd target data
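The remaining node types can be created in the same way. The following is a minimal sketch, assuming the tdom package is installed and assuming that comment and CDATA commands take a single data argument while a processing instruction command takes a target followed by its data; the command names and content are illustrative.

```tcl
package require tdom

dom createNodeCmd commentNode comment
dom createNodeCmd cdataNode cdata
dom createNodeCmd piNode pi
dom createNodeCmd elementNode body

set doc [dom createDocument root]
set root [$doc documentElement]

$root appendFromScript {
    comment "generated"
    # assumed argument order: target first, then the data
    pi xml-stylesheet {type="text/css" href="s.css"}
    body { cdata "1 < 2" }
}

set firstType [[$root firstChild] nodeType]
set piCount [$root selectNodes {count(processing-instruction())}]
set cdataType [[[$root lastChild] firstChild] nodeType]
```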