| MsOffice::Word::Surgeon::PackagePart(3pm) | User Contributed Perl Documentation | MsOffice::Word::Surgeon::PackagePart(3pm) |
MsOffice::Word::Surgeon::PackagePart - Operations on a single part within the ZIP package of a docx document
my $part = $surgeon->document;
print $part->plain_text;
$part->replace(qr[$pattern], $replacement_callback);
$part->replace_image($image_alt_text, $image_PNG_content);
$part->unlink_fields;
$part->reveal_bookmarks;
This class is part of MsOffice::Word::Surgeon; it encapsulates operations for a single package part within the ZIP package of a ".docx" document. It is mostly used for the document part, which contains the XML representation of the main document body. However, other parts such as headers, footers, footnotes, etc. have the same internal representation, so the same operations can be invoked on them.
my $part = MsOffice::Word::Surgeon::PackagePart->new(
surgeon => $surgeon,
part_name => $name,
);
Constructor for a new part object. This is called internally from MsOffice::Word::Surgeon; it is not meant to be called directly by clients.
Constructor arguments
Other attributes
Other attributes, not passed through the constructor but generated lazily on demand, are :
Images without alternative text will not be accessible through the current Perl module.
Values of the hash are zip member names for the corresponding image representations in ".png" format.
contents
Returns a Perl string with the current internal XML representation of the part contents.
original_contents
Returns a Perl string with the XML representation of the part contents, as it was in the ZIP archive before any modification.
indented_contents
Returns an indented version of the XML contents, suitable for inspection in a text editor. This is produced by "toString" in XML::LibXML::Document and therefore is returned as an encoded byte string, not a Perl string.
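Because the result is an encoded byte string, it may need to be decoded before character-level operations are applied to it. The following is a minimal sketch using only the core Encode module; it assumes the bytes are UTF-8 encoded, and the variable $bytes is a hypothetical stand-in for the value that "indented_contents" would return.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the return value of $part->indented_contents:
# an encoded byte string (assumption: the encoding is UTF-8).
my $bytes = encode('UTF-8', "<w:t>caf\x{e9}</w:t>\n");

# Decode into a Perl character string before counting or matching characters.
my $chars = decode('UTF-8', $bytes);

# The accented letter occupies two bytes but counts as one character.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

When the result is only written to a file for inspection in a text editor, no decoding is needed; the bytes can be printed to a filehandle opened in raw mode.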
plain_text
Returns the text contents of the part, without any markup. Paragraphs and breaks are converted to newlines; all other formatting instructions are ignored.
runs
Returns a list of MsOffice::Word::Surgeon::Run objects. Each of these objects holds an XML fragment; joining all fragments restores the complete document.
my $contents = join "", map {$_->as_xml} $self->runs;
cleanup_XML
$part->cleanup_XML(%args);
Applies several other methods for removing unnecessary nodes within the internal XML. This method successively calls "reduce_all_noises", "unlink_fields", "suppress_bookmarks" and "merge_runs".
Currently there is only one legal arg :
reduce_noise
$part->reduce_noise($regex1, $regex2, ...);
This method is used for removing unnecessary information in the XML markup. It applies the given list of regexes to the whole document, suppressing matches. The final result is put back into "$self->contents". Regexes may be given either as "qr/.../" references, or as names of builtin regexes (described below). Regexes are applied to the whole XML contents, not only to run nodes.
noise_reduction_regex
my $regex = $part->noise_reduction_regex($regex_name);
Returns the builtin regex corresponding to the given name. Known regexes are :
proof_checking => qr(<w:(?:proofErr[^>]+|noProof/)>),
revision_ids => qr(\sw:rsid\w+="[^"]+"),
complex_script_bold => qr(<w:bCs/>),
page_breaks => qr(<w:lastRenderedPageBreak/>),
language => qr(<w:lang w:val="[^/>]+/>),
empty_run_props => qr(<w:rPr></w:rPr>),
soft_hyphens => qr(<w:softHyphen/>),
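To see what these regexes remove, the sketch below applies them to a small hand-written XML fragment using plain Perl substitutions. It does not call the module; the sample fragment and the order of application are assumptions made for illustration, and the regexes are copied from the list above.

```perl
use strict;
use warnings;

# Builtin regexes copied from the list above, kept in the listed order
# (the order in which the module applies them is an assumption).
my @noise = (
    [proof_checking      => qr(<w:(?:proofErr[^>]+|noProof/)>)],
    [revision_ids        => qr(\sw:rsid\w+="[^"]+")],
    [complex_script_bold => qr(<w:bCs/>)],
    [page_breaks         => qr(<w:lastRenderedPageBreak/>)],
    [language            => qr(<w:lang w:val="[^/>]+/>)],
    [empty_run_props     => qr(<w:rPr></w:rPr>)],
    [soft_hyphens        => qr(<w:softHyphen/>)],
);

# Hypothetical sample fragment, typical of markup produced by MsWord.
my $xml = '<w:r w:rsidR="00A1B2C3"><w:rPr><w:noProof/><w:bCs/></w:rPr>'
        . '<w:t>Hello</w:t></w:r><w:proofErr w:type="spellStart"/>';

# Suppress every match of every regex, over the whole XML string.
for my $entry (@noise) {
    my ($name, $regex) = @$entry;
    $xml =~ s/$regex//g;
}

print "$xml\n";
```

In this sketch "empty_run_props" only matches because the earlier regexes have already emptied the "w:rPr" element, which shows that the order of application can matter.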
reduce_all_noises
$part->reduce_all_noises;
Applies all regexes from the previous method.
merge_runs
$part->merge_runs(no_caps => 1); # optional arg
Walks through all runs of text within the document, trying to merge adjacent runs when possible (i.e. when both runs have the same properties, and there is no other XML node in between).
This operation is a prerequisite before performing replace operations, because documents edited in MsWord often have run boundaries across sentences or even in the middle of words; so regex searches can only be successful if those artificial boundaries have been removed.
If the argument "no_caps => 1" is present, the merge operation will also convert runs with the "w:caps" property, putting all letters into uppercase and removing the property; this makes more merges possible.
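The effect of artificial run boundaries on regex searches can be illustrated with plain Perl. The sketch below does not use the module; the substitution is a deliberately naive stand-in for merging (it only handles directly adjacent runs without properties) and is not the algorithm implemented by "merge_runs".

```perl
use strict;
use warnings;

# Hypothetical fragment: the word "Hello" is split across two runs.
my $xml = '<w:r><w:t>Hel</w:t></w:r><w:r><w:t>lo world</w:t></w:r>';

# The search fails because markup sits in the middle of the word.
my $found_before = $xml =~ /Hello/ ? 1 : 0;

# Naive merge: drop the boundary between two adjacent property-less runs.
(my $merged = $xml) =~ s{</w:t></w:r><w:r><w:t>}{}g;

# After merging, the word is contiguous and the search succeeds.
my $found_after = $merged =~ /Hello/ ? 1 : 0;

print "before: $found_before, after: $found_after\n";
```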
replace
$part->replace($pattern, $replacement, %replacement_args);
Replaces all occurrences of $pattern regex within the text nodes by the given $replacement. This is not exactly like a search-replace operation performed within MsWord, because the search does not cross boundaries of text nodes. In order to maximize the chances of successful replacements, the "cleanup_XML" method is automatically called before starting the operation.
The argument $pattern can be either a string or a reference to a regular expression. It should not contain any capturing parentheses, because that would perturb text splitting operations.
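The general Perl behaviour behind this restriction can be seen with the builtin "split": capturing parentheses in the separator pattern add the captured text to the returned list. The sketch below only illustrates that behaviour; whether the module relies on "split" internally is an assumption, and the sample text is hypothetical.

```perl
use strict;
use warnings;

my $text = 'Dear NAME, welcome NAME!';

# Without capturing parentheses: only the text between matches is returned.
my @plain = split /NAME/, $text;

# With capturing parentheses: the separators are inserted into the list too.
my @captured = split /(NAME)/, $text;

# Non-capturing groups (?:...) allow alternation without this side effect.
my @grouped = split /(?:NAME|name)/, $text;

printf "plain: %d, captured: %d, grouped: %d\n",
    scalar(@plain), scalar(@captured), scalar(@grouped);
```

When a pattern needs grouping, for instance for alternation, a non-capturing group "(?:...)" avoids the problem.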
The argument $replacement can be either a fixed string, or a reference to a callback subroutine that will be called for each match.
The %replacement_args hash can be used to pass information to the callback subroutine. That hash will be enriched with three entries :
The callback subroutine may return either plain text or structured XML. See "SYNOPSIS" in MsOffice::Word::Surgeon::Run for an example of a replacement callback.
The following special keys within %replacement_args are interpreted by the replace() method itself, and therefore are not passed to the callback subroutine :
$part->replace($pattern, $replacement, cleanup_args => [no_caps => 1]);
bookmark_boundaries
my $boundaries = $part->bookmark_boundaries;
my ($boundaries, $final_xml) = $part->bookmark_boundaries;
Parses the XML content to discover bookmark boundaries. In scalar context, returns an arrayref of MsOffice::Word::Surgeon::BookmarkBoundary objects. In list context, returns the arrayref followed by a plain string containing the final XML fragment.
suppress_bookmarks
$part->suppress_bookmarks(full_range => [qw/foo bar/], markup_only => qr/^_/);
Suppresses bookmarks according to the specified options :
Options may be specified as lists of strings, or regexes, or coderefs ... anything suitable to be compared through match::simple. In the absence of any options, the default is "markup_only => qr/./", meaning that all bookmark markup is suppressed.
Removing bookmarks is useful because MsWord may silently insert bookmarks in unexpected places; therefore some searches within the text may fail because of such bookmarks.
The "full_range" option is especially convenient for removing bookmarks associated with ASK fields. Such bookmarks contain ranges of text that are never displayed by MsWord.
reveal_bookmarks
$part->reveal_bookmarks(color => 'green');
Usually bookmark boundaries in MsWord are not visible; the only way to have a visual clue is to turn on an option in Advanced / Show document content / Show bookmarks <https://support.microsoft.com/en-gb/office/troubleshoot-bookmarks-9cad566f-913d-49c6-8d37-c21e0e8d6db0> -- but this only displays where bookmarks start and end, without the names of the bookmarks.
The reveal_bookmarks() method will insert a visible run before each bookmark start and after each bookmark end, showing the bookmark name. This is an interesting tool for documenting where bookmarks are located in an existing document.
Options to this method are :
fields
my $fields = $part->fields;
my ($fields, $final_xml) = $part->fields;
Parses the XML content to discover MsWord fields. In scalar context, returns an arrayref of MsOffice::Word::Surgeon::Field objects. In list context, returns the arrayref followed by a plain string containing the final XML fragment.
replace_fields
my $field_replacer = sub {my ($code, $result) = @_; return "...";};
$part->replace_fields($field_replacer);
Replaces MsWord fields by the product of the $field_replacer callback. The callback receives two arguments :
"IF { DOCPROPERTY foo } = "bar" "is bar" "is not bar"".
The callback should return an XML fragment suitable to be inserted within an MsWord run.
reveal_fields
$part->reveal_fields;
Replaces each field with a textual representation of its code instruction, embedded in curly braces.
unlink_fields
$part->unlink_fields;
Replaces each field with its current result, i.e. removing the code instruction. This is the equivalent of performing Ctrl-Shift-F9 in MsWord on the whole document.
replace_image
$part->replace_image($image_alt_text, $image_PNG_content);
Replaces an existing PNG image by a new image. All features of the old image will be preserved (size, positioning, border, etc.) -- only the image itself will be replaced. The $image_alt_text must correspond to the alternative text set in Word for this image.
This operation replaces a ZIP member within the ".docx" file. If several XML nodes refer to the same ZIP member, i.e. if the same image is displayed at several locations, the new image will appear at all locations, even if they do not have the same alternative text -- unfortunately this module currently has no facility for duplicating an existing image into separate instances. So if your intent is to only replace one instance of the image, your original document should contain several distinct copies of the ".PNG" file.
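The sketch below shows one way to obtain the image content from a file, on the assumption that $image_PNG_content is expected to hold the raw bytes of a PNG file. The helper name read_png_bytes is hypothetical and not part of this module; the temporary file merely stands in for a real image so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The 8-byte signature that starts every PNG file.
my $PNG_SIGNATURE = "\x89PNG\r\n\x1a\n";

# Hypothetical helper: read a file in raw mode and check the PNG signature.
sub read_png_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    die "$path does not look like a PNG file"
        unless substr($bytes, 0, 8) eq $PNG_SIGNATURE;
    return $bytes;
}

# Stand-in file: a PNG signature followed by dummy bytes (not a valid image).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} $PNG_SIGNATURE . "dummy";
close $tmp_fh;

my $image_PNG_content = read_png_bytes($tmp_path);
printf "read %d bytes\n", length $image_PNG_content;
```

Reading in raw mode matters because text-mode layers could alter the bytes of the image.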
add_image
my $rId = $part->add_image($image_PNG_content);
Stores the given PNG image within the ZIP file, adds it as a relationship to the current part, and returns the relationship id. This operation is not sufficient to make the image visible in Word : it just stores the image, but you still have to insert a proper "drawing" node in the contents XML, using the $rId. Future versions of this module may offer helper methods for that purpose; currently it must be done by hand.
Laurent Dami, <dami AT cpan DOT org>
Copyright 2019-2024 by Laurent Dami.
This program is free software, you can redistribute it and/or modify it under the terms of the Artistic License version 2.0.
| 2025-05-16 | perl v5.40.1 |