XTRACT(1) NCBI Entrez Direct User's Manual XTRACT(1)

xtract - convert XML into a table of data values

xtract [-help] [-strict] [-mixed] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-select condition] [-equals str] [-contains str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-plg str] [-elg str] [-rst] [-clr] [-pfc str] [-deq str] [-wrp tag] [-def str] [-lbl str] [-element element] [-first element] [-last element] [-NAME] [-num element] [-len element] [-sum element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-bin element] [-bit element] [-encode element] [-upper element] [-lower element] [-title element] [-year element] [-translate element] [-terms element] [-words element] [-pairs element] [-reverse element] [-letters element] [-clauses element] [-indices element] [-e2index] [-revcomp] [-nucleic] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] [-head str] [-tail str] [-hd str] [-tl str] [-format fmt] [-unicode style] [-script style] [-mathml terse] [-filter element action target] [-verify] [-outline] [-synopsis] [-skip filename] [-examples] [-version]

xtract converts an XML document into a table of data values according to user-specified rules.
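As a rough mental model only, and not a description of how xtract is implemented, the conversion can be pictured as one output row per record and one tab-separated column per requested field. The Python sketch below mimics that idea with the standard library; the function name `xml_to_table` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, used only for illustration.
SAMPLE = """<Set>
  <Rec><Id>1</Id><Name>alpha</Name></Rec>
  <Rec><Id>2</Id><Name>beta</Name><Name>gamma</Name></Rec>
</Set>"""


def xml_to_table(xml_text, pattern, elements):
    """Return one tab-separated row per <pattern> record."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):
        fields = []
        for tag in elements:
            # Collect every matching node in document order.
            fields.extend(node.text or "" for node in record.iter(tag))
        rows.append("\t".join(fields))
    return rows


if __name__ == "__main__":
    for row in xml_to_table(SAMPLE, "Rec", ["Id", "Name"]):
        print(row)
```

Here the record name plays the role of the -pattern argument and the list of tags plays the role of -element arguments.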

-strict
Remove HTML and MathML tags.
-mixed
Allow mixed content XML.
-accent
Delete Unicode accents and diacritical marks.
-ascii
Convert Unicode to numeric HTML character entities.
-compress
Compress runs of spaces.
-stops
Retain stop words in selected phrases.

-input filename
Read XML from file instead of standard input.
-transform filename
File of substitutions for -translate.

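-pattern expr
Name of record within set.
-group expr, -block expr, -subset expr
Limit exploration to successively nested regions within each record. Use of different argument names allows command-line control of nested looping.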

DateRevised
Book/AuthorList
"PubmedArticleSet/*"
"History/**"
"*/Taxon"
"**/Gene-commentary"

-if expr [constraint]
Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]
Skip if element matches.
-and condition
Preceding and following tests must both pass.
-or condition
Any passing test suffices.
-else
Execute if conditional test failed.
-position pos
One of first, last, outer, inner, even, odd, or all.
-select condition
Select record subset by conditions.

-equals str
String must match exactly.
-contains str
Substring must be present.
-is-within str
String must be present.
-starts-with str
Substring must be at beginning.
-ends-with str
Substring must be at end.
-is-not str
String must not match.

-gt N
Greater than.
-ge N
Greater than or equal to.
-lt N
Less than.
-le N
Less than or equal to.
-eq N
Equal to.
-ne N
Not equal to.

-ret str
Override line break between patterns.
-tab str
Replace tab character between fields.
-sep str
Separator between group members.
-pfx str
Prefix to print before group.
-sfx str
Suffix to print after group.
-plg str
Prologue to print once before elements.
-elg str
Epilogue to print once after elements.
-rst
Reset -sep through -elg.
-clr
Clear queued tab separator.
-pfc str
Preface combines -clr and -pfx.
-deq str
Delete and replace queued tab separator.
-wrp tag
Wrap elements in XML object.
-def str
Default placeholder for missing fields.
-lbl str
Insert arbitrary text.
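One plausible reading of how the separator, prefix, suffix, and placeholder settings combine is sketched below in Python. This is an illustration under stated assumptions, not xtract's code: the function `format_row` and its parameter names are hypothetical, and details such as whether the prefix and suffix also wrap the placeholder are guesses.

```python
def format_row(groups, tab="\t", sep="\t", pfx="", sfx="", default="", ret="\n"):
    """Assemble one output line from a list of groups.

    groups  -- list of lists; each inner list holds the values of one group
    tab     -- string placed between fields
    sep     -- string placed between members of the same group
    pfx/sfx -- strings printed before and after each non-empty group
    default -- placeholder for a group with no values (assumed unwrapped)
    ret     -- string that ends the line
    """
    fields = []
    for members in groups:
        if members:
            fields.append(pfx + sep.join(members) + sfx)
        elif default:
            fields.append(default)
    return tab.join(fields) + ret


if __name__ == "__main__":
    # Three groups: an identifier, an author's name parts, and a missing field.
    print(format_row([["1"], ["RA", "Smith"], []], sep=" ", default="-"), end="")
```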

-element element
Print all items that match tag name.
-first element
Only print value of first item.
-last element
Only print value of last item.
-NAME
Record value in named variable.

-element Constructs

Caption
Initials,LastName
MedlineCitation/PMID
"**/Gene-commentary_accession"
PubDate/*
DescriptorName@MajorTopicYN
MedlineDate[1:4]
"Title[phospholipase | rattlesnake]"
Object Count
"#Author"
"%Title"
"^PMID"
"&NAME"

"+"
Object Name
"+"
"*"
"$"
"@"

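-num element
Count.
-len element
Length.
-sum element
Sum.
-min element
Minimum.
-max element
Maximum.
-inc element
Increment.
-dec element
Decrement.
-sub element
Difference.
-avg element
Average.
-dev element
Deviation.
-med element
Median.
-bin element
Binary.
-bit element
Bit count.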
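The terse descriptions above can be made concrete with a small Python sketch of several of these operations applied to a list of integer strings. It is only an interpretation: the function `numeric_ops` is hypothetical, and the average, deviation, median, and difference operations are left out because their exact rounding and ordering rules are not stated here.

```python
def numeric_ops(values):
    """Apply a few numeric operations to a list of integer strings."""
    nums = [int(v) for v in values]
    return {
        "num": len(nums),                           # count of items
        "sum": sum(nums),                           # total
        "min": min(nums),                           # smallest value
        "max": max(nums),                           # largest value
        "inc": [n + 1 for n in nums],               # each value plus one
        "dec": [n - 1 for n in nums],               # each value minus one
        "bin": [format(n, "b") for n in nums],      # binary representation
        "bit": [bin(n).count("1") for n in nums],   # number of set bits
    }


if __name__ == "__main__":
    for name, result in numeric_ops(["5", "12", "7"]).items():
        print(name, result)
```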

-encode element
URL-encode <, >, &, ", and ' characters.
-upper element
Convert text to uppercase.
-lower element
Convert text to lowercase.
-title element
Capitalize initial letters of words.
-year element
Extract first 4-digit year from string.
-translate element
Substitute values with -transform table.
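Two of these text operations are easy to picture in Python. The sketch below is an assumption-laden illustration rather than xtract's own logic: `first_year` takes the first run of four digits, and `title_case` upper-cases the first letter of each space-separated word and lower-cases the rest, which may differ from what xtract does with mixed-case input. Both function names are hypothetical.

```python
import re


def first_year(text):
    """Return the first run of four digits in text, or '' if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group(0) if match else ""


def title_case(text):
    """Capitalize the initial letter of each space-separated word."""
    return " ".join(word.capitalize() for word in text.split())


if __name__ == "__main__":
    print(first_year("1998 Dec-1999 Jan"))
    print(title_case("journal of snake venom"))
```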

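-terms element
Partition text at spaces.
-words element
Split at punctuation marks.
-pairs element
Adjacent informative words.
-reverse element
Reverse words in string.
-letters element
Separate individual letters.
-clauses element
Break at phrase separators.
-indices element
Word pair index generation.
-e2index
Create Entrez index XML.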

-revcomp
Reverse-complement nucleotide sequence.
-nucleic
Subrange determines forward or revcomp.
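For readers unfamiliar with the operation, the following Python sketch shows a standard reverse complement and one plausible reading of the subrange rule, in which a start position greater than the stop position is taken to mean the minus strand. The helper names `revcomp` and `nucleic`, the 1-based inclusive coordinates, and the handling of only A, C, G, and T are all assumptions made for illustration.

```python
# Complement table for unambiguous bases; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def nucleic(seq, start, stop):
    """Return a 1-based inclusive subrange of seq.

    If start > stop, the range is read as lying on the minus strand
    and the reverse complement of the covered bases is returned.
    """
    if start <= stop:
        return seq[start - 1:stop]
    return revcomp(seq[stop - 1:start])


if __name__ == "__main__":
    print(revcomp("ATGC"))
    print(nucleic("ATGCCC", 1, 3))
    print(nucleic("ATGCCC", 3, 1))
```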

-0-based element
Zero-based.
-1-based element
One-based.
-ucsc-based element
Half-open.
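The three conventions differ only in where counting starts and whether the end position is included. The Python sketch below shows the usual arithmetic for converting an interval between them; it illustrates the general conventions and makes no claim about which input convention xtract assumes. The function names are hypothetical.

```python
def one_based_to_zero_based(start, stop):
    """1-based closed [start, stop] -> 0-based closed interval."""
    return start - 1, stop - 1


def one_based_to_half_open(start, stop):
    """1-based closed [start, stop] -> 0-based half-open [start, stop)."""
    return start - 1, stop


def half_open_to_one_based(start, stop):
    """0-based half-open [start, stop) -> 1-based closed interval."""
    return start + 1, stop


if __name__ == "__main__":
    print(one_based_to_zero_based(100, 200))
    print(one_based_to_half_open(100, 200))
    print(half_open_to_one_based(99, 200))
```

A convenient property of the half-open form is that the interval length is simply stop minus start.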

-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
complete/partial
CDS/mRNA/...[,...]
INSDFeature_key/"#INSDInterval"/gene/product/... [...]

-head str
Print before everything else.
-tail str
Print after everything else.
-hd str
Print before each record.
-tl str
Print after each record.

Keep records that contain a given phrase.
Keep records that do not contain a given phrase.

-format fmt
copy
Fast block copy (still applies processing flags).
compact
Compress runs of spaces.
flush
Suppress line indentation.
indent
Indent according to nesting depth.
expand
Place each attribute on a separate line.
-unicode style
How to handle Unicode superscript and subscript digits (first converted to ASCII form in all cases).
fuse
Run them all together, with no additional markup.
space
Add spaces between digits in different positions.
period
Add periods between digits in different positions.
brackets
Surround superscripts by square brackets and subscripts by parentheses.
markdown
Surround superscripts with carets and subscripts with tildes.
slash
Add backslashes when going up in height and forward slashes when going down.
tag
Put superscripts in XML sup elements and subscripts in sub elements.
-script style
How to handle XML sup and sub elements (denoting superscripts and subscripts, respectively).
brackets
Surround superscripts by square brackets and subscripts by parentheses.
markdown
Surround superscripts with carets and subscripts with tildes.
-mathml terse
Flatten MathML markup tersely.

-filter element action target
Actions:
retain
Keep matching elements (no-op).
remove
Remove matching elements.
encode
HTML-escape special characters.
decode
Decode HTML escapes.
shrink
Compress runs of spaces.
expand
Place each attribute on a separate line.
accent
Strip off Unicode accents.

Targets:

content
Plain-text content.
cdata
CDATA blocks.
comment
Comments.
object
The whole object.
attributes
Attributes.
container
Start and end tags.

-outline
Display outline of XML structure.
-synopsis
Display count of unique XML paths.

-help
Print usage information and some example argument combinations.
-examples
Complete examples of edirect(1) and xtract usage.
-version
Print version number.

String constraints use case-insensitive comparisons.
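Case-insensitive matching of this kind can be sketched in Python as follows. The predicate names mirror the -equals, -contains, -starts-with, -ends-with, and -is-not arguments but are otherwise hypothetical, and -is-within is left out because its exact direction of comparison is not spelled out here.

```python
def equals(value, query):
    """Whole-string match, ignoring case."""
    return value.lower() == query.lower()


def contains(value, query):
    """Substring match anywhere in value, ignoring case."""
    return query.lower() in value.lower()


def starts_with(value, query):
    """Substring match at the beginning of value, ignoring case."""
    return value.lower().startswith(query.lower())


def ends_with(value, query):
    """Substring match at the end of value, ignoring case."""
    return value.lower().endswith(query.lower())


def is_not(value, query):
    """True when the whole string does not match, ignoring case."""
    return not equals(value, query)


if __name__ == "__main__":
    print(contains("Phospholipase A2", "phospholipase"))
    print(is_not("Review", "REVIEW"))
```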

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).
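To make the two notions concrete, the Python sketch below counts the objects with a given tag and measures the text length of each matching item in a small record. The sample XML and the helper names `object_count` and `item_length` are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with one title and two authors.
RECORD = ET.fromstring(
    "<Article><Title>Snake venom</Title>"
    "<Author>A</Author><Author>B</Author></Article>"
)


def object_count(record, tag):
    """Number of <tag> objects in the record (the '#' idea)."""
    return len(list(record.iter(tag)))


def item_length(record, tag):
    """Text length of each <tag> item in the record (the '%' idea)."""
    return [len(node.text or "") for node in record.iter(tag)]


if __name__ == "__main__":
    print(object_count(RECORD, "Author"))
    print(item_length(RECORD, "Title"))
```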

-words, -pairs, and -indices convert to lower case.

edirect(1), pm-index(1), pm-invert(1), pm-stash(1), rchive(1), transmute(1), xy-plot(1).

2019-02-26 NCBI