XTRACT(1) NCBI Entrez Direct User's Manual XTRACT(1)

xtract - NCBI Entrez Direct XML conversion and transformation tool

xtract [-help] [-strict] [-mixed] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-element element] [-first element] [-last element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-bin element] [-bit element] [-encode element] [-plain element] [-upper element] [-lower element] [-chain element] [-title element] [-year element] [-doi element] [-translate element] [-terms element] [-words element] [-pairs element] [-order element] [-reverse element] [-letters element] [-clauses element] [-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] [-histogram] [-e2index] [-indices element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort element] [-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]

xtract converts an XML document into a table of data values according to user-specified rules.

-strict
Remove HTML and MathML tags.
-mixed
Allow mixed content XML.
-accent
Delete Unicode accents and diacritical marks.
-ascii
Convert Unicode to numeric HTML character entities.
-compress
Compress runs of spaces.
-stops
Retain stop words in selected phrases.

-input filename
Read XML from file instead of standard input.
-transform filename
File of substitutions for -translate.

-pattern expr
-group expr
-block expr
-subset expr
Name of record within set. Use of different argument names allows command-line control of nested looping.
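
As a rough illustration of the nested looping described above, the following Python sketch (an analogue, not xtract itself) mimics what a command such as `xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -sep " " -element Initials,LastName` is meant to produce: one row per record, with an inner loop over each author. The sample XML and the name `rows` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample data, shaped like a PubMed record set.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article><AuthorList>
        <Author><LastName>Smith</LastName><Initials>JA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>B</Initials></Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <Article><AuthorList>
        <Author><LastName>Lee</LastName><Initials>C</Initials></Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def rows(xml_text):
    """Return one tab-delimited row per PubmedArticle."""
    root = ET.fromstring(xml_text.strip())
    table = []
    for record in root.iter("PubmedArticle"):            # outer loop: -pattern
        fields = [record.findtext("MedlineCitation/PMID")]
        for author in record.iter("Author"):             # inner loop: -block
            initials = author.findtext("Initials")
            last_name = author.findtext("LastName")
            fields.append(" ".join([initials, last_name]))  # -sep " "
        table.append("\t".join(fields))                  # fields joined by tab
    return table

for row in rows(SAMPLE):
    print(row)
```

Each record yields a single line, and the inner loop contributes one field per author, which is the behavior the different argument names are there to control.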

-path path
Explore by list of adjacent object names.

DateRevised
Book/AuthorList
MedlineCitation/Article/Journal/JournalIssue/PubDate
"PubmedArticleSet/*"
"History/**"
"*/Taxon"
"**/Gene-commentary"

-if expr [constraint]
Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]
Skip if element matches.
-and condition
Preceding and following tests must both pass.
-or condition
Any passing test suffices.
-else
Execute if conditional test failed.
-position pos
first/last/outer/inner/even/odd/all.

-equals str
String must match exactly.
-contains str
Substring must be present.
-is-within str
String must be present.
-starts-with str
Substring must be at beginning.
-ends-with str
Substring must be at end.
-is-not str
String must not match.
-is-before str
First string < second string.
-is-after str
First string > second string.
-matches str
Matches without commas or semicolons.
-resembles str
Requires all words, but in any order.
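
The simpler string constraints can be pictured with the Python sketch below. It is only an approximation under two assumptions: comparisons ignore case (as stated in the notes on this page), and the "first string" is the element value while the "second string" is the command-line argument. The function names are hypothetical, and -is-within, -matches, and -resembles are left out because their exact rules are not spelled out here.

```python
def equals(value, arg):
    # -equals: whole string must match, ignoring case
    return value.lower() == arg.lower()

def contains(value, arg):
    # -contains: argument must occur somewhere in the value
    return arg.lower() in value.lower()

def starts_with(value, arg):
    # -starts-with: argument must be at the beginning
    return value.lower().startswith(arg.lower())

def ends_with(value, arg):
    # -ends-with: argument must be at the end
    return value.lower().endswith(arg.lower())

def is_not(value, arg):
    # -is-not: whole string must not match
    return not equals(value, arg)

def is_before(value, arg):
    # -is-before: value sorts before the argument (assumed ordering)
    return value.lower() < arg.lower()

def is_after(value, arg):
    # -is-after: value sorts after the argument (assumed ordering)
    return value.lower() > arg.lower()

print(equals("Journal Article", "journal article"))
print(starts_with("Phospholipase A2", "phospho"))
```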

-is-equal-to expr
Object values must match.
-differs-from expr
Object values must differ.

-gt N
Greater than.
-ge N
Greater than or equal to.
-lt N
Less than.
-le N
Less than or equal to.
-eq N
Equal to.
-ne N
Not equal to.

-ret str
Override line break between patterns.
-tab str
Replace tab character between fields.
-sep str
Separator between group members.
-pfx str
Prefix to print before group.
-sfx str
Suffix to print after group.
-rst
Reset -sep through -elg.
-clr
Clear queued tab separator.
-pfc str
Preface combines -clr and -pfx.
-deq str
Delete and replace queued tab separator.
-def str
Default placeholder for missing fields.
-lbl str
Insert arbitrary text.
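
How the main formatting arguments combine can be sketched in Python as follows. This is a simplified model, not xtract's actual output logic: it assumes that -tab and -sep default to a tab and -ret to a newline, and it ignores the queued-separator behavior that -clr, -pfc, and -deq adjust. The function `render_record` and its parameter names are hypothetical.

```python
def render_record(groups, sep="\t", tab="\t", ret="\n", pfx="", sfx="", default=""):
    """Format one record.

    groups  -- list of fields, each a list of values selected together
    sep     -- placed between members of one group        (-sep)
    tab     -- placed between fields                       (-tab)
    ret     -- placed at the end of the record             (-ret)
    pfx/sfx -- printed before and after each group         (-pfx, -sfx)
    default -- placeholder used when a group is empty      (-def)
    """
    fields = []
    for members in groups:
        if members:
            fields.append(pfx + sep.join(members) + sfx)
        elif default:
            fields.append(default)
    return tab.join(fields) + ret

line = render_record([["111"], ["JA Smith", "B Jones"]], sep=", ")
print(line, end="")
```

With `sep=", "` the authors stay in one field while the identifier and the author list remain separated by a tab.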

-set tag
XML tag for entire set.
-rec tag
XML tag for each record.
-wrp tag
Wrap elements in XML object.
-enc tag
Encase instance in XML object.
-plg str
Prologue to print before instance.
-elg str
Epilogue to print after instance.
-pkg tag
Package subset in XML object.
-fwd str
Foreword to print before subset.
-awd str
Afterword to print after subset.

-element element
Print all items that match tag name.
-first element
Only print value of first item.
-last element
Only print value of last item.
-NAME
Record value in named variable.
--STATS
Accumulate values into variable.

-element Constructs

Caption
Initials,LastName
MedlineCitation/PMID
"**/Gene-commentary_accession"
PubDate/*
DescriptorName@MajorTopicYN
MedlineDate[1:4]
"Title[phospholipase | rattlesnake]"
Object Count
"#Author"
"%Title"
"^PMID"
"&NAME"

"+"
Object Name
"+"
"*"
"$"
"@"

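-num element
Count.
-len element
Length.
-sum element
Sum.
-min element
Minimum.
-max element
Maximum.
-inc element
Increment.
-dec element
Decrement.
-sub element
Difference.
-avg element
Average.
-dev element
Deviation.
-med element
Median.
-mul element
Product.
-div element
Quotient.
-mod element
Remainder.
-bin element
Binary.
-bit element
Bit count.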
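
A few of these reductions are easy to picture in Python. The sketch below assumes the selected values are integer strings; the names are hypothetical. It leaves out -avg, -dev, -med, -sub, -div, and -mod because this page does not say how non-integer results or operand order are handled.

```python
def summarize(values):
    """Apply some of the numeric reductions to a list of integer strings."""
    nums = [int(v) for v in values]
    return {
        "num": len(nums),   # -num: count of matching objects
        "sum": sum(nums),   # -sum
        "min": min(nums),   # -min
        "max": max(nums),   # -max
    }

def product(values):
    # -mul: product of all values
    result = 1
    for v in values:
        result *= int(v)
    return result

def binary(n):
    # -bin: binary representation of an integer
    return format(int(n), "b")

def bit_count(n):
    # -bit: number of 1 bits in the binary representation
    return bin(int(n)).count("1")

print(summarize(["3", "1", "8"]))
print(binary("10"), bit_count("10"))
```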

-encode element
XML-encode <, >, &, ", and ' characters.
-plain element
Remove embedded mixed-content markup tags.
-upper element
Convert text to uppercase.
-lower element
Convert text to lowercase.
-chain element
Change spaces to underscores.
-title element
Capitalize initial letters of words.
-year element
Extract first 4-digit year from string.
-doi element
Add https://doi.org/ prefix, URL encode.
-translate element
Substitute values with -transform table.
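
Several of these conversions have close Python equivalents, shown below as a sketch. The function names are hypothetical, and details that this page leaves open are assumptions: in particular, which characters -doi leaves unescaped, how -title treats the remaining letters of each word, and how -year handles longer digit runs.

```python
import re
from urllib.parse import quote
from xml.sax.saxutils import escape

def encode(text):
    # -encode: escape <, >, &, " and ' for use inside XML
    return escape(text, {'"': "&quot;", "'": "&apos;"})

def chain(text):
    # -chain: spaces become underscores
    return text.replace(" ", "_")

def title(text):
    # -title: capitalize the first letter of each space-separated word
    return " ".join(w[:1].upper() + w[1:] for w in text.split(" "))

def year(text):
    # -year: first run of four digits, or empty string if none
    match = re.search(r"\d{4}", text)
    return match.group(0) if match else ""

def doi(text):
    # -doi: prefix with the resolver URL and percent-encode the identifier
    # (quote() keeps "/" unescaped by default; that choice is an assumption)
    return "https://doi.org/" + quote(text)

print(year("1998 Dec-1999 Jan"))
print(doi("10.1000/abc 1"))
```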

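-terms element
Partition text at spaces.
-words element
Split at punctuation marks.
-pairs element
Adjacent informative words.
-order element
Rearrange words in sorted order.
-reverse element
Reverse words in string.
-letters element
Separate individual letters.
-clauses element
Break at phrase separators.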
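
The difference between these splitting modes is easiest to see side by side. The Python sketch below is an approximation with hypothetical function names; it lowercases the output of `words` because the notes on this page say -words converts to lower case, and it omits -pairs and -clauses, whose stop-word and separator rules are not given here.

```python
import re

def terms(text):
    # -terms: partition at spaces only, punctuation stays attached
    return text.split()

def words(text):
    # -words: split at punctuation as well, lowercased
    return re.findall(r"[a-z0-9]+", text.lower())

def order(text):
    # -order: words rearranged in sorted order
    return " ".join(sorted(text.split()))

def reverse(text):
    # -reverse: words in reverse order
    return " ".join(reversed(text.split()))

def letters(text):
    # -letters: individual characters
    return list(text)

sample = "Alpha-beta, Gamma"
print(terms(sample))
print(words(sample))
```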

-replace
Substitute text using regular expressions.
-reg target
Target expression.
-exp replacement
Replacement pattern.

-revcomp
Reverse complement nucleotide sequence.
-nucleic
Subrange determines forward or revcomp.
-fasta
Split sequence into blocks of 50 letters.
-ncbi2na
Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
-ncbi4na
Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
-molwt
Calculate molecular weight of peptide.
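
The Python sketch below illustrates three of these operations and shows why an expanded ncbi2na result may need truncating: the encoding packs four bases into each byte (two bits per base, A=0, C=1, G=2, T=3, first base in the high-order bits), so the final byte can carry padding. It assumes the packed data arrives as hexadecimal text; the function names are hypothetical, and `revcomp` here handles only unambiguous bases.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    # Complement every base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]

def fasta_blocks(seq, width=50):
    # Split a sequence into consecutive blocks of `width` letters.
    return [seq[i:i + width] for i in range(0, len(seq), width)]

def ncbi2na_to_iupac(hex_text, length=None):
    # Unpack four 2-bit bases per byte, high-order bits first.
    bases = "ACGT"
    out = []
    for byte in bytes.fromhex(hex_text):
        for shift in (6, 4, 2, 0):
            out.append(bases[(byte >> shift) & 3])
    seq = "".join(out)
    # Without a known length, padding bases from the last byte remain.
    return seq if length is None else seq[:length]

print(revcomp("ATGC"))
print(ncbi2na_to_iupac("1B40"))      # includes padding from the last byte
print(ncbi2na_to_iupac("1B40", 5))   # truncated to the real length
```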

-0-based element
Zero-based.
-1-based element
One-based.
-ucsc-based element
Half-open.
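
For readers unfamiliar with the three labels, the Python sketch below shows how the same interval is written under each coordinate convention. It describes the conventions in general; which convention xtract assumes for the incoming element is not stated on this page, so the direction of conversion chosen here (starting from one-based, closed coordinates) is only an assumption, and the function names are hypothetical.

```python
def one_based_to_zero_based(start, stop):
    # Closed interval, counting from 0: both ends shift down by one.
    return start - 1, stop - 1

def one_based_to_half_open(start, stop):
    # Zero-based start, exclusive end: only the start shifts.
    return start - 1, stop

def half_open_length(start, stop):
    # In half-open coordinates the length is a plain difference.
    return stop - start

# The first ten bases of a sequence under each convention.
print((1, 10), one_based_to_zero_based(1, 10), one_based_to_half_open(1, 10))
```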

-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
complete/partial
CDS/mRNA/...[,...]
INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]

-histogram
Collects data for sort-uniq-count(1) on entire set of records.
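
The kind of tally this option feeds into can be sketched in Python with a counter. The output layout below (count, tab, value, sorted by value) is an assumption about what a sort-uniq-count style summary looks like, and `histogram` is a hypothetical name.

```python
from collections import Counter

def histogram(values):
    """Count how often each value occurs across all records."""
    counts = Counter(values)
    return [f"{n}\t{v}" for v, n in sorted(counts.items())]

for line in histogram(["Review", "Letter", "Review"]):
    print(line)
```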

-e2index
Create Entrez index XML.
-indices element
Index normalized words.

-head str
Print before everything else.
-tail str
Print after everything else.
-hd str
Print before each record.
-tl str
Print after each record.

-select condition
Select record subset by conditions.
-in filename
File of identifiers to use for selection.

-sort element
Element to use as sort key.

-format fmt
Reformat XML. The fmt argument selects one of the following styles:
Fast block copy (still applies processing flags).
Compress runs of spaces.
Suppress line indentation.
Indent according to nesting depth.
Place each attribute on a separate line.

-verify
Report XML data integrity problems.

-outline
Display outline of XML structure.
-synopsis
Display individual XML paths.
-contour [delimiter]
Display XML paths to leaf nodes (delimited by / by default).

-help
Print usage information and some example argument combinations.
-examples
Complete examples of edirect(1) and xtract usage.
-unix
Illustrate common Unix command arguments.
-version
Print version number.

String constraints use case-insensitive comparisons.

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).
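
In other words, a `#` prefix asks how many objects of that name exist, while a `%` prefix asks how long an item's text is. A minimal Python analogue, using a made-up record and hypothetical function names:

```python
import xml.etree.ElementTree as ET

# Made-up record for illustration.
RECORD = ET.fromstring(
    "<Record>"
    "<Title>Rattlesnake venom</Title>"
    "<Author><LastName>Smith</LastName></Author>"
    "<Author><LastName>Jones</LastName></Author>"
    "</Record>"
)

def object_count(record, tag):
    # Analogue of "#Author" (or -num Author): number of matching objects
    return sum(1 for _ in record.iter(tag))

def item_length(record, tag):
    # Analogue of "%Title" (or -len Title): length of the item's text
    return len(record.findtext(tag) or "")

print(object_count(RECORD, "Author"))
print(item_length(RECORD, "Title"))
```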

-words, -pairs, and -indices convert to lower case.

download-ncbi-data(1), edirect(1), esample(1), index-extras(1), index-pubmed(1), pm-index(1), pm-invert(1), pm-stash(1), rchive(1), sort-uniq-count(1), transmute(1), xml2tbl(1), xy-plot(1).

2021-03-07 NCBI