datalad search(1) | datalad | datalad search(1) |
datalad search - search dataset metadata
datalad search [-h] [-d DATASET] [--reindex] [--max-nresults MAX_NRESULTS] [--mode {egrep,textblob,autofield}] [--full-record] [--show-keys {name,short,full}] [--show-query] [QUERY [QUERY ...]]
DataLad can search metadata extracted from a dataset and/or aggregated into a superdataset (see the AGGREGATE-METADATA command). This makes it possible to discover datasets, or individual files in a dataset even when they are not available locally.
Ultimately, DataLad metadata are a graph of linked data structures. However, this command does not (yet) support queries that can exploit all information stored in the metadata. At the moment, the following search modes are implemented. They represent different trade-offs between the expressiveness of a query and the computational and storage resources required to execute it.
- egrep (default)
- egrepcs [case-sensitive egrep]
- textblob
- autofield
An alternative default mode can be configured by tuning the configuration variable 'datalad.search.default-mode'::
[datalad "search"]
default-mode = egrepcs
Each search mode has its own default configuration for what kind of documents to query. The respective default can be changed via configuration variables::
[datalad "search"]
index-<mode_name>-documenttype = (all|datasets|files)
The EGREP and EGREPCS search modes are largely ignorant of the metadata structure, and simply perform matching of a search pattern against a flat string representation of metadata. This is advantageous when the query is simple and the metadata structure is irrelevant, or precisely known. Moreover, these modes do not require a search index, hence results can be reported without an initial latency for building a search index when the underlying metadata has changed (e.g. due to a dataset update). By default, these search modes only consider datasets and do not investigate records for individual files for speed reasons. Search results are reported in the order in which they were discovered.
Queries can make use of Python regular expression syntax (https://docs.python.org/3/library/re.html). In EGREP mode, matching is case-insensitive when the query does not contain upper case characters, but is case-sensitive when it does. In EGREPCS mode, matching is always case-sensitive. Expressions will match anywhere in a metadata string, not only at the start.
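The case rule described above can be sketched in a few lines of Python. This is an illustration only, not DataLad's implementation: the function name 'egrep_match' and its 'case_sensitive' parameter are hypothetical, and the upper case check is deliberately naive (a regex escape such as '\S' would also count as upper case).

```python
import re


def egrep_match(query: str, text: str, case_sensitive: bool = False) -> bool:
    """Return True if `query` (a Python regex) matches anywhere in `text`.

    case_sensitive=False models the EGREP rule: ignore case unless the
    query itself contains an upper case character.
    case_sensitive=True models EGREPCS: always match case-sensitively.
    """
    has_upper = any(ch.isupper() for ch in query)
    flags = 0 if (case_sensitive or has_upper) else re.IGNORECASE
    # re.search (not re.match) so the pattern may match anywhere in the string
    return re.search(query, text, flags) is not None
```

With this sketch, 'egrep_match("haxby", "James V. Haxby")' is true, while the same call with 'case_sensitive=True' is false because the lowercase query no longer matches the capitalized name.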
When multiple queries are given, all queries have to match for a search hit (AND behavior).
It is possible to search individual metadata key/value items by prefixing the query with a metadata key name, separated by a colon (':'). The key name can also be a regular expression to match multiple keys. A query match happens when any value of an item with a matching key name matches the query (OR behavior). See examples for more information.
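The interplay of AND (across queries) and OR (across matching keys) may be easier to see in code. The following is a minimal sketch under stated assumptions, not DataLad's implementation: metadata is modeled as a flat dict with dotted key names, key selectors must match the whole key name, a query is split at its first colon, and all function names and the sample record are hypothetical.

```python
import re


def _flags(pattern: str) -> int:
    # EGREP-style rule: case-insensitive unless the pattern has upper case
    return 0 if any(c.isupper() for c in pattern) else re.IGNORECASE


def _as_list(value):
    return value if isinstance(value, list) else [value]


def query_matches(query: str, record: dict) -> bool:
    """One query matches if ANY value under ANY matching key matches (OR)."""
    key_pat, sep, value_pat = query.partition(":")
    if not sep:
        # No key selector: match against a flat string of all values
        flat = " ".join(str(v) for val in record.values() for v in _as_list(val))
        return re.search(query, flat, _flags(query)) is not None
    for key, value in record.items():
        if not re.fullmatch(key_pat, key):  # assumption: whole-key match
            continue
        if any(re.search(value_pat, str(v), _flags(value_pat))
               for v in _as_list(value)):
            return True
    return False


def is_hit(queries, record: dict) -> bool:
    """A record is a hit only if ALL queries match (AND)."""
    return all(query_matches(q, record) for q in queries)


# Hypothetical metadata record, for illustration only
record = {
    "bids.type": "T1w",
    "nifti1.qform_code": "unknown",
    "nifti1.sform_code": "scanner",
    "bids.author": ["Halchenko", "Haxby"],
}
```

For this record, the queries 'bids.type:T1w' and 'nifti1.(q|s)form_code:scanner' together yield a hit, because the second query is satisfied by the 'sform_code' key. Replacing it with 'nifti1.qform_code:scanner' yields no hit.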
Examples:
Query for (what happens to be) an author::
% datalad search haxby
Queries are case-INsensitive when the query contains no upper case characters, and can be regular expressions. Use EGREPCS mode when it is desired to perform a case-sensitive lowercase match::
% datalad search --mode egrepcs 'halchenko.*haxby'
This search mode performs NO analysis of the metadata content. Therefore queries can easily fail to match. For example, the above query implicitly assumes that authors are listed in alphabetical order. If that is the case (which may or may not be true), the following query would yield NO hits::
% datalad search 'Haxby.*Halchenko'
The TEXTBLOB search mode represents an alternative that is more robust in such cases.
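The order dependence of a single regular expression can be reproduced with Python's 're' module alone. In this sketch the metadata string and its author order are made up for illustration. It also shows that, since multiple queries are combined with AND, two separate queries do not depend on the order of the names.

```python
import re

# Hypothetical flat metadata string; the author order is an assumption
metadata = "Yaroslav O. Halchenko, Michael Hanke, James V. Haxby"


def hit(queries, text):
    """AND over queries; each is a case-insensitive regex searched anywhere."""
    return all(re.search(q, text, re.IGNORECASE) for q in queries)


single_forward = hit(["halchenko.*haxby"], metadata)   # True
single_reversed = hit(["haxby.*halchenko"], metadata)  # False: wrong order
two_queries = hit(["haxby", "halchenko"], metadata)    # True: order-free
```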
For more complex queries multiple query expressions can be provided that all have to match to be considered a hit (AND behavior). This query discovers all files (non-default behavior) that match 'bids.type=T1w' AND 'nifti1.qform_code=scanner'::
% datalad -c datalad.search.index-egrep-documenttype=all search bids.type:T1w nifti1.qform_code:scanner
Key name selectors can also be expressions, which can be used to select multiple keys or construct "fuzzy" queries. In such cases a query matches when any item with a matching key matches the query (OR behavior). However, multiple queries are always evaluated using an AND conjunction. The following query extends the example above to match any files that have either 'nifti1.qform_code=scanner' or 'nifti1.sform_code=scanner'::
% datalad -c datalad.search.index-egrep-documenttype=all search bids.type:T1w 'nifti1.(q|s)form_code:scanner'
The TEXTBLOB search mode is very similar to the EGREP mode, but with a few key differences. A search index is built from the string representation of metadata records. By default, only datasets are included in this index, hence indexing usually completes within a few seconds, even for hundreds of datasets. This mode uses its own query language (not regular expressions), which is similar to that of other search engines. It supports logical conjunctions and fuzzy search terms. More information on this is available from the Whoosh project (search engine implementation):
- Description of the Whoosh query language:
http://whoosh.readthedocs.io/en/latest/querylang.html
- Description of a number of query language customizations that are enabled in DataLad, such as fuzzy term matching:
http://whoosh.readthedocs.io/en/latest/parsing.html#common-customizations
Importantly, search hits are scored and reported in order of descending relevance, hence limiting the number of search results is more meaningful than in the 'egrep' mode and can also reduce the query duration.
Examples:
Search for (what happens to be) two authors, regardless of the order in which those names appear in the metadata::
% datalad search --mode textblob halchenko haxby
Fuzzy search when you only have an approximate idea what you are looking for or how it is spelled::
% datalad search --mode textblob haxbi~
Very fuzzy search, when you are basically only confident about the first two characters and how it sounds approximately (or more precisely: allow for three edits and require matching of the first two characters)::
% datalad search --mode textblob haksbi~3/2
Combine fuzzy search with logical constructs::
% datalad search --mode textblob 'haxbi~ AND (hanke OR halchenko)'
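To make the "three edits, first two characters fixed" reading of 'haksbi~3/2' concrete, here is a small sketch of fuzzy term matching. It is an approximation, not Whoosh's implementation: it uses plain Levenshtein distance and lowercases both terms, and the function names and the default of one edit for a bare '~' are assumptions made for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(term: str, candidate: str,
                max_edits: int = 1, prefix_len: int = 0) -> bool:
    """Model of 'term~max_edits/prefix_len': the first `prefix_len`
    characters must agree, and the edit distance must not exceed
    `max_edits`."""
    term, candidate = term.lower(), candidate.lower()
    if term[:prefix_len] != candidate[:prefix_len]:
        return False
    return levenshtein(term, candidate) <= max_edits
```

Under this model, 'haxbi' is one edit away from 'haxby', and 'haksbi' is three edits away while sharing the prefix 'ha'. So 'fuzzy_match("haksbi", "Haxby", 3, 2)' is true, whereas the default single-edit search for 'haksbi' is not.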
The AUTOFIELD mode is similar to the 'textblob' mode, but builds a vastly more detailed search index that represents individual metadata variables as individual fields. By default, this search index includes records for datasets and individual files, hence it can grow very quickly into a huge structure that can easily take an hour or more to build and require more than a GB of storage. However, limiting it to documents on datasets (see the 'index-<mode_name>-documenttype' configuration above) retains the enhanced expressiveness of queries while dramatically reducing the resource demands.
Examples:
List names of search index fields (auto-discovered from the set of indexed datasets)::
% datalad search --mode autofield --show-keys name
Fuzzy search for datasets with an author that is specified in a particular metadata field::
% datalad search --mode autofield bids.author:haxbi~ type:dataset
Search for individual files that carry a particular description prefix in their 'nifti1' metadata::
% datalad search --mode autofield 'nifti1.description:FSL*' type:file
Search hits are returned as standard DataLad results. On the command line the '--output-format' (or '-f') option can be used to tweak results for further processing.
Examples:
Format search hits as a JSON stream (one hit per line)::
% datalad -f json search haxby
Custom formatting: which terms matched the query of particular results. Useful for investigating fuzzy search results::
% datalad -f '{path}: {query_matched}' search --mode autofield bids.author:haxbi~
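A JSON stream with one hit per line is straightforward to post-process. The sketch below reads such lines and extracts the 'path' and 'query_matched' fields used in the custom format above. The function name is hypothetical, and the sample lines are made up: the shape of 'query_matched' shown here is an assumption, and real result records carry more fields.

```python
import json


def summarize_hits(lines):
    """Yield (path, query_matched) for each JSON-encoded result line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        hit = json.loads(line)
        # .get() so that records lacking a field yield None instead of failing
        yield hit.get("path"), hit.get("query_matched")


# Hypothetical sample of two result lines, for illustration only
sample = [
    '{"path": "/data/ds1", "query_matched": {"bids.author": "Haxby"}}',
    '{"path": "/data/ds2", "query_matched": {"name": "haxby2001"}}',
]
```

In a pipeline, the same function could be given 'sys.stdin' instead of 'sample' to consume the output of 'datalad -f json search haxby' directly.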
datalad is developed by The DataLad Team and Contributors <team@datalad.org>.
2019-02-08 |