datalad foreach-dataset - run a command or Python code on
the dataset and/or each of its sub-datasets.
datalad foreach-dataset [-h]
[--cmd-type {auto|external|exec|eval}] [-d DATASET]
[--state {present|absent|any}] [-r] [-R LEVELS]
[--contains PATH] [--bottomup] [-s] [--output-streams
{capture|pass-through|relpath}] [--chpwd {ds|pwd}] [--safe-to-consume
{auto|all-subds-done|superds-done|always}] [-J NJOBS] [--version]
...
This command provides a convenience for the cases where no
dedicated DataLad command is provided to operate across the hierarchy of
datasets. It is very similar to the `git submodule foreach` command, with the
following major differences:
- by default (unless --subdatasets-only is given), it also operates on the
original dataset,
- subdatasets can be traversed in bottom-up order,
- commands can be executed in parallel (see the JOBS option) while still
respecting the order, e.g. in bottom-up order the command is executed in a
super-dataset only after it has completed in all of its subdatasets.
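The bottom-up ordering guarantee amounts to a post-order traversal: a dataset is visited only after all of its subdatasets. A minimal stand-alone sketch, using a made-up dataset tree rather than DataLad's actual internals:

```python
# Sketch of bottom-up (post-order) traversal over a hypothetical
# dataset hierarchy; names and structure are illustrative only,
# not DataLad's internal API.
tree = {
    "super": ["sub1", "sub2"],
    "sub1": ["sub1/subsub"],
    "sub2": [],
    "sub1/subsub": [],
}

def bottomup(ds, order=None):
    """Visit all subdatasets of `ds` before `ds` itself."""
    if order is None:
        order = []
    for sub in tree[ds]:
        bottomup(sub, order)
    order.append(ds)
    return order

print(bottomup("super"))  # every subdataset precedes its superdataset
```

In this ordering the super-dataset always comes last, which is what allows parallel execution to remain safe under the 'all-subds-done' constraint described below.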
Additional notes:
- for execution of "external" commands we use the
environment used to execute external git and git-annex commands.
--cmd-type external: A few placeholders are supported in the
command via Python format specification:

- "{pwd}" will be replaced with the full path of the current working
directory,
- "{ds}" and "{refds}" will provide instances of the dataset currently
operated on and the reference "context" dataset which was provided via the
``dataset`` argument,
- "{tmpdir}" will be replaced with the full path of a temporary directory.
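The placeholder substitution follows Python's `str.format` semantics. A minimal stand-alone illustration with stand-in values (real runs receive the actual dataset instances and paths from DataLad):

```python
import os
import tempfile

# Illustrative stand-ins for the values DataLad substitutes into the
# command template; "{pwd}" and "{tmpdir}" are the documented names.
cmd_template = "echo {pwd} {tmpdir}"
rendered = cmd_template.format(pwd=os.getcwd(),
                               tmpdir=tempfile.gettempdir())
print(rendered)
```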
Aggressively git clean all datasets, running 5 parallel jobs::
% datalad foreach-dataset -r -J 5 git clean -dfx
- COMMAND
- command for execution. A leading '--' can be used to disambiguate this
command from the preceding options to DataLad. For --cmd-type exec or eval
only a single command argument (Python code) is supported.
- -h, --help,
--help-np
- show this help message. --help-np forcefully disables the use of a pager
for displaying the help message
- --cmd-type
{auto|external|exec|eval}
- type of the command. 'external': run in a child process using the
dataset's runner; 'exec': Python source code to execute using 'exec()', no
value returned; 'eval': Python source code to evaluate using 'eval()', the
return value is placed into the 'result' field; 'auto': if used via the
Python API and `cmd` is a Python function, 'eval' is used, otherwise
'external' is assumed. Constraints: value must be one of ('auto',
'external', 'exec', 'eval') [Default: 'auto']
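The distinction between the 'exec' and 'eval' command types mirrors Python's built-ins of the same names, sketched here outside of DataLad:

```python
# 'exec' runs statements and returns nothing; 'eval' evaluates a single
# expression and yields a value (which foreach-dataset would place into
# the 'result' field of the returned record).
ret_exec = exec("x = 40 + 2")  # statements allowed, returns None
ret_eval = eval("40 + 2")      # expression only, returns 42
print(ret_exec, ret_eval)      # prints "None 42"
```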
- -d DATASET,
--dataset DATASET
- specify the dataset to operate on. If no dataset is given, an attempt is
made to identify the dataset based on the input and/or the current working
directory. Constraints: Value must be a Dataset or a valid identifier of a
Dataset (e.g. a path) or value must be NONE
- --state
{present|absent|any}
- indicate which (sub)datasets to consider: either only locally present,
absent, or any of those two kinds. Constraints: value must be one of
('present', 'absent', 'any') [Default: 'present']
- -r,
--recursive
- if set, recurse into potential subdatasets.
- -R LEVELS,
--recursion-limit LEVELS
- limit recursion into subdatasets to the given number of levels.
Constraints: value must be convertible to type 'int' or value must be
NONE
- --contains
PATH
- limit to the subdatasets containing the given path. If a root path of a
subdataset is given, the last considered dataset will be the subdataset
itself. This option can be given multiple times, in which case datasets
that contain any of the given paths will be considered. Constraints: value
must be a string or value must be NONE
- --bottomup
- whether to report subdatasets in bottom-up order along each branch in the
dataset tree, and not top-down.
- -s,
--subdatasets-only
- whether to exclude top level dataset. It is implied if a non-empty
CONTAINS is used.
- --output-streams
{capture|pass-through|relpath}, --o-s
{capture|pass-through|relpath}
- how outputs are handled. 'capture': capture outputs from 'cmd' and return
them in the result record ('stdout', 'stderr'); 'pass-through': pass outputs
through to the screen (and thus they are absent from the returned record);
'relpath': prefix each line of captured output with a relative path (similar
to what grep does) and write it to stdout and stderr. With 'relpath', the
path is relative to the top of the dataset if DATASET is specified, and
otherwise relative to the current directory. Constraints: value must be one
of ('capture', 'pass-through', 'relpath') [Default: 'pass-through']
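The 'relpath' mode can be pictured as prefixing each captured line with the dataset's relative path, much like grep prefixes matches with filenames. A rough illustration of that spirit (the exact prefixing format here is a guess, not DataLad's code):

```python
import os

# Hypothetical captured stdout from a command run in a subdataset "sub1".
captured = "file1.txt\nfile2.txt\n"
ds_path = os.path.join(os.getcwd(), "sub1")
relpath = os.path.relpath(ds_path)  # relative to the current directory

prefixed = "".join(
    f"{relpath}: {line}\n" for line in captured.splitlines()
)
print(prefixed, end="")
```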
- --chpwd
{ds|pwd}
- whether to change the working directory to the top of each dataset ('ds')
or keep the current working directory ('pwd') while executing the command.
- --safe-to-consume
{auto|all-subds-done|superds-done|always}
- Important only in the case of parallel (jobs greater than 1) execution.
'all-subds-done' instructs to not consider a superdataset until the command
finished execution in all of its subdatasets (this is the value for 'auto'
if traversal is bottom-up). 'superds-done' instructs to not process
subdatasets until the command finished in the super-dataset (this is the
value for 'auto' if traversal is not bottom-up, which is the default). With
'always' there is no ordering constraint between sub- and super-datasets.
Constraints: value must be one of ('auto', 'all-subds-done',
'superds-done', 'always') [Default: 'auto']
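The 'all-subds-done' constraint can be sketched as a readiness check: a dataset may be scheduled only once every one of its subdatasets has finished. A minimal simulation over a hypothetical hierarchy (not DataLad's actual scheduler):

```python
# Hypothetical dataset hierarchy: "super" contains two subdatasets.
subdatasets = {"super": {"sub1", "sub2"}}
done = set()  # datasets in which the command has finished

def safe_to_consume(ds):
    """True once all subdatasets of `ds` (if any) have finished."""
    return subdatasets.get(ds, set()) <= done

print(safe_to_consume("super"))  # False: no subdataset finished yet
done.update({"sub1", "sub2"})
print(safe_to_consume("super"))  # True: all subdatasets are done
```

'superds-done' is the mirror image of this check (subdatasets wait for their super-dataset), and 'always' simply skips the check.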
- -J NJOBS, --jobs
NJOBS
- how many parallel jobs (where possible) to use. "auto"
corresponds to the number defined by the 'datalad.runtime.max-annex-jobs'
configuration item. NOTE: This option can only parallelize input retrieval
(get) and output recording (save). DataLad does NOT parallelize your
scripts for you. Constraints: value must be convertible to type 'int' or
value must be NONE or value must be one of ('auto',)
- --version
- show the module and its version which provides the command
datalad is developed by The DataLad Team and Contributors
<team@datalad.org>.