datalad addurls(1) | General Commands Manual | datalad addurls(1) |
datalad addurls - create and update a dataset from a list of URLs.
datalad addurls [-h] [-d DATASET] [-t TYPE] [-x REGEXP] [-m FORMAT] [--key FORMAT] [--message MESSAGE] [-n] [--fast] [--ifexists {overwrite|skip}] [--missing-value VALUE] [--nosave] [--version-urls] [-c PROC] [-J NJOBS] [--drop-after] URL-FILE URL-FORMAT FILENAME-FORMAT
Several arguments take format strings. These are similar to normal Python format strings, with the names from URL-FILE (column names for a comma- or tab-separated file, or properties for JSON) available as placeholders. If URL-FILE is a CSV or TSV file, a positional index can also be used (e.g., "{0}" for the first column). Note that a placeholder cannot contain a ':' or '!'.
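As a rough illustration of how such placeholders resolve, the sketch below applies Python's own str.format to rows read with the csv module. It is not DataLad's implementation; the table contents and the helper name "expand" are invented for the example.

```python
import csv
import io

# Hypothetical two-column table standing in for URL-FILE.
URL_FILE = (
    "name,link\n"
    "alpha,https://example.com/a.dat\n"
    "beta,https://example.com/b.dat\n"
)


def expand(format_string, header, values):
    """Fill a format string from one row, by column name or by position."""
    by_name = dict(zip(header, values))
    # Positional arguments serve "{0}"-style placeholders,
    # keyword arguments serve "{name}"-style placeholders.
    return format_string.format(*values, **by_name)


rows = list(csv.reader(io.StringIO(URL_FILE)))
header, data = rows[0], rows[1:]

for values in data:
    # One formatted URL and one formatted file name per row.
    print(expand("{link}", header, values), expand("{0}.dat", header, values))
```

In Python's format syntax a ':' introduces a format specification and a '!' introduces a conversion, which is presumably why neither character can appear in a placeholder name.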
In addition, the FILENAME-FORMAT argument has a few special placeholders.
- _repindex
The constructed file names must be unique across all rows. To avoid
collisions, the special placeholder "_repindex" can be added to the
formatter. Its value will start at 0 and increment every time a file
name repeats.
- _url_hostname, _urlN, _url_basename*
Various parts of the formatted URL are available. Take
"http://datalad.org/asciicast/seamless_nested_repos.sh" as an example.
"datalad.org" is stored as "_url_hostname". Components of the URL's
path can be referenced as "_urlN". "_url0" and "_url1" would map to
"asciicast" and "seamless_nested_repos.sh", respectively. The final
part of the path is also available as "_url_basename".
This name is broken down further. "_url_basename_root" and
"_url_basename_ext" provide access to the root name and extension.
These values are similar to the result of os.path.splitext, but, in the
case of multiple periods, the extension is identified using the same
length heuristic that git-annex uses. As a result, the extension of
"file.tar.gz" would be ".tar.gz", not ".gz". In addition, the fields
"_url_basename_root_py" and "_url_basename_ext_py" provide access to
the result of os.path.splitext.
- _url_filename*
These are similar to _url_basename* fields, but they are obtained with
a server request. This is useful if the file name is set in the
Content-Disposition header.
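The special placeholders above can be approximated with the Python standard library. The following is a minimal sketch, not DataLad's code: the helper names "url_fields" and "with_repindex" are hypothetical, the git-annex extension heuristic and the server-side "_url_filename*" lookup are not reproduced, and counting repeats per fully formed name is an assumption about how "_repindex" is derived.

```python
import os.path
from collections import Counter
from urllib.parse import urlparse


def url_fields(url):
    """Return a subset of the URL-derived fields described above."""
    parsed = urlparse(url)
    # Path components, dropping the empty string before the leading "/".
    parts = [part for part in parsed.path.split("/") if part]
    fields = {"_url_hostname": parsed.netloc}
    for index, part in enumerate(parts):
        fields["_url{}".format(index)] = part
    basename = parts[-1] if parts else ""
    # Plain os.path.splitext: only the last suffix counts as the extension.
    root_py, ext_py = os.path.splitext(basename)
    fields["_url_basename"] = basename
    fields["_url_basename_root_py"] = root_py
    fields["_url_basename_ext_py"] = ext_py
    return fields


def with_repindex(names):
    """Pair each name with a counter that starts at 0 and grows on repeats."""
    seen = Counter()
    result = []
    for name in names:
        result.append((name, seen[name]))
        seen[name] += 1
    return result


fields = url_fields("http://datalad.org/asciicast/seamless_nested_repos.sh")
for key in sorted(fields):
    print(key, fields[key])
print(with_repindex(["a.png", "b.png", "a.png"]))
```

Note that os.path.splitext("file.tar.gz") yields the extension ".gz", which is the behavior exposed through the "_py" fields and the reason the git-annex-style fields exist alongside them.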
Consider a file "avatars.csv" that contains::
who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200
To download each link into a file name composed of the 'who' and 'ext' fields, we could run::
$ datalad addurls -d avatar_ds --fast avatars.csv '{link}' '{who}.{ext}'
The `-d avatar_ds` option creates a new dataset in "$PWD/avatar_ds".
If we were already in a dataset and wanted to create a new subdataset in an "avatars" subdirectory, we could use "//" in the FILENAME-FORMAT argument::
$ datalad addurls --fast avatars.csv '{link}' 'avatars//{who}.{ext}'
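To make the role of "//" concrete, the sketch below splits a formatted file name at each "//" and reports which leading directories would be dataset boundaries. This is an illustration only: the function name is hypothetical, and the treatment of more than one "//" in a name is an assumption, since the example above shows a single boundary.

```python
def split_on_dataset_boundaries(name):
    """Return (plain_path, subdataset_paths) for a name that may contain "//"."""
    pieces = name.split("//")
    # Every prefix that ends at a "//" is taken to be a subdataset path
    # (assumed behavior for names with several boundaries).
    subdatasets = ["/".join(pieces[:i]) for i in range(1, len(pieces))]
    return "/".join(pieces), subdatasets


print(split_on_dataset_boundaries("avatars//neurodebian.png"))
print(split_on_dataset_boundaries("neurodebian.png"))
```

For "avatars//neurodebian.png" the sketch reports the plain path "avatars/neurodebian.png" with "avatars" as the only subdataset; a name without "//" has no subdataset at all.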
If the information is represented as JSON lines instead of comma-separated values or a JSON array, you can use a utility like jq to transform the JSON lines into an array that addurls accepts::
$ ... | jq --slurp . | datalad addurls - '{link}' '{who}.{ext}'
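If jq is not available, the same transformation can be sketched in a few lines of Python. The helper name "slurp_json_lines" is hypothetical, and the records simply reuse the avatars data from the earlier example in JSON-lines form.

```python
import json


def slurp_json_lines(text):
    """Collect one-JSON-value-per-line input into a single list."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


json_lines = (
    '{"who": "neurodebian", "ext": "png", '
    '"link": "https://avatars3.githubusercontent.com/u/260793"}\n'
    '{"who": "datalad", "ext": "png", '
    '"link": "https://avatars1.githubusercontent.com/u/8927200"}\n'
)

records = slurp_json_lines(json_lines)
# A single JSON array, suitable for piping into "datalad addurls -".
print(json.dumps(records, indent=2))
```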
For users familiar with 'git annex addurl': a large part of this
plugin's functionality can be viewed as transforming data from
URL-FILE into a "url filename" format that is fed to 'git annex addurl
--batch --with-files'.
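The transformation described here can be sketched for the "avatars.csv" example as follows. This is a simplified stand-in rather than the plugin's code: the helper name "url_filename_lines" is hypothetical, and it only formats the lines without invoking git-annex.

```python
import csv
import io

AVATARS_CSV = """\
who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200
"""


def url_filename_lines(table_text, url_format, filename_format):
    """Format each row of a CSV table as a "url filename" line."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [
        "{} {}".format(url_format.format(**row), filename_format.format(**row))
        for row in reader
    ]


lines = url_filename_lines(AVATARS_CSV, "{link}", "{who}.{ext}")
print("\n".join(lines))
```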
datalad is developed by The DataLad Team and Contributors <team@datalad.org>.
2021-02-04 | datalad addurls 0.14.0 |