GETDATA(1) | User Contributed Perl Documentation | GETDATA(1) |
getData - retrieves databases from the Internet
getData [ --mirrordir <path> ] <list of db names>
getData --list
Bioinformatics has the intrinsic problem of bringing biological data to the end user. Astronomers face the equivalent problem, and particle physicists have come up with first the web and then computational grids to address theirs. Debian helps with the programs but will not provide such huge and frequently updated datasets, not even in volatile.debian.org. Most bioinformatics researchers will not need many of these databases, and many will gladly continue to use public services remotely.
For those who need a set of databases on a regular basis, this script is a first step towards automating the burden of downloading the data and updating indices and the like. The world has seen such magic before with the Lion Biosciences Prisma tool (http://bib.oxfordjournals.org/cgi/reprint/3/4/389.pdf), but how about something simpler (as a start) that at least gets close to what we desire and is Free? The aim must be to address the needs of all (most) communities, not only of the bioinformatics world. The seed was hence made with databases from astronomy.
Please contact the Debian-Med community if you consider this program to be almost ready for your needs and explain what still needs to be added. Public databases that you managed to integrate with this system are also very warmly welcomed as feedback.
this help
Unfortunately, the configuration has not yet been modularised. It all needs to happen within the getData script itself.
Databases for download and their post-processing are specified in two different locations. One is the getData script itself, the other is the set of files stored in /etc/getData.d. Either defines elements of a fairly large hash. The key is the identifier that is also shown by 'getData --list'. The value is a reference to another hash, which assigns values to all the properties that a database has for its download and post-processing:
"post-download" => "ln -s ssd.jpl.nasa.gov/pub/eph/export/unix/unxp*.405 ."
Some more effort has been put into TrEMBL for the merging of releases with subsequent updates and the indexing for EMBOSS:
"d=uncompressed; if [ ! -d \$d ]; then mkdir \$d; fi; " ."rm -rf \$d/trembl.dat; " ."(find ftp.ebi.ac.uk -name '*.dat.gz' | xargs -r zcat ) > \$d/trembl.dat; " ."[ -x /usr/bin/dbxflat ] " . "&& cd \$d && " . "dbxflat -dbresource embl -dbname trembllocal -idformat swiss -filenames=trembl.dat -fields id,acc -auto",
The dots concatenate strings in Perl, which helps the readability of the code. When writing these scripts, please be aware that newlines do not separate the individual commands here. Semicolons are required.
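Once Perl has joined the fragments, the shell receives a single line of text. The following is a minimal sketch of that effect, not code taken from getData: the string in post is a shortened, hypothetical stand-in for a "post-download" value, and run_post, demo.dat and the temporary working directory are illustrative assumptions.

```shell
# Hypothetical stand-in for a "post-download" value after Perl has joined
# its fragments: the shell sees ONE line, so every command has to be
# terminated by ';' (or chained with '&&').
post='d=uncompressed; if [ ! -d "$d" ]; then mkdir "$d"; fi; rm -f "$d/demo.dat"; echo "ready: $d"'

# run_post is an illustrative helper, not part of getData. It runs the
# one-line command string inside the given directory, in a subshell so
# that the caller's working directory is left untouched.
run_post() {
    (
        cd "$2" && sh -c "$1"
    )
}

# Scratch directory standing in for a mirror directory (assumption).
workdir=$(mktemp -d)
run_post "$post" "$workdir"
```

Dropping a semicolon between two fragments would fuse the neighbouring commands into one, since the concatenation inserts nothing between them.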
The following lists the identifiers and descriptions of the first 4 databases that are available via getData on your system.
./getData --mirrordir=/local/databases/mirrored --list | head -n 4
To install a particular database, simply give its name as an argument. If the installation is performed in a directory other than the default, --mirrordir needs to be set again.
./getData swiss.dat
To remove the database again, pass the --remove flag:
./getData --remove swiss.dat
To perform only the indexing and skip the download, run the following. Be careful: this is dangerous, since the index files will then look newer than the database actually is.
./getData --post swiss.dat
The --config flag is a special exception among these extra options in that it takes a list of extra arguments. Each denotes a particular system for which this database may be of interest. Two systems are supported today:
We now need a mechanism with which packages can specify hooks that shall be called upon an update of a database. But we cannot assume that every indexing that can be performed because of the installation of some package is also desired by the user. How to configure this properly is left to be decided.
http://debian-med.alioth.debian.org, http://wiki.debian.org/DebianMed, /etc/getData.conf
This script was prepared by Steffen Moeller <moeller@debian.org> and Charles Plessy <debian-no-spam@plessy.org> and is distributed under the terms of the GNU General Public License (GPL). On Debian systems, this license can be found under /usr/share/common-licenses/GPL.
2020-11-29 | perl v5.32.0 |