Downloadable Data Management (astropy.utils.data)#
Introduction#
A number of Astropy’s tools work with data sets that are either awkwardly
large (e.g., solar_system_ephemeris), regularly updated (e.g., IERS_B), or
both (e.g., IERS_A). This kind of data - authoritative data made available on
the Web, and possibly updated from time to time - is reasonably common in
astronomy. The Astropy Project therefore provides tools for working with such
data.
The primary tool for this is the astropy cache, a repository of downloaded
data indexed by the URL it was obtained from. download_file and the various
tools built upon it can use this cache to request the contents of a URL; if
they choose to use the cache, the data will only be downloaded if it is not
already present there. The tools can also be instructed to obtain a fresh copy
of data that is in the cache but has been updated online.
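The cache-or-download behavior just described can be sketched in a few lines of plain Python. This is a simplified stand-in, not astropy's actual implementation (the real cache persists to disk and is keyed by a hash of the URL), but it captures the semantics of requesting data with and without forcing a refresh:

```python
def cached_download(url, cache, fetch, update=False):
    """Return the data for url, downloading only when needed.

    cache: dict mapping URL -> previously downloaded data
    fetch: function that actually retrieves the URL's contents
    update: when True, re-download even if the URL is already cached
            (mimicking a cache-update request)
    """
    if update or url not in cache:
        cache[url] = fetch(url)
    return cache[url]

# Demonstration with a fake fetcher that counts real "downloads".
downloads = []

def fake_fetch(url):
    downloads.append(url)
    return f"contents of {url}"

cache = {}
cached_download("http://example.com/data", cache, fake_fetch)  # downloads
cached_download("http://example.com/data", cache, fake_fetch)  # cache hit
print(len(downloads))  # 1
cached_download("http://example.com/data", cache, fake_fetch, update=True)
print(len(downloads))  # 2
```

The second call performs no download at all, while `update=True` forces a fresh copy to replace the cached one.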
The astropy cache is stored in a centralized place (on Linux machines it
defaults to $HOME/.astropy/cache; see Configuration System (astropy.config)
for more details). You can check its location on your machine:
>>> import astropy.config.paths
>>> astropy.config.paths.get_cache_dir()
'/home/burnell/.astropy/cache'
This centralization means that the cache is persistent and shared between all
astropy runs in any virtualenv by one user on one machine (possibly more, if
your home directory is shared between multiple machines). This can
dramatically accelerate astropy operations and reduce the load on servers,
like those of the IERS, that were not designed for heavy Web traffic. If you
find the cache has corrupted or outdated data in it, you can remove an entry
or clear the whole thing with clear_download_cache.
The files in the cache directory are named according to a cryptographic hash of their URL (currently MD5, so in principle malevolent entities can cause collisions, though the security risks this poses are marginal at most). The modification times on these files normally indicate when they were last downloaded from the Internet.
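You can see how a URL maps to a cache entry name by hashing it yourself. This is a sketch of the naming scheme described above; the exact directory layout around the hash is an astropy implementation detail and may change between versions:

```python
import hashlib

def cache_key(url):
    """MD5 hex digest of the URL, as used to name cache entries."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# A 32-character hex string, safe as a filename even though the URL
# contains characters like ":" and "/".
print(cache_key("https://maia.usno.navy.mil/ser7/finals2000A.all"))
```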
Usage Within Astropy#
For the most part, you can ignore the caching mechanism and rely on astropy
to have the correct data when you need it. For example, precise time
conversions and sky locations need measured tables of the Earth’s rotation
from the IERS. The table IERS_Auto provides the infrastructure for many of
these calculations. It makes Earth rotation parameters available, and if you
request them for a time more recent than its tables cover, it will download
updated tables from the IERS. So, for example, asking what time it is in UT1
(a timescale that reflects the irregularity of the Earth’s rotation) probably
triggers a download of the IERS data:
>>> from astropy.time import Time
>>> Time.now().ut1
Downloading https://maia.usno.navy.mil/ser7/finals2000A.all
|============================================| 3.2M/3.2M (100.00%) 1s
<Time object: scale='ut1' format='datetime' value=2019-09-22 08:39:03.812731>
But running it a second time does not require any new download:
>>> Time.now().ut1
<Time object: scale='ut1' format='datetime' value=2019-09-22 08:41:21.588836>
Some data is also made available from the Astropy data server, either for use
within astropy or for your convenience. These files are most conveniently
accessed with the get_pkg_data_* functions:
>>> from astropy.utils.data import get_pkg_data_contents
>>> print(get_pkg_data_contents("coordinates/sites-un-ascii"))
# these are all mappings from the name in sites.json (which is ASCII-only) to the "true" unicode names
TUBITAK->TÜBİTAK
Note
Sometimes when downloading files from internet resources secured with
TLS/SSL, you may get an exception about a certificate verification error.
Typically this indicates that Python could not find an up-to-date collection
of root certificates on your system; this is especially common on Windows.
The problem can usually be resolved by installing the certifi package, which
Astropy will use, if available, to verify remote connections. In rare cases,
certificate verification may still fail if the remote server is misconfigured
(e.g., with expired certificates). In that case, you may pass the
allow_insecure=True argument to download_file() to allow the download with a
warning instead (not recommended unless you understand the potential risks).
Usage From Outside Astropy#
Users of astropy can also make use of astropy’s caching and downloading
mechanism. In its simplest form, this amounts to calling download_file with
the cache=True argument to obtain their data, from the cache if the data is
there:
>>> from astropy.utils.iers import IERS_B_URL, IERS_B
>>> from astropy.utils.data import download_file
>>> IERS_B.open(download_file(IERS_B_URL, cache=True))["year","month","day"][-3:]
<IERS_B length=3>
year month day
int64 int64 int64
----- ----- -----
2019 8 4
2019 8 5
2019 8 6
If users want to update the cache to a newer version of the data (note that
here the data was already up to date; users will have to decide for
themselves when to obtain new versions), they can use the cache='update'
argument:
>>> IERS_B.open(download_file(IERS_B_URL,
... cache='update')
... )["year","month","day"][-3:]
Downloading http://hpiers.obspm.fr/iers/eop/eopc04/eopc04_IAU2000.62-now
|=========================================| 3.2M/3.2M (100.00%) 0s
<IERS_B length=3>
year month day
int64 int64 int64
----- ----- -----
2019 8 18
2019 8 19
2019 8 20
If they are concerned that the primary source of the data may be overloaded
or unavailable, they can use the sources argument to provide a list of
sources to attempt downloading from, in order. This list need not include the
original source. Regardless, the data will be stored in the cache under the
original URL requested:
>>> f = download_file("ftp://ssd.jpl.nasa.gov/pub/eph/planets/bsp/de405.bsp",
... cache=True,
... sources=['https://data.nanograv.org/static/data/ephem/de405.bsp',
... 'ftp://ssd.jpl.nasa.gov/pub/eph/planets/bsp/de405.bsp'])
Downloading ftp://ssd.jpl.nasa.gov/pub/eph/planets/bsp/de405.bsp from https://data.nanograv.org/static/data/ephem/de405.bsp
|========================================| 65M/ 65M (100.00%) 19s
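The sources behavior - try each candidate location in order, but file the result under the originally requested URL - can be sketched like this. It is a simplified illustration using a plain dict, not astropy's implementation:

```python
def download_with_sources(url, cache, fetch, sources):
    """Try each source in order; store the result under the original url."""
    if url in cache:
        return cache[url]
    errors = {}
    for source in sources:
        try:
            cache[url] = fetch(source)
            return cache[url]
        except OSError as exc:
            errors[source] = exc
    raise OSError(f"all sources failed for {url}: {errors}")

# Demonstration: the primary server is down, but a mirror works.
def fetch(source):
    if "primary" in source:
        raise OSError("connection refused")
    return f"data from {source}"

cache = {}
data = download_with_sources(
    "ftp://primary.example/de405.bsp", cache, fetch,
    sources=["ftp://primary.example/de405.bsp",
             "https://mirror.example/de405.bsp"],
)
print(data)  # data from https://mirror.example/de405.bsp
# The cache entry is keyed by the URL originally asked for, not the mirror:
print("ftp://primary.example/de405.bsp" in cache)  # True
```

Because the cache is keyed by the requested URL, later lookups for the primary URL succeed even though the bytes came from the mirror.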
Cache Management#
Because the cache is persistent, it is possible for it to become
inconveniently large or filled with irrelevant data. While it is simply a
directory on disk, each file is supposed to represent the contents of a URL,
and many URLs do not make acceptable on-disk filenames (for example, those
containing troublesome characters like “:” and “~”). There is also reason to
worry that multiple astropy processes accessing the cache simultaneously
might corrupt it. The data is therefore stored in a subdirectory named after
the hash of the URL, and write access is handled in a way that is resistant
to concurrency problems. So access to the cache is most convenient through a
few helpers provided by data.
If your cache starts behaving oddly, you can use check_download_cache to
examine its contents and raise an exception if it finds any anomalies. If a
single file is undesired or damaged, it can be removed by calling
clear_download_cache with an argument that is the URL it was obtained from,
the filename of the downloaded file, or the hash of its contents. Should the
cache ever become badly corrupted, clear_download_cache with no arguments
will simply delete the whole directory, freeing the space and removing any
inconsistent data. Of course, if you remove data using either of these tools,
any processes currently using that data may be disrupted (and, under Windows,
deleting the cache may not be possible until those processes terminate). So
use clear_download_cache with care.
To check the total space occupied by the cache, use cache_total_size. The
contents of the cache can be listed with get_cached_urls, and the presence of
a particular URL in the cache can be tested with is_url_in_cache. More
general manipulations can be carried out using cache_contents, which returns
a dict mapping URLs to the on-disk filenames of their contents.
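Assuming cache_total_size is essentially the sum of the on-disk sizes of the files that cache_contents reports, you can compute the same number yourself from such a dict. The sketch below uses a stand-in dict of temporary files rather than a real populated cache:

```python
import os
import tempfile

def total_size(url_to_path):
    """Sum the on-disk sizes of cached files, given a
    cache_contents-style dict mapping URL -> local filename."""
    return sum(os.path.getsize(path) for path in url_to_path.values())

# Stand-in for cache_contents(): two fake "cached" files in a temp directory.
with tempfile.TemporaryDirectory() as d:
    paths = {}
    for name, payload in [("a", b"12345"), ("b", b"1234567890")]:
        p = os.path.join(d, name)
        with open(p, "wb") as f:
            f.write(payload)
        paths[f"http://example.com/{name}"] = p
    size = total_size(paths)

print(size)  # 15
```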
If you want to transfer the cache to another computer, or preserve its
contents for later use, you can use export_download_cache to produce a ZIP
file containing some or all of the cache contents, and import_download_cache
to load the astropy cache from such a ZIP file.
The Astropy cache has changed format twice - once in the Python 2 to Python 3
transition, and again before Astropy version 4.0.2 to resolve some
concurrency problems that arose on some compute clusters. Each version of the
cache lives in its own subdirectory, so old versions do not interfere with
new ones, but their contents are not used by this version and are not cleared
by clear_download_cache. To remove these old cache directories, you can run:
>>> from shutil import rmtree
>>> from os.path import join
>>> from astropy.config.paths import get_cache_dir
>>> rmtree(join(get_cache_dir(), 'download', 'py2'), ignore_errors=True)
>>> rmtree(join(get_cache_dir(), 'download', 'py3'), ignore_errors=True)
Using Astropy With Limited or No Internet Access#
You might want to use astropy on a telescope control machine behind a strict
firewall. Or you might be running continuous integration (CI) for a project
that uses astropy and want to avoid hammering astronomy servers on every pull
request for every architecture. Or you might not have access to US government
or military web servers. Whichever is the case, you may need to prevent
astropy from needing data from the Internet. There is no simple and complete
solution to this problem at the moment, but there are tools that can help.
Exactly which external data your project depends on will depend on which
parts of astropy you use and how. The most general solution is to use a
computer that can access the Internet to run a version of your calculation
that pulls in all of the data files you will require, including sufficiently
up-to-date versions of files like the regularly updated IERS data. Then, once
the cache on this connected machine is loaded with everything necessary,
transport the cache contents to your target machine by whatever means you
have available, whether by copying via an intermediate machine, a portable
disk drive, or some other tool. The cache directory itself is somewhat
portable between machines of the same UNIX flavour; this may be sufficient if
you can persuade your CI system to cache the directory between runs. For
greater portability, though, you can simply use export_download_cache and
import_download_cache, which are portable and allow adding files to an
existing cache directory.
If your application needs IERS data specifically, you can download the
appropriate IERS table, covering the appropriate time span, by any means you
find convenient. You can then load this file into your application and use
the resulting table rather than IERS_Auto. In fact, the IERS B table is small
enough that a version (not necessarily recent) is bundled with astropy as
astropy.utils.iers.IERS_B_FILE. Using a specific, non-automatic table also
has the advantage of giving you control over exactly which version of the
IERS data your application is using. See also Working offline.
If your issue is with certain specific servers, even if they are the ones
astropy normally uses, and you can anticipate exactly which files will be
needed (or just pick up after astropy fails to obtain them) and make those
files available somewhere else, you can request that they be downloaded to
the cache using download_file with the sources argument set to locations you
know do work. You can also set sources to an empty list to ensure that
download_file does not attempt to use the Internet at all.
If you have a particular URL that is giving you trouble, you can download it
using some other tool (e.g., wget), possibly on another machine, and then add
it to the cache with import_file_to_cache.
Astropy Data and Clusters#
Astronomical calculations often require the use of a large number of
different processes on different machines with a shared home filesystem. This
can pose certain complexities. In particular, if many different processes
attempt to download a file simultaneously, this can overload a server or
trigger security systems. The parallel access to the home directory can also
trigger concurrency problems in the Astropy data cache, though we have tried
to minimize these. We therefore recommend the following guidelines:

- Do one of the following:

  - Write a simple script that sets
    astropy.utils.iers.conf.auto_download = True and then accesses all cached
    resources your code will need, including source name lookups and IERS
    tables. Run it on the head node from time to time (frequently enough to
    beat the timeout astropy.utils.iers.conf.auto_max_age, which defaults to
    30 days) to ensure all data is up to date.
  - Set astropy.utils.iers.conf.auto_download = False in your code and set
    astropy.utils.iers.conf.iers_degraded_accuracy to either 'warn' or
    'ignore'. These prevent the normal exception that occurs if a time
    conversion falls outside the bounds of the available (local) IERS data.
    WARNING: only use this option if your application does not need
    full-accuracy time conversions.

- Make an Astropy config file (see Configuration System (astropy.config))
  that sets astropy.utils.iers.conf.auto_download = False so that the worker
  jobs will not suddenly notice an out-of-date table all at once and
  frantically attempt to download it.
- Optionally, in this file, also set
  astropy.utils.data.conf.allow_internet = False to prevent any attempt to
  download any file from the worker nodes; if you do this, you will need to
  override this setting in the script that does the actual downloading.

Now your worker nodes should not need to obtain anything from the Internet,
and all should run smoothly.
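Assuming the usual Astropy configuration file conventions (option names grouped under sections that mirror the module path, as in the generated astropy.cfg template), the worker-node settings above might look like the fragment below. Verify the exact section and option names against the template your installed astropy version writes, since this excerpt is an illustration, not a copy of that template:

```ini
## Hypothetical excerpt of an astropy.cfg for worker nodes.

[utils.iers]
## Do not let workers fetch IERS tables themselves.
auto_download = False
## Warn rather than raise if local IERS tables do not cover the
## requested time (only acceptable if full accuracy is not needed).
iers_degraded_accuracy = warn

[utils.data]
## Forbid all downloads from worker nodes; the head-node script that
## refreshes the cache must override this setting.
allow_internet = False
```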