INTAKE(1) | intake | INTAKE(1) |
intake - Intake Documentation
Taking the pain out of data access and distribution
Intake is a lightweight package for finding, investigating, loading and disseminating data. It will appeal to different groups for some of the reasons below, but is useful for all and acts as a common platform that everyone can use to smooth the progression of data from developers and providers to users.
Intake contains the following main components. You do not need to use them all! The library is modular; use only the parts you need.
See the executable tutorial for data scientists:
https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdata_scientist.ipynb
See the executable tutorial for data engineers:
https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdata_engineer.ipynb
See the executable tutorial for developers:
https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdev.ipynb
The Start here document contains the sections that all users new to Intake should read through. Use Cases - I want to... shows specific problems that Intake solves. For a brief demonstration, which you can execute locally, go to Quickstart. For a general description of all of the components of Intake and how they fit together, go to Overview. Finally, for some notebooks using Intake and articles about Intake, go to Examples and intake-examples. These and other documentation pages will make reference to concepts that are defined in the Glossary.
These documents will familiarise you with Intake, show you some basic usage and examples, and describe Intake's place in the wider python data world.
This guide will show you how to get started using Intake to read data, and give you a flavour of how Intake feels to the Data User. It assumes you are working in either a conda or a virtualenv/pip environment. For notebooks with executable code, see the Examples. This walk-through can be run from a notebook or interactive python session.
If you are using Anaconda or Miniconda, install Intake with the following commands:
conda install -c conda-forge intake
If you are using virtualenv/pip, run the following command:
pip install intake
Note that this will install with the minimum of optional requirements. If you want a more complete install, use intake[complete] instead.
Let's begin by creating a sample data set and catalog. At the command line, run the intake example command. This will create an example data Catalog and two CSV data files. These files contain some basic facts about the 50 US states, and the catalog includes a specification of how to load them.
Data sources can be created directly with the open_*() functions in the intake module. To read our example data:
>>> import intake
>>> ds = intake.open_csv('states_*.csv')
>>> print(ds)
<intake.source.csv.CSVSource object at 0x1163882e8>
Each open function has different arguments, specific for the data format or service being used.
Intake reads data into memory using containers you are already familiar with:
To find out what kind of container a data source will produce, inspect the container attribute:
>>> ds.container
'dataframe'
The result will be dataframe, ndarray, or python. (New container types will be added in the future.)
For data that fits in memory, you can ask Intake to load it directly:
>>> df = ds.read()
>>> df.head()
        state        slug code                nickname  ...
0     Alabama     alabama   AL      Yellowhammer State  ...
1      Alaska      alaska   AK       The Last Frontier  ...
2     Arizona     arizona   AZ  The Grand Canyon State  ...
3    Arkansas    arkansas   AR       The Natural State  ...
4  California  california   CA            Golden State  ...
Many data sources will also have quick-look plotting available. The attribute .plot will list a number of built-in plotting methods, such as .scatter(), see Plotting.
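For example, a quick-look scatter plot might be produced as follows. This is a sketch only: it assumes the optional hvplot/holoviews dependencies are installed, and the column names are taken from the example states data.
>>> ds.plot.scatter(x='population', y='population_rank')  # returns an interactive plot object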
Intake data sources can have partitions. A partition refers to a contiguous chunk of data that can be loaded independently of any other partition. The partitioning scheme is entirely up to the plugin author. In the case of the CSV plugin, each .csv file is a partition.
To read data from a data source one chunk at a time, the read_chunked() method returns an iterator:
>>> for chunk in ds.read_chunked():
...     print('Chunk: %d' % len(chunk))
...
Chunk: 24
Chunk: 26
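Individual partitions can also be fetched by index with read_partition(). A brief sketch, continuing the session above, where partition 0 corresponds to the first CSV file:
>>> first = ds.read_partition(0)  # pandas DataFrame containing only the rows of the first file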
Working with large datasets is much easier with a parallel, out-of-core computing library like Dask. Intake can create Dask containers (like dask.dataframe) from data sources that will load their data only when required:
>>> ddf = ds.to_dask()
>>> ddf
Dask DataFrame Structure:
              admission_date admission_number capital_city capital_url   code  ... twitter_url website
npartitions=2
                      object            int64       object      object object  ...      object  object
                         ...              ...          ...         ...    ...  ...         ...     ...
                         ...              ...          ...         ...    ...  ...         ...     ...
Dask Name: from-delayed, 4 tasks
The Dask containers will be partitioned in the same way as the Intake data source, allowing different chunks to be processed in parallel. Please read the Dask documentation to understand the differences when working with Dask collections (Bag, Array or Data-frames).
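For instance, a computation on the Dask dataframe above is only executed when explicitly requested; this sketch uses the population column from the example states data:
>>> ddf['population'].sum().compute()  # aggregates across both partitions in parallel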
A Catalog is an inventory of data sources, with the type and arguments prescribed for each, and arbitrary metadata about each source. In the simplest case, a catalog can be described by a file in YAML format, a "Catalog file". In real usage, catalogues can be defined in a number of ways, such as remote files, by connecting to a third-party data service (e.g., SQL server) or through an Intake Server protocol, which can implement any number of ways to search and deliver data sources.
The intake example command, above, created a catalog file with the following YAML-syntax content:
sources:
  states:
    description: US state information from [CivilServices](https://civil.services/)
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/states_*.csv'
    metadata:
      origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'
To load a Catalog from a Catalog file:
>>> cat = intake.open_catalog('us_states.yml')
>>> list(cat)
['states']
This catalog contains one data source, called states. It can be accessed by attribute:
>>> cat.states.to_dask()[['state','slug']].head()
        state        slug
0     Alabama     alabama
1      Alaska      alaska
2     Arizona     arizona
3    Arkansas    arkansas
4  California  california
Placing data source specifications into a catalog like this enables declaring data sets in a single canonical place, and not having to use boilerplate code in each notebook/script that makes use of the data. The catalogs can also reference one another, be stored remotely, and include extra metadata such as a set of named quick-look plots that are appropriate for the particular data source. Note that catalogs are not restricted to being stored in YAML files; that just happens to be the simplest way to display them.
Many catalog entries will also contain "user_parameter" blocks, which are indications of options explicitly allowed by the catalog author, or provide validation of the values passed. The user can customise how a data source is accessed by providing values for the user_parameters, overriding the arguments specified in the entry, or passing extra keyword arguments to be passed to the driver. The keywords that should be passed are limited to the user_parameters defined and the inputs expected by the specific driver - such usage is expected only from those already familiar with the specifics of the given format. In the following example, the user overrides the "csv_kwargs" keyword, which is described in the documentation for CSVSource and gets passed down to the CSV reader:
# pass extra kwargs understood by the csv driver
>>> intake.cat.states(csv_kwargs={'header': None, 'skiprows': 1}).read().head()
         0        1  ...                              17
0  Alabama  alabama  ...  https://twitter.com/alabamagov
1   Alaska   alaska  ...      https://twitter.com/alaska
Note that, if you are creating such catalogs, you may well start by trying the open_csv command, above, and then use print(ds.yaml()). If you do this now, you will see that the output is very similar to the catalog file we have provided.
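For example, the spec for the source opened earlier can be printed directly. The exact output depends on your Intake version, but it will look roughly like this:
>>> ds = intake.open_csv('states_*.csv')
>>> print(ds.yaml())
sources:
  csv:
    args:
      urlpath: 'states_*.csv'
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}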
Intake makes it possible to create Data packages (pip or conda) that install data sources into a global catalog. For example, we can install a data package containing the same data we have been working with:
conda install -c intake data-us-states
Conda installs the catalog file in this package to $CONDA_PREFIX/share/intake/us_states.yml. Now, when we import intake, we will see the data from this package appear as part of a global catalog called intake.cat. In this particular case we use Dask to do the reading (which can handle larger-than-memory data and parallel processing), but read() would work also:
>>> import intake
>>> intake.cat.states.to_dask()[['state','slug']].head()
        state        slug
0     Alabama     alabama
1      Alaska      alaska
2     Arizona     arizona
3    Arkansas    arkansas
4  California  california
The global catalog is a union of all catalogs installed in the conda/virtualenv environment and also any catalogs installed in user-specific locations.
Intake checks the Intake config file for catalog_path, or the environment variable "INTAKE_PATH", for a colon-separated list of paths (semicolon-separated on Windows) to search for catalog files. When you import intake, all entries from all of the referenced catalogs will appear as part of a global catalog called intake.cat.
A graphical data browser is available in the Jupyter notebook environment or standalone web-server. It will show the contents of any installed catalogs, plus allows for selecting local and remote catalogs, to browse and select entries from these. See GUI.
Here follows a list of specific things that people may want to get done, and details of how Intake can help. The details of how to achieve each of these activities can be found in the rest of the detailed documentation.
This is a very common pattern: if you want to load some specific data, you find someone, perhaps a colleague, who has accessed it before, and copy that code. Such a practice is extremely error prone, and causes a proliferation of copies of code, which may evolve over time, with various versions simultaneously in use.
Intake separates the concerns of data-source specification from code. The specs are stored separately, and all users can reference the one and only authoritative definition, whether in a shared file, a service visible to everyone or by using the Intake server. This spec can be updated so that everyone gets the current version instead of relying on outdated code.
Version control (e.g., using git) is an essential practice in modern software engineering and data science. It ensures that the change history is recorded, with times, descriptions and authors along with the changes themselves.
When data is specified using a well-structured syntax such as YAML, it can be checked into a version controlled repository in the usual fashion. Thus, you can bring rigorous practices to your data as well as your code.
If using conda packages to distribute data specifications, these come with a natural internal version numbering system, such that users need only do conda update ... to get the latest version.
Often, finding and grabbing data is a major hurdle to productivity. People may be required to download artifacts from various places or search through storage systems to find the specific thing that they are after. One-line commands which can retrieve data-source specifications or the files themselves can be a massive time-saver. Furthermore, each data-set will typically need its own code to be able to access it, and probably additional software dependencies.
Intake allows you to build conda packages, which can include catalog files referencing online resources, or to include data files directly in that package. Whether uploaded to anaconda.org or hosted on a private enterprise channel, getting the data becomes a single conda install ... command, whereafter it will appear as an entry in intake.cat. The conda package brings versioning and dependency declaration for free, and you can include any code that may be required for that specific data-set directly in the package too.
Individual data-sets often may be static, but commonly, the "best" data to get a job done changes with time as new facts emerge. Conversely, the very same data might be better stored in a different format which is, for instance, better-suited to parallel access in the cloud. In such situations, you really don't want to force all the data scientists who rely on it to have their code temporarily broken and be forced to change this code.
By working with a catalog file/service in a fixed shared location, it is possible to update the data source specs in-place. When users now run their code, they will get the latest version. Because all Intake drivers have the same API, the code using the data will be identical and not need to be changed, even when the format has been updated to something more optimised.
Services such as AWS S3, GCS and Azure Datalake (or private enterprise variants of these) are increasingly popular locations to amass large amounts of data. Not only are they relatively cheap per GB, but they provide long-term resilience, metadata services, complex access control patterns and can have very large data throughput when accessed in parallel by machines on the same architecture.
Intake comes with integration to cloud-based storage out-of-the box for most of the file-based data formats, to be able to access the data directly in-place and in parallel. For the few remaining cases where direct access is not feasible, the caching system in Intake allows for download of files on first use, so that all further access is much faster.
The era of Big Data is here! The term means different things to different people, but certainly implies that an individual data-set is too large to fit into the memory of a typical workstation computer (>>10GB). Nevertheless, most data-loading examples available use functions in packages such as pandas and expect to be able to produce in-memory representations of the whole data. This is clearly a problem, and a more general answer should be available aside from "get more memory in your machine".
Intake integrates with Dask and Spark, which both offer out-of-core computation (loading the data in chunks which fit in memory and aggregating results) or can spread their work over a cluster of machines, effectively making use of the shared memory resources of the whole cluster. Dask integration is built into the majority of the drivers and exposed with the .to_dask() method, and Spark integration is available for a small number of drivers with a similar .to_spark() method, as well as directly with the intake-spark package.
Intake also integrates with many data services which themselves can perform big-data computations, only extracting the smaller aggregated data-sets that do fit into memory for further analysis. Services such as SQL systems, Solr, Elasticsearch, Splunk, Accumulo and HBase can all distribute the work required to fulfill a query across many nodes of a cluster.
Browsing for the data-set which will solve a particular problem can be hard, even when the data have been curated and stored in a single, well-structured system. You do not want to rely on word-of-mouth to specify which data is right for which job.
Intake catalogs allow for self-description of data-sets, with simple text and arbitrary metadata, with a consistent access pattern. Not only can you list the data available to you, but you can find out what exactly that data represents, and the form the data would take if loaded (table versus list of items, for example). This extra metadata is also searchable: you can descend through a hierarchy of catalogs with a single search, and find all the entries containing some particular keywords.
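A sketch of the programmatic form of this search (the catalog name is hypothetical; the depth argument limits how far nested catalogs are descended):
import intake
cat = intake.open_catalog('master.yaml')   # hypothetical catalog containing nested catalogs
matches = cat.search('sea ice', depth=2)   # new catalog holding only the matching entries
list(matches)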
You can use the Intake GUI to graphically browse through your available data-sets or point to catalogs available to you, look through the entries listed there and get information about each, or even show a sample of the data or quick-look plots. The GUI is also able to execute searches and browse file-systems to find data artifacts of interest. This same functionality is also available via a command-line interface or programmatically.
Interacting with cloud storage resources is very convenient, but you will not want to download large amounts of data to your laptop or workstation for analysis. Intake finds itself at home in the remote-execution world of jupyter and Anaconda Enterprise and other in-browser technologies. For instance, you can run the Intake GUI either as a stand-alone application for browsing data-sets or in a notebook for full analytics, and have all the runtime live on a remote machine, or perhaps a cluster which is co-located with the data storage. Together with cloud-optimised data formats such as parquet, this is an ideal set-up for processing data at web scale.
A massive amount of data exists in human-readable formats such as JSON, XML and CSV, which are not very efficient in terms of space usage and need to be parsed on load to turn into arrays or tables. Much faster processing times can be had with modern compact, optimised formats, such as parquet.
Intake has a "persist" mechanism to transform any input data-source into the format most appropriate for that type of data, e.g., parquet for tabular data. The persisted data will be used in preference at analysis time, and the schedule for updating from the original source is configurable. The location of these persisted data-sets can be shared with others, so they can also gain the benefits, or the "export" variant can be used to produce an independent version in the same format, together with a spec to reference it by; you would then share this spec with others.
Security is important. Users' identity and authority to view specific data should be established before handing over any sensitive bytes. It is, unfortunately, all too common for data scientists to include their username, passwords or other credentials directly in code, so that it can run automatically, thus presenting a potential security gap.
Intake does not manage credentials or user identities directly, but does provide hooks for fetching details from the environment or another service, and using the values in templating at the time of reading the data. Thus, the details are not included in the code, but every access still requires them to be present.
In other cases, you may want to require the user to provide their credentials every time, rather than establishing them automatically, and "user parameters" can be specified in Intake to cover this case.
The Intake server protocol allows you fine-grained control over the set of data sources that are listed, and exactly what to return to a user when they want to read some of that data. This is an ideal opportunity to include authorisation checks, audit logging, and any more complicated access patterns, as required.
By streaming the data through a single channel on the server, rather than allowing users direct access to the data storage backend, you can log and verify all access to your data.
It is desirable to separate out two tasks: the definition of data-source specifications, and accessing and using data. This is so that those who understand the origins of the data and the implications of various formats and other storage options (such as chunk-size) should make those decisions and encode what they have done into specs. It leaves the data users, e.g., data scientists, free to find and use the data-sets appropriate for their work and simply get on with their job - without having to learn about various storage formats and access APIs.
This separation is at the very core of what Intake was designed to do.
Data formats and services are a wide mess of many libraries and APIs. A large amount of time can be wasted in the life of a data scientist or engineer in finding out the details of the ones required by their work. Intake wraps these various libraries, REST APIs and similar, to provide a consistent experience for the data user. source.read() will simply get all of the data into memory in the container type for that source - no further parameters or knowledge required.
Even for the curator of data catalogs or data driver authors, the framework established by Intake provides a lot of convenience and simplification which allows each person to deal with only the specifics of their job.
Having a bunch of files in some directory is a very common pattern for data storage in the wild. There may or may not be a README file co-located giving some information in a human-readable form, but generally not structured - such files are usually different in every case.
When a data source is encoded into a catalog, the spec offers a natural place to describe what that data is, along with the possibility to provide an arbitrary amount of structured metadata and to describe any parameters that are to be exposed for user choice. Furthermore, Intake data sources each have a particular container type, so that users know whether to expect a dataframe, array, etc., and simple introspection methods like describe and discover which return basic information about the data without having to load all of it into memory first.
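For example, a sketch continuing the states example; the exact keys returned may vary between Intake versions:
import intake

source = intake.open_csv('states_*.csv')
source.discover()    # reads a small sample: dtypes, shape and number of partitions
source.describe()    # static information: container type, description, user parameters, metadata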
Usually, the set of data sources held by an organisation have relationships to one another, and would be poorly served to be provided as a simple flat list of everything available. Intake allows catalogs to refer to other catalogs. This means, that you can group data sources by various facets (type, department, time...) and establish hierarchical data-source trees within which to find the particular data most likely to be of interest. Since the catalogs live outside and separate from the data files themselves, as many hierarchy structures as thought useful could be created.
For even more complicated data source meta-structures, it is possible to store all the details and even metadata in some external service (e.g., traditional SQL tables) with which Intake can interact to perform queries and return particular subsets of the available data sources.
There are already several catalog-like data services in existence in the world, and an organisation may have several of these in-house for various different purposes. For example, an SQL server may hold details of customer lists and transactions, but historical time-series and reference data may be held separately in archival data formats like parquet on a file-storage system; while real-time system monitoring is done by a totally unrelated system such as Splunk or Elasticsearch.
Of course, Intake can read from various file formats and data services. However, it can also interpret the internal conception of data catalogs that some data services may have. For example, all of the tables known to the SQL server, or all of the pre-defined queries in Splunk can be automatically included as catalogs in Intake, and take their place amongst the regular YAML-specified data sources, with exactly the same usage for all of them.
These data sources and their hierarchical structure can then be exposed via the graphical data browser, for searching, selecting and visualising data-sets.
Intake is integrated with the comprehensive holoviz suite, particularly hvplot, to bring simple yet powerful data visualisations to any Intake data source by using just one single method for everything. These plots are interactive, and can include server-side dynamic aggregation of very large data-sets to display more data points than the browser can handle.
You can specify specific plot types right in the data source definition, to have these customised visualisations available to the user as simple one-liners known to reveal the content of the data, or even view the same visuals right in the graphical data source browser application. Thus, Intake is already an all-in-one data investigation and dashboarding app.
Intake data catalogs are not limited to reading static specification from files. They can also execute queries on remote data services and return lists of data sources dynamically at runtime. New data sources may appear, for example, as directories of data files are pushed to a storage service, or new tables are created within a SQL server.
Sometimes, the well-known data formats are just not right for a given data-set, and a custom-built format is required. In such cases, the code to read the data may not exist in any library. Intake allows for code to be distributed along with data source specs/catalogs or even files in a single conda package. That encapsulates everything needed to describe and use that particular data, and can then be distributed as a single entity, and installed with a one-liner.
Furthermore, should the few builtin container types (sequence, array, dataframe) not be sufficient, you can supply your own, and then build drivers that use it. This was done, for example, for xarray-type data, where multiple related N-D arrays share a coordinate system and metadata. By creating this container, a whole world of scientific and engineering data was opened up to Intake. Creating new containers is not hard, though, and we foresee more coming, such as machine-learning models and streaming/real-time data.
If you have a set of files or a data service which you wish to make into a data-set, so that you can include it in a catalog, you should use the set of functions intake.open_*, where you need to pick the function appropriate for your particular data. You can use tab-completion to list the set of data drivers you have installed, and find others you may not yet have installed at Plugin Directory. Once you have determined the right set of parameters to load the data in the manner you wish, you can use the source's .yaml() method to find the spec that describes the source, so you can insert it into a catalog (with appropriate description and metadata). Alternatively, you can open a YAML file as a catalog with intake.open_catalog and use its .add() method to insert the source into the corresponding file.
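A sketch of that workflow (the catalog file name is hypothetical):
import intake

ds = intake.open_csv('states_*.csv')      # tune arguments until the data loads the way you want
print(ds.yaml())                          # spec text to paste into a catalog, with description/metadata
cat = intake.open_catalog('mycat.yaml')   # an existing YAML catalog
cat.add(ds)                               # or append the source spec to that file programmatically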
If, instead, you have data in your session in one of the containers supported by Intake (e.g., array, data-frame), you can use the intake.upload() function to save it to files in an appropriate format and a location you specify, and give you back a data-source instance, which, again, you can use with .yaml() or .add(), as above.
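A sketch of this, with a hypothetical target location:
import intake
import pandas as pd

df = pd.DataFrame({'state': ['Alabama', 'Alaska'], 'code': ['AL', 'AK']})
source = intake.upload(df, 'shared_data/')   # writes files in an appropriate format and returns a source
print(source.yaml())                         # spec ready for .add() or pasting into a catalog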
This page describes the technical design of Intake, with brief details of the aims of the project and the components of the library.
Intake solves a related set of problems:
Intake has the explicit goal of not defining a computational expression system. Intake plugins load the data into containers (e.g., arrays or data-frames) that provide their data processing features. As a result, it is very easy to make a new Intake plugin with a relatively small amount of Python.
Intake is a Python library for accessing data in a simple and uniform way. It consists of three parts:
1. A lightweight plugin system for adding data loader drivers for new file formats and servers (like databases, REST endpoints or other cataloging services)
2. A cataloging system for specifying these sources in simple YAML syntax, or with plugins that read source specs from some external data service
3. A server-client architecture that can share data catalog metadata over the network, or even stream the data directly to clients if needed
Intake supports loading data into standard Python containers. The list can be easily extended, but the currently supported list is:
Additionally, Intake can load data into distributed data structures. Currently it supports Dask, a flexible parallel computing library with distributed containers like dask.dataframe, dask.array, and dask.bag. In the future, other distributed computing systems could use Intake to create similar data structures.
Intake is built out of four core concepts:
The business of a plugin is to go from some data format (bunch of files or some remote service) to a "Container" of the data (e.g., data-frame), a thing on which you can perform further analysis. Drivers can be used directly by the user, or indirectly through data catalogs. Data sources can be pickled, sent over the network to other hosts, and reopened (assuming the remote system has access to the required files or servers).
See also the Glossary.
Ongoing work for enhancements, as well as requests for plugins, etc., can be found at the issue tracker. See the Roadmap for general mid- and long-term goals.
Here we list links to notebooks and other code demonstrating the use of Intake in various scenarios. The first section is of general interest to various users, and the sections that follow tend to be more specific about particular features and workflows.
Many of the entries here include a link to Binder, which is a service that lets you execute code live in a notebook environment. This is a great way to experience using Intake. It can take a while, sometimes, for Binder to come up; please have patience.
See also the examples repository, containing data-sets which can be built and installed as conda packages.
Tutorials delving deeper into the Internals of Intake, for those who wish to contribute
More specific examples of Intake functionality
These are Intake-related articles that may be of interest.
In the following sections, we will describe some of the ways in which Intake is used in real production systems. These go well beyond the typical YAML files presented in the quickstart and examples sections, which are necessarily short and simple, and do not demonstrate the full power of Intake.
This is the simplest scenario, and amply described in these documents. The primary advantage is simplicity: it is enough to put a file in an accessible place (even a gist or repo), in order for someone else to be able to discover and load that data. Furthermore, such files can easily refer to one another, to build up a full tree of data assets with minimum pain. Since YAML files are text, this also lends itself to working well with version control systems. Furthermore, all sources can describe themselves as YAML, and the export and upload commands can produce an efficient format (possibly remote) together with a YAML definition in a single step.
The Pangeo collaboration uses Intake to catalog their data holdings, which are generally in various forms of netCDF-compliant formats, massive multi-dimensional arrays with data relating to earth and climate science and meteorology. On their cloud-based platform, containers start up jupyter-lab sessions which have Intake installed, and therefore can simply pick and load the data that each researcher needs - often requiring large Dask clusters to actually do the processing.
A static rendering of the catalog contents is available, so that users can browse the holdings without even starting a python session. This rendering is produced by CI on the repo whenever new definitions are added, and it also checks (using Intake) that each definition is indeed loadable.
Pangeo also developed intake-stac, which can talk to STAC servers to make real-time queries and parse the results into Intake data sources. This is a standard for spatio-temporal data assets, and indexes massive amounts of cloud-stored data.
Intake will be the basis of the data access and cataloging service within Anaconda Enterprise, running as a micro-service in a container, and offering data source definitions to users. The access control, who gets to see which data-set, and serving of credentials to be able to read from the various data storage services, will all be handled by the platform and be fully configurable by admins.
NCAR has developed intake-esm, a mechanism for creating file-based Intake catalogs for climate data from project efforts such as the Coupled Model Intercomparison Project (CMIP) and the Community Earth System Model (CESM) Large Ensemble Project. These projects produce a huge amount of climate data, persisted on tape and disk storage components across a very large number (of the order of ~300,000) of netCDF files. Finding, investigating and loading these files into data-array containers such as xarray can be a daunting task due to the large number of files a user may be interested in. Intake-esm addresses this issue in three steps:
Dataset Catalog Curation
import intake

cat = intake.open_esm_metadatastore(catalog_input_definition="GLADE-CMIP5")
sub_cat = cat.search(variable=['hfls'], frequency='mon', modeling_realm='atmos',
                     institute=['CCCma', 'CNRM-CERFACS'])
dsets = sub_cat.to_xarray(decode_times=True, chunks={'time': 50})
The Bluesky project uses Intake to dynamically query a MongoDB instance, which holds the details of experimental and simulation data catalogs, to return a custom Catalog for every query. Data-sets can then be loaded into python, or the original raw data can be accessed ...
Zillow is developing Intake to meet the needs of their datalake access layer (DAL), to encapsulate the highly hierarchical nature of their data. Of particular importance is the ability to provide different versions (testing/production, and different storage formats) of the same logical dataset, depending on whether it is being read on a laptop versus the production infrastructure ...
The server protocol (see Server Protocol) is simple enough that anyone can write their own implementation with fully customised behaviour. In particular, auth and monitoring would be essential for a production-grade deployment.
More detailed information about specific parts of Intake, such as how to author catalogs, how to use the graphical interface, plotting, etc.
Note: the GUI requires panel and bokeh to be available in the current environment.
The Intake top-level singleton intake.gui gives access to a graphical data browser within the Jupyter notebook. To expose it, simply enter it into a code cell (Jupyter automatically displays the last object in a code cell). [image]
New instances of the GUI are also available by instantiating intake.interface.gui.GUI, where you can specify a list of catalogs to initially include.
The GUI contains three main areas:
Selecting a catalog from the list will display nested catalogs below the parent and display source entries from the catalog in the list of sources.
Below the lists of catalogs is a row of buttons that are used for adding, removing and searching-within catalogs:
The Add button (+) exposes a sub-panel with two main ways to add catalogs to the interface: [image]
This panel has a tab to load files from the local filesystem; from there you can navigate around the filesystem using the arrows or by editing the path directly. Use the home button to get back to the starting place. Select the catalog file you need. Use the "Add Catalog" button to add the catalog to the list above. [image]
Another tab loads a catalog from remote. Any URL is valid here, including cloud locations, "gcs://bucket/...", and intake servers, "intake://server:port". Without a protocol specifier, this can be a local path. Again, use the "Add Catalog" button to add the catalog to the list above. [image]
Finally, you can add catalogs to the interface in code, using the .add() method, which can take filenames, remote URLs or existing Catalog instances.
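For example (paths and URLs hypothetical):
import intake

intake.gui.add('cat.yaml')                        # local catalog file
intake.gui.add('gcs://bucket/cat.yaml')           # remote URL
intake.gui.add(intake.open_catalog('cat.yaml'))   # an existing Catalog instance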
The Remove button (-) deletes the currently selected catalog from the list. It is important to note that this action does not have any impact on files, it only affects what shows up in the list. [image]
The sub-panel opened by the Search button (🔍) allows the user to search within the selected catalog [image]
In the Search sub-panel the user enters free-form text. Since some catalogs contain nested sub-catalogs, the Depth selector allows the search to be limited to the stated number of nesting levels. This may be necessary, since, in theory, catalogs can contain circular references, and therefore allow for infinite recursion. [image]
Upon execution of the search, the currently selected catalog will be searched. Entries will be considered to match if any of the entered words is found in the description of the entry (this is case-insensitive). If any matches are found, a new entry will be made in the catalog list, with the suffix "_search". [image]
Selecting a source from the list updates the description text on the left-side of the gui.
Below the list of sources is a row of buttons for inspecting the selected data source:
The Plot button (📊) opens a sub-panel with an area for viewing pre-defined plots. [image]
These plots are specified in the catalog yaml and that yaml can be displayed by checking the box next to "show yaml". [image]
The holoviews object can be retrieved from the gui using intake.interface.source.plot.pane.object, and you can then use it in Python or export it to a file.
If you have installed the optional extra packages dfviz and xrviz, you can interactively plot your dataframe or array data, respectively. [image]
The button "customize" will be available for data sources of the appropriate type. Click this to open the interactive interface. If you have not selected a predefined plot (or there are none), then the interface will start without any prefilled values, but if you do first select a plot, then the interface will have its options pre-filled from the options
For specific instructions on how to use the interfaces (which can also be used independently of the Intake GUI), please navigate to the linked documentation.
Note that the final parameters that are sent to hvPlot to produce the output each time a plot is updated are explicitly available in YAML format, so that you can save the state as a "predefined plot" in the catalog. The same set of parameters can also be used in code, with datasource.plot(...). [image]
Once catalogs are loaded and the desired sources have been identified and selected, the selected sources will be available at the .sources attribute (intake.gui.sources). Each source entry has informational methods available and can be opened as a data source, as with any catalog entry:
In [ ]: source_entry = intake.gui.sources[0]
        source_entry
Out   : name: sea_ice_origin
        container: dataframe
        plugin: ['csv']
        description: Arctic/Antarctic Sea Ice
        direct_access: forbid
        user_parameters: []
        metadata:
        args:
          urlpath: https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv

In [ ]: data_source = source_entry()  # may specify parameters here
        data_source.read()
Out   : < some data >

In [ ]: source_entry.plot()  # or skip data source step
Out   : < graphics >
Data catalogs provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries. A catalog entry for a dataset includes information like:
In addition, Intake allows the arguments to data sources to be templated, with the variables explicitly expressed as "user parameters". The given arguments are rendered using jinja2, the values of the named user parameters, and any overrides. The parameters also offer validation of the allowed types and values, for both the template values and the final arguments passed to the data source. The parameters are named and described, to indicate to the user what they are for. This kind of structure can be used to, for example, choose between two parts of a given data source, like "latest" and "stable"; see the entry1_part entry in the example below.
The user of the catalog can always override any template or argument value at the time that they access a given source.
In Intake, a Catalog instance is an object with one or more named entries. The entries might be read from a static file (e.g., YAML, see the next section), from an Intake server or from any other data service that has a driver. Drivers which create catalogs are ordinary DataSource classes, except that they have the container type "catalog", and do not return data products via the read() method.
For example, you might choose to instantiate the base class and fill in some entries explicitly in your code:
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry

mycat = Catalog.from_dict({
    'source1': LocalCatalogEntry(name, description, driver, args=...),
    ...
})
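A concrete version of the same idea might look like the following sketch, using the example CSV data from the quickstart:
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry

mycat = Catalog.from_dict({
    'states': LocalCatalogEntry(
        'states', 'US state information', 'csv',
        args={'urlpath': 'states_*.csv'},
    ),
})
list(mycat)   # ['states']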
Alternatively, subclasses of Catalog can define how entries are created from whichever file format or service they interact with, examples including RemoteCatalog and SQLCatalog. These generate entries based on their respective targets; some provide advanced search capabilities executed on the server.
Intake catalogs can most simply be described with YAML files. This is very common in the tutorials and in this documentation, because it is simple to understand and demonstrates many of the features of Intake. Note that YAML files are also the easiest way to share a catalog, simply by copying to a publicly-available location such as a cloud storage bucket. Here is an example:
metadata:
  version: 1
  parameters:
    file_name:
      type: str
      description: default file name for child entries
      default: example_file_name
sources:
  example:
    description: test
    driver: random
    args: {}

  entry1_full:
    description: entry1 full
    metadata:
      foo: 'bar'
      bar: [1, 2, 3]
    driver: csv
    args: # passed to the open() method
      urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'

  entry1_part:
    description: entry1 part
    parameters: # User parameters
      part:
        description: section of the data
        type: str
        default: "stable"
        allowed: ["latest", "stable"]
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'

  entry2:
    description: entry2
    driver: csv
    args:
      # file_name parameter will be inherited from file-level parameters, so will
      # default to "example_file_name"
      urlpath: '{{ CATALOG_DIR }}/entry2/{{ file_name }}.csv'
Arbitrary extra descriptive information can go into the metadata section. Some fields will be claimed for internal use and some fields may be restricted to local reading; but for now the only field that is expected is version, which will be updated when a breaking change is made to the file format. Any catalog will have .metadata and .version attributes available.
Note that each source also has its own metadata.
The metadata section can also contain parameters, which will be inherited by the sources in the file (note that these sources can augment these parameters, or override them with their own parameters).
The driver: entry of a data source specification can be a driver name, as has been shown in the examples so far. It can also be an absolute class path to use for the data source, in which case there will be no ambiguity about how to load the data. That is the preferred way to be explicit, when the driver name alone is not enough (see Driver Selection, below).
plugins:
  source:
    - module: intake.catalog.tests.example1_source
sources:
  ...
However, you do not, in general, need to do this, since the driver: field of each source can also explicitly refer to the plugin class.
The majority of a catalog file is composed of data sources, which are named data sets that can be loaded for the user. Catalog authors describe the contents of data set, how to load it, and optionally offer some customization of the returned data. Each data source has several attributes:
This method of defining the cache with a dedicated block is deprecated, see the Remote Access section, below
To enable caching on the first read of remote data source files, add the cache section with the following attributes:
Example:
test_cache:
  description: cache a csv file from the local filesystem
  driver: csv
  cache:
    - argkey: urlpath
      type: file
  args:
    urlpath: '{{ CATALOG_DIR }}/cache_data/states.csv'
The cache_dir defaults to ~/.intake/cache, and can be specified in the intake configuration file or INTAKE_CACHE_DIR environment variable, or at runtime using the "cache_dir" key of the configuration. The special value "catdir" implies that cached files will appear in the same directory as the catalog file in which the data source is defined, within a directory named "intake_cache". These will not appear in the cache usage reported by the CLI.
Optionally, the cache section can have a regex attribute, which modifies the path of the cache on the disk. By default, the cache path is made by concatenating cache_dir, the dataset name, the hash of the url, and the url itself (without the protocol). The regex attribute allows one to remove part of the url (the matching part).
Caching can be disabled at runtime for all sources regardless of the catalog specification:
from intake.config import conf
conf['cache_disabled'] = True
By default, progress bars are shown during downloads if the package tqdm is available, but this can be disabled (e.g., for consoles that don't support complex text) via the Intake configuration or, equivalently, the environment variable INTAKE_CACHE_PROGRESS.
The "types" of caching are that supported are listed in intake.source.cache.registry, see the docstrings of each for specific parameters that should appear in the cache block.
It is possible to work with compressed source files by setting type: compression in the cache specification. By default the compression type is inferred from the file extension, otherwise it can be set by assigning the decomp variable to any of the options listed in intake.source.decompress.decomp. This will extract all the file(s) in the compressed file referenced by urlpath and store them in the cache directory.
In cases where miscellaneous files are present in the compressed file, a regex_filter parameter can be used. Only the extracted filenames that match the pattern will be loaded. The cache path is appended to the filename so it is necessary to include a wildcard to the beginning of the pattern.
Example:
test_compressed:
  driver: csv
  args:
    urlpath: 'compressed_file.tar.gz'
  cache:
    - type: compressed
      decomp: tgz
      argkey: urlpath
      regex_filter: '.*data.csv'
Intake catalog files support Jinja2 templating for driver arguments. Any occurrence of a substring like {{field}} will be replaced by the value of the user parameters with that same name, or the value explicitly provided by the user. For how to specify these user parameters, see the next section.
Some additional values are available for templating. The following is always available: CATALOG_DIR, the full path to the directory containing the YAML catalog file. This is especially useful for constructing paths relative to the catalog directory to locate data files and custom drivers. For example, the search for CSV files for the two "entry1" blocks, above, will happen in the same directory as where the catalog file was found.
The following functions may be available. Since these execute code, the user of a catalog may decide whether they trust those functions or not.
The reason for the "client" versions of the functions is to prevent leakage of potentially sensitive information between client and server by controlling where lookups happen. When working without a server, only the ones without "client" are used.
An example:
sources:
  personal_source:
    description: This source needs your username
    args:
      url: "http://server:port/user/{{env(USER)}}"
Here, if the user is named "blogs", the url argument will resolve to "http://server:port/user/blogs"; if the environment variable is not defined, it will resolve to "http://server:port/user/".
A source definition can contain a "parameters" block. Expressed in YAML, a parameter may look as follows:
parameters:
  name:
    description: name to use   # human-readable text for what this parameter means
    type: str                  # one of bool, str, int, float, list[str | int | float], datetime, mlist
    default: normal            # optional, value to assume if user does not override
    allowed: ["normal", "strange"]  # optional, list of values that are OK, for validation
    min: "n"                   # optional, minimum allowed, for validation
    max: "t"                   # optional, maximum allowed, for validation
A parameter, not to be confused with an argument, can have one of two uses:
Note: the datetime type accepts multiple values: Python datetime, ISO8601 string, Unix timestamp int, "now" and "today".
You can also define user parameters at the catalog level. This applies the parameter to all entries within that catalog, without having to define it for each and every entry. Furthermore, catalogs nested within the catalog will also inherit the parameter(s).
For example, with the following spec
metadata:
  version: 1
  parameters:
    bucket:
      type: str
      description: description
      default: test_bucket
sources:
  param_source:
    driver: parquet
    description: description
    args:
      urlpath: s3://{{bucket}}/file.parquet
  subcat:
    driver: yaml_file
    args:
      path: "{{CATALOG_DIR}}/other.yaml"
If cat is the corresponding catalog instance, the URL of source cat.param_source will evaluate to "s3://test_bucket/file.parquet" by default, but the parameter can be overridden with cat.param_source(bucket="other_bucket"). Also, any entries of subcat, another catalog referenced from here, would also have the "bucket"-named parameter attached to all sources. Of course, those sources do not need to make use of the parameter.
To change the default, we can generate a new instance:
cat2 = cat(bucket="production")           # sets default value of "bucket" for cat2
subcat = cat.subcat(bucket="production")  # sets default only for the nested catalog
Of course, in these situations you can still override the value of the parameter for any source, or pass explicit values for the arguments of the source, as normal.
For cases where the catalog is not defined in a YAML spec, the argument user_parameters to the constructor takes the same form as parameters above: a dict of user parameters, either as UserParameter instances or as a dictionary spec for each one.
Template functions can also be used in parameters (see Templating, above), but you can use the available functions directly without the extra {{...}}.
For example, this catalog entry uses the env("HOME") functionality as described to set a default based on the user's home directory.
sources:
  variabledefault:
    description: "This entry leads to an example csv file in the user's home directory by default, but the user can pass root='somepath' to override that."
    driver: csv
    args:
      path: "{{root}}/example.csv"
    parameters:
      root:
        description: "root path"
        type: str
        default: "env(HOME)"
In some cases, it may be possible that multiple backends are capable of loading from the same data format or service. Sometimes, this may mean two drivers with unique names, or a single driver with a parameter to choose between the different backends.
However, it is possible that multiple drivers for reading a particular type of data also share the same driver name: for example, both the intake-iris and the intake-xarray packages contain drivers with the name "netcdf", which are capable of reading the same files, but with different backends. Here we will describe the various possibilities of coping with this situation. Intake's plugin system makes it easy to encode such choices.
It may be acceptable to use any driver which claims to handle that data type, or to give the option of which driver to use to the user, or it may be necessary to specify which precise driver(s) are appropriate for that particular data. Intake allows all of these possibilities, even if the backend drivers require extra arguments.
Specifying a single driver explicitly, rather than using a generic name, would look like this:
sources:
  example:
    description: test
    driver: package.module.PluginClass
    args: {}
It is also possible to describe a list of drivers with the same syntax. The first one found will be the one used. Note that the class imports will only happen at data source instantiation, i.e., when the entry is selected from the catalog.
sources:
  example:
    description: test
    driver:
      - package.module.PluginClass
      - another_package.PluginClass2
    args: {}
These alternative plugins can also be given data-source specific names, allowing the user to choose at load time with driver= as a parameter. Additional arguments may also be required for each option (which, as usual, may include user parameters); however, the same global arguments will be passed to all of the drivers listed.
sources:
  example:
    description: test
    driver:
      first:
        class: package.module.PluginClass
        args:
          specific_thing: 9
      second:
        class: another_package.PluginClass2
    args: {}
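With a definition like the one above, the choice can then be made when the source is instantiated (a sketch):
source = cat.example(driver='second')   # use another_package.PluginClass2 rather than the default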
(see also Remote Data for the implementation details)
Many drivers support reading directly from remote data sources such as HTTP, S3 or GCS. In these cases, the path to read from is usually given with a protocol prefix such as gcs://. Additional dependencies will typically be required (requests, s3fs, gcsfs, etc.), any data package should specify these. Further parameters may be necessary for communicating with the storage backend and, by convention, the driver should take a parameter storage_options containing arguments to pass to the backend. Some remote backends may also make use of environment variables or config files to determine their default behaviour.
The special template variable "CATALOG_DIR" may be used to construct relative URLs in the arguments to a source. In such cases, if the filesystem used to load that catalog contained arguments, then the storage_options of that file system will be extracted and passed to the source. Therefore, all sources which can accept general URLs (beyond just local paths) must make sure to accept this argument.
As an example of using storage_options, the following two sources would allow for reading CSV data from S3 and GCS backends without authentication (anonymous access), respectively
sources:
  s3_csv:
    driver: csv
    description: "Publicly accessible CSV data on S3; requires s3fs"
    args:
      urlpath: s3://bucket/path/*.csv
      storage_options:
        anon: true
  gcs_csv:
    driver: csv
    description: "Publicly accessible CSV data on GCS; requires gcsfs"
    args:
      urlpath: gcs://bucket/path/*.csv
      storage_options:
        token: "anon"
Using S3 Profiles
An AWS profile may be specified as an argument under storage_options via the following format:
args:
  urlpath: s3://bucket/path/*.csv
  storage_options:
    profile: aws-profile-name
URLs interpreted by fsspec offer automatic caching. For example, to enable file-based caching for the first source above, you can do:
sources:
  s3_csv:
    driver: csv
    description: "Publicly accessible CSV data on S3; requires s3fs"
    args:
      urlpath: simplecache::s3://bucket/path/*.csv
      storage_options:
        s3:
          anon: true
Here we have added the "simplecache" to the URL (this caching backend does not store any metadata about the cached file) and specified that the "anon" parameter is meant as an argument to s3, not to the caching mechanism. As each file in s3 is accessed, it will first be downloaded and then the local version used instead.
You can tailor how the caching works. In particular the location of the local storage can be set with the cache_storage parameter (under the "simplecache" group of storage_options, of course) - otherwise they are stored in a temporary location only for the duration of the current python session. The cache location is particularly useful in conjunction with an environment variable, or relative to "{{CATALOG_DIR}}", wherever the catalog was loaded from.
Please see the fsspec documentation for the full set of cache types and their various options.
A Catalog can be loaded from a YAML file on the local filesystem by creating a Catalog object:
from intake import open_catalog
cat = open_catalog('catalog.yaml')
Then sources can be listed:
list(cat)
and data sources are loaded via their name:
data = cat.entry_part1
and you can optionally configure new instances of the source to define user parameters or override arguments by calling either of:
data = cat.entry_part1.configure_new(part='1')
data = cat.entry_part1(part='1')  # this is a convenience shorthand
Intake also supports loading a catalog from all of the files ending in .yml and .yaml in a directory, or by using an explicit glob-string. Note that the URL provided may refer to a remote storage system by passing a protocol specifier such as s3:// or gcs://:
cat = open_catalog('/research/my_project/catalog.d/')
Intake Catalog objects will automatically reload changes or new additions to catalog files and directories on disk. These changes will not affect already-opened data sources.
A catalog is just another type of data source for Intake. For example, you can print a YAML specification corresponding to a catalog as follows:
cat = intake.open_catalog('cat.yaml')
print(cat.yaml())
results in:
sources:
  cat:
    args:
      path: cat.yaml
    description: ''
    driver: intake.catalog.local.YAMLFileCatalog
    metadata: {}
The point here is that this can be included in another catalog. (It would, of course, be better to include a description and the full path of the catalog file here.) If the entry above were saved to another file, "root.yaml", and the original catalog contained an entry, data, you could access it as:
root = intake.open_catalog('root.yaml')
root.cat.data
It is, therefore, possible to build up a hierarchy of catalogs referencing each other. These can, of course, include remote URLs and indeed catalog sources other than simple files (all the tables on a SQL server, for instance). Plus, since the argument and parameter system also applies to entries such as the example above, it would be possible to give the user a runtime choice of multiple catalogs to pick between, or have this decision depend on an environment variable.
Intake also includes a server which can share an Intake catalog over HTTP (or HTTPS with the help of a TLS-enabled reverse proxy). From the user perspective, remote catalogs function identically to local catalogs:
cat = open_catalog('intake://catalog1:5000')
list(cat)
The difference is that operations on the catalog translate to requests sent to the catalog server. Catalog servers provide access to data sources in one of two modes: direct access, in which the client receives the driver name and arguments and loads the data itself, or proxied access, in which the data is read on the server and streamed to the client.
Whether a particular catalog entry supports direct or proxied access is determined by the direct_access option (forbid, allow or force):
Note that when the client is loading a data source via direct access, the catalog server will need to send the driver arguments to the client. Do not include sensitive credentials in a data source that allows direct access.
Intake servers can check if clients are authorized to access the catalog as a whole, or individual catalog entries. Typically a matched pair of a server-side plugin (called an "auth plugin") and a client-side plugin (called a "client auth plugin") need to be enabled for authorization checks to work. This feature is still in early development, but see the module intake.auth.secret for a demonstration pair of server and client classes implementing auth via a shared secret. See Authorization Plugins.
The package installs two executable commands: intake-server, for starting the catalog server; and intake, a client for accessing catalogs and manipulating the configuration.
A file-based configuration service is available to Intake. This file is by default sought at the location ~/.intake/conf.yaml, but either of the environment variables INTAKE_CONF_DIR or INTAKE_CONF_FILE can be used to specify another directory or file. If both are given, the latter takes priority.
At present, the configuration file might look as follows:
auth:
  cls: "intake.auth.base.BaseAuth"
port: 5000
catalog_path:
  - /home/myusername/special_dir
These are the defaults, and any parameters not specified will take the values above.
See intake.config.defaults for a full list of keys and their default values.
The logging level is configurable using Python's built-in logging module.
The config option 'logging' holds the current level for the intake logger, and can take values such as 'INFO' or 'DEBUG'. This can be set in the conf.yaml file of the config directory (e.g., ~/.intake/), or overridden by the environment variable INTAKE_LOG_LEVEL.
Furthermore, the level and settings of the logger can be changed programmatically in code:
import logging
logger = logging.getLogger('intake')
logger.setLevel(logging.DEBUG)
logger.addHandler(..)
The server takes one or more catalog files as input and makes them available on port 5000 by default.
You can see the full description of the server command with:
>>> intake-server --help
usage: intake-server [-h] [-p PORT] [--list-entries] [--sys-exit-on-sigterm]
                     [--flatten] [--no-flatten] [-a ADDRESS]
                     FILE [FILE ...]

Intake Catalog Server

positional arguments:
  FILE                  Name of catalog YAML file

optional arguments:
  -h, --help            show this help message and exit
  -p PORT, --port PORT  port number for server to listen on
  --list-entries        list catalog entries at startup
  --sys-exit-on-sigterm
                        internal flag used during unit testing to ensure
                        .coverage file is written
  --flatten
  --no-flatten
  -a ADDRESS, --address ADDRESS
                        address to use as a host, defaults to the address in
                        the configuration file, if provided otherwise localhost
To start the server with a local catalog file, use the following:
>>> intake-server intake/catalog/tests/catalog1.yml
Creating catalog from:
  - intake/catalog/tests/catalog1.yml
catalog_args ['intake/catalog/tests/catalog1.yml']
Entries: entry1,entry1_part,use_example1
Listening on port 5000
You can use the catalog client (defined below) using:
$ intake list intake://localhost:5000
entry1
entry1_part
use_example1
While the Intake data sources will typically be accessed through the Python API, you can use the client to verify a catalog file.
Unlike the server command, the client has several subcommands to access a catalog. You can see the list of available subcommands with:
>>> intake --help
usage: intake {list,describe,exists,get,discover} ...
We go into further detail in the following sections.
This subcommand lists the names of all available catalog entries. This is useful since other subcommands require these names.
If you wish to see the details about each catalog entry, use the --full flag. This is equivalent to running the intake describe subcommand for all catalog entries.
>>> intake list --help
usage: intake list [-h] [--full] URI

positional arguments:
  URI         Catalog URI

optional arguments:
  -h, --help  show this help message and exit
  --full
>>> intake list intake/catalog/tests/catalog1.yml
entry1
entry1_part
use_example1
>>> intake list --full intake/catalog/tests/catalog1.yml
[entry1] container=dataframe
[entry1] description=entry1 full
[entry1] direct_access=forbid
[entry1] user_parameters=[]
[entry1_part] container=dataframe
[entry1_part] description=entry1 part
[entry1_part] direct_access=allow
[entry1_part] user_parameters=[{'default': '1', 'allowed': ['1', '2'], 'type': u'str', 'name': u'part', 'description': u'part of filename'}]
[use_example1] container=dataframe
[use_example1] description=example1 source plugin
[use_example1] direct_access=forbid
[use_example1] user_parameters=[]
Given the name of a catalog entry, this subcommand lists the details of the respective catalog entry.
>>> intake describe --help
usage: intake describe [-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  -h, --help  show this help message and exit
>>> intake describe intake/catalog/tests/catalog1.yml entry1
[entry1] container=dataframe
[entry1] description=entry1 full
[entry1] direct_access=forbid
[entry1] user_parameters=[]
Given the name of a catalog entry, this subcommand returns a key-value description of the data source. The exact details are subject to change.
>>> intake discover --help
usage: intake discover [-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  -h, --help  show this help message and exit
>>> intake discover intake/catalog/tests/catalog1.yml entry1
{'npartitions': 2, 'dtype': dtype([('name', 'O'), ('score', '<f8'), ('rank', '<i8')]), 'shape': (None,), 'datashape': None, 'metadata': {'foo': 'bar', 'bar': [1, 2, 3]}}
Given the name of a catalog entry, this subcommand returns whether or not the respective catalog entry is valid.
>>> intake exists --help
usage: intake exists [-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  -h, --help  show this help message and exit
>>> intake exists intake/catalog/tests/catalog1.yml entry1
True
>>> intake exists intake/catalog/tests/catalog1.yml entry2
False
Given the name of a catalog entry, this subcommand outputs the entire data source to standard output.
>>> intake get --help
usage: intake get [-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  -h, --help  show this help message and exit
>>> intake get intake/catalog/tests/catalog1.yml entry1
       name  score  rank
0    Alice1  100.5     1
1      Bob1   50.3     2
2  Charlie1   25.0     3
3      Eve1   25.0     3
4    Alice2  100.5     1
5      Bob2   50.3     2
6  Charlie2   25.0     3
7      Eve2   25.0     3
CLI functions starting with intake cache and intake config are available to provide information about the system: the locations and value of configuration parameters, and the state of cached files.
(this is an experimental new feature, expect enhancements and changes)
As defined in the glossary, to Persist is to convert data into the storage format most appropriate for the container type, and save a copy of this for rapid lookup in the future. This is of great potential benefit where the creation or transfer of the original data source takes some time.
This is not to be confused with the file Cache.
Any Data Source has a method .persist(). The only option that you will need to pick is a TTL, the number of seconds that the persisted version lasts before expiry (leave as None for no expiry). This creates a local copy in the persist directory, which is "~/.intake/persist" by default but can be configured.
Each container type (dataframe, array, ...) will have its own implementation of persistence, and a particular file storage format associated. The call to .persist() may take arguments to tune how the local files are created, and in some cases may require additional optional packages to be installed.
Example:
cat = intake.open_catalog('mycat.yaml')  # load a remote cat
source = cat.csvsource()  # source pointing to remote data
source.persist()
source = cat.csvsource()  # future use now gives local intake_parquet.ParquetSource
You can control whether a catalog automatically gives you the persisted version of a source in this way using the argument persist_mode. For example, to ignore locally persisted versions, you could have done:
cat = intake.open_catalog('mycat.yaml', persist_mode='never')

or

source = cat.csvsource(persist_mode='never')
Note that if you give a TTL (in seconds), then the original source will be accessed and a new persisted version written transparently when the old persisted version has expired.
Note that after persisting, the original source will have source.has_been_persisted == True and the persisted source (i.e., the one loaded from local files) will have source.is_persisted == True.
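A minimal sketch of this round trip, assuming a catalog 'mycat.yaml' with a source called csvsource (both hypothetical, as above) and assuming .persist() accepts the TTL as the ttl keyword:

import intake

cat = intake.open_catalog('mycat.yaml')
source = cat.csvsource()
source.persist(ttl=3600)  # keep the persisted copy for one hour before expiry

fresh = cat.csvsource()  # a new instance, served from the persisted copy
print(source.has_been_persisted)  # True on the source that was persisted
print(fresh.is_persisted)  # True on the source loaded from local files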
A similar concept to Persist, Export allows you to make a copy of some data source, in the format appropriate for its container, and place this data-set in whichever location suits you, including remote locations. This functionality (source.export()) does not touch the persist store; instead, it returns a YAML text representation of the output, so that you can put it into a catalog of your own. It would be this catalog that you share with other people.
Note that "exported" data-sources like this do contain the information of the original source they were made from in their metadata, so you can recreate the original source, if you want to, and read from there.
If you typically run your code inside of ephemeral containers, then persisting data-sets may be something that you want to do (because the original source is slow, or parsing is CPU/memory intensive), but local storage is not useful. In some cases you may have access to shared network storage mounted on the instance, but in other cases you will want to persist to a remote store.
The config value 'persist_path', which can also be set by the environment variable INTAKE_PERSIST_PATH, can be a remote location such as s3://mybucket/intake-persist. You will need to install the appropriate package to talk to the external storage (e.g., s3fs, gcsfs, pyarrow), but otherwise everything should work as before, and you can access the persisted data from any container.
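For example, a sketch of setting the persist location at runtime (the bucket name is hypothetical, and the appropriate filesystem package such as s3fs must be installed):

from intake.config import conf

conf['persist_path'] = 's3://mybucket/intake-persist'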
You can interact directly with the class implementing persistence:
from intake.container.persist import store
This singleton instance, which acts like a catalog, allows you to query the contents of the persist store and to add and remove entries. It also allows you to find the original source for any given persisted source, and refresh the persisted version on demand.
For details on the methods of the persist store, see the API documentation: intake.container.persist.PersistStore(). Sources in the store carry a lot of information about the sources they were made from, so that they can be remade successfully. This all appears in the source metadata. The sources use the "token" of the original data source as their key in the store, a value which can be found by dask.base.tokenize(source) for the original source, or can be taken from the metadata of a persisted source.
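A small sketch of poking at the store (assuming source is a data source you persisted earlier in the session):

from dask.base import tokenize
from intake.container.persist import store

print(list(store))  # the store behaves like a catalog, so its entries can be listed
tok = tokenize(source)  # the key under which the persisted copy is registered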
Note that all of the information about persisted sources is held in a single YAML file in the persist directory (typically /persisted/cat.yaml within the config directory, but see intake.config.conf['persist_path']). This file can be edited by hand if you wanted to, for example, set some persisted source not to expire. This is only recommended for experts.
Intake provides a plotting API based on the hvPlot library, which closely mirrors the pandas plotting API but generates interactive plots using HoloViews and Bokeh.
The hvPlot website provides comprehensive documentation on using the plotting API to quickly visualize and explore small and large datasets. The main features offered by the plotting API include:
Using Intake alongside hvPlot allows plot declarations and default options to be stored declaratively in the regular catalog.yaml files.
For detailed installation instructions see the getting started section in the hvPlot documentation. To start, install hvplot using conda:
conda install -c conda-forge hvplot
or using pip:
pip install hvplot
The plotting API is designed to work well both inside and outside the Jupyter notebook; however, when using it in JupyterLab, the PyViz lab extension must be installed first:
jupyter labextension install @pyviz/jupyterlab_pyviz
For detailed instructions on displaying plots in the notebook and from the Python command prompt see the hvPlot user guide.
Assuming the US Crime dataset has been installed (from the intake-examples repo, or from conda with conda install -c intake us_crime), the plotting API can be used via the .plot method on an Intake DataSource:
import intake
import hvplot as hp

crime = intake.cat.us_crime
columns = ['Burglary rate', 'Larceny-theft rate', 'Robbery rate', 'Violent Crime rate']

violin = crime.plot.violin(y=columns, group_label='Type of crime',
                           value_label='Rate per 100k', invert=True)
hp.show(violin)
Inside the notebook, plots will display themselves, but the notebook extension must be loaded first. The extension may be loaded by importing the hvplot.intake module, by explicitly loading the HoloViews extension, or by calling intake.output_notebook():
# To load the extension run this import
import hvplot.intake

# Or load the holoviews extension directly
import holoviews as hv
hv.extension('bokeh')

# convenience function
import intake
intake.output_notebook()

crime = intake.cat.us_crime
columns = ['Violent Crime rate', 'Robbery rate', 'Burglary rate']
crime.plot(x='Year', y=columns, value_label='Rate (per 100k people)')
Some catalogs will define plots appropriate to a specific data source. These will be specified such that the user gets the right view with the right columns and labels, without having to investigate the data in detail -- this is ideal for quick-look plotting when browsing sources.
import intake

intake.cat.us_crime.plots

returns ['example']. This works whether accessing the entry object or the source instance. To visualise this predefined plot:

intake.cat.us_crime.plot.example()
Intake allows catalog yaml files to declare metadata fields for each data source which are made available alongside the actual dataset. The plotting API reserves certain fields to define default plot options, to label and annotate the data fields in a dataset and to declare pre-defined plots.
The first set of metadata used by the plotting API is the plot field in the metadata section. Any options found in this field will apply to all plots generated from that data source, allowing the definition of plotting defaults. For example, when plotting a fairly large dataset such as the NYC Taxi data, it might be desirable to enable datashader by default, ensuring that any plot that supports it is datashaded. The syntax to declare default plot options is as follows:
sources:
nyc_taxi:
description: NYC Taxi dataset
driver: parquet
args:
urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
metadata:
plot:
datashade: true
The columns of a CSV or parquet file, or the coordinates and data variables in a NetCDF file, often have shortened or cryptic names with underscores. They also do not provide additional information about the units of the data or the range of values; therefore the catalog yaml specification also provides the ability to define additional information about the fields in a dataset.
Valid attributes that may be defined for the data fields include:
Just like the default plot options the fields may be declared under the metadata section of a data source:
sources:
nyc_taxi:
description: NYC Taxi dataset
driver: parquet
args:
urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
metadata:
fields:
dropoff_x:
label: Longitude
dropoff_y:
label: Latitude
total_fare:
label: Fare
unit: $
As shown in the hvPlot user guide, the plotting API provides a variety of plot types, which can be declared using the kind argument or via convenience methods on the plotting API, e.g. cat.source.plot.scatter(). In addition to declaring default plot options and field metadata, data sources may also declare custom plots, which will be made available as methods on the plotting API. In this way a catalog may declare any number of custom plots alongside a data source.
To make this more concrete consider the following custom plot declaration on the plots field in the metadata section:
sources:
nyc_taxi:
description: NYC Taxi dataset
driver: parquet
args:
urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
metadata:
plots:
dropoff_scatter:
kind: scatter
x: dropoff_x
y: dropoff_y
datashade: True
width: 800
height: 600
This declarative specification creates a new custom plot called dropoff_scatter, which will be available on the catalog under cat.nyc_taxi.plot.dropoff_scatter(). Calling this method on the plot API will automatically generate a datashaded scatter plot of the dropoff locations in the NYC taxi dataset.
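A brief usage sketch (assuming the catalog above has been saved as 'catalog.yaml'):

import intake

cat = intake.open_catalog('catalog.yaml')
scatter = cat.nyc_taxi.plot.dropoff_scatter()  # datashaded scatter of dropoff locations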
Of course the three metadata fields may also be used together, declaring global defaults under the plot field, annotations for the data fields under the fields key and custom plots via the plots field.
This is a list of known projects which install driver plugins for Intake, and the named drivers each contains in parentheses:
The status of these projects is available at Status Dashboard.
Don't see your favorite format? See Making Drivers for how to create new plugins.
Note that if you want your plugin listed here, open an issue in the Intake issue repository and add an entry to the status dashboard repository. We also have a plugin wishlist Github issue that shows the breadth of plugins we hope to see for Intake.
This page gives deeper details on how the Intake server is implemented. For those simply wishing to run and configure a server, see the Command Line Tools section.
Communication between the intake client and server happens exclusively over HTTP, with all parameters passed using msgpack UTF8 encoding. The server side is implemented by the module intake.cli.server. Currently, only the following two routes are available:
The server may be configured to use auth services, which, when passed the header of the incoming call, can determine whether the given request is allowed. See Authorization Plugins.
Retrieve information about the data-sets available on this server. The list of data-sets may be paginated, in order to avoid excessively long transactions. Notice that the catalog for which a listing is being requested can itself be a data-source (when source-id is passed) - this is how nested sub-catalogs are handled on the server.
Fetch information about a specific source. This is the random-access variant of the GET /info route, by which a particular data-source can be accessed without paginating through all of the sources.
Same as one of the entries in sources for GET /info: the result of .describe() on the given data-source in the server
Searching a Catalog returns search results in the form of a new Catalog. This "results" Catalog is cached on the server the same as any other Catalog.
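From the client side, the same operation looks like the following sketch (the server address and search text are arbitrary examples):

import intake

cat = intake.open_catalog('intake://catalog1:5000')
results = cat.search('taxi')  # a new Catalog containing only the matching entries
list(results)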
This is a more involved processing of a data-source, and, if successful, returns one of two possible scenarios:
The set of parameters supplied and the server/client policies will define which method of access is employed. In the case of remote-access, the data source is instantiated on the server, and .discover() run on it. The resulting information is passed back, and must be enough to instantiate a subclass of intake.container.base.RemoteSource appropriate for the container of the data-set in question (e.g., RemoteArray when container="ndarray"). In this case, the response also includes a UUID string for the open instance on the server, referencing the cache of open sources maintained by the server.
Note that "opening" a data entry which is itself is a catalog implies instantiating that catalog object on the server and returning its UUID, such that a listing can be made using GET/ info or GET /source.
If direct-access, the driver plugin name and set of arguments for instantiating the data-source in the client.
If remote-access, the data-source container, schema and source-ID so that further reads can be made from the server.
This route fetches data from the server once a data-source has been opened in remote-access mode.
aka. derived datasets.
WARNING:
Intake allows for the definition of data sources which take another source in the same catalog as their input, so that you have the opportunity to present processing to the user of the catalog.
The "target" or a derived data source will normally be a string. In the simple case, it is the name of a data source in the same catalog. However, we use the syntax "catalog:source" to refer to sources in other catalogs, where the part before ":" will be passed to intake.open_catalog(), together with any keyword arguments from cat_kwargs.
This can be done by defining classes which inherit from intake.source.derived.DerivedSource, or using one of the pre-defined classes in the same module, which usually need to be passed a reference to a function in a python module. We will demonstrate both.
Consider the following target dataset, which loads some simple facts about US states from a CSV file. This example is taken from the Intake test suite.
We now show two ways to apply a super-simple transform to this data, which selects two of the dataframe's columns.
The first version uses an approach in which the transform is encoded in a data source class, and the parameters passed are specific to the transform type. Note that the driver is referred to by its fully-qualified name in the Intake package.
The source class for this is included in the Intake codebase, but the important part is:
class Columns(DataFrameTransform):
...
def pick_columns(self, df):
return df[self._params["columns"]]
We see that this specific class inherits from DataFrameTransform, with transform=self.pick_columns. We know that the inputs and outputs are both dataframes. This allows for some additional validation and an automated way to infer the output dataframe's schema, which reduces the number of lines of code required.
The given method does exactly what you might imagine: it takes an input dataframe and applies a column selection to it.
Running cat.derive_cols.read() will indeed, as expected, produce a version of the data with only the selected columns included. It does this by defining the original dataset, applying the selection, and then getting Dask to generate the output. For some datasets, this can mean that the selection is pushed down to the reader, and the data for the dropped columns is never loaded. The user may choose to do .to_dask() instead, and manipulate the lazy dataframe directly, before loading.
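A usage sketch (the catalog filename here is hypothetical; it is assumed to define both input_data and derive_cols):

import intake

cat = intake.open_catalog('derived.yaml')
df = cat.derive_cols.read()  # pandas DataFrame with only the selected columns
ddf = cat.derive_cols.to_dask()  # or work with the lazy Dask dataframe first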
This second version of the same output uses the more generic and flexible intake.source.derived.DataFrameTransform.
derive_cols_func:
driver: intake.source.derived.DataFrameTransform
args:
targets:
- input_data
transform: "intake.source.tests.test_derived._pick_columns"
transform_kwargs:
columns: ["state", "slug"]
In this case, we pass a reference to a function defined in the Intake test suite. Normally this would be declared in user modules, where perhaps those declarations and catalog(s) are distributed together as a package.
def _pick_columns(df, columns):
return df[columns]
This is, of course, very similar to the method shown in the previous section, and again applies the selection in the given named argument to the input. Note that Intake does not support including actual code in your catalog, since we would not want to allow arbitrary execution of code at catalog load time; only references to functions defined elsewhere are permitted.
Loading this data source proceeds exactly the same way as the class-based approach, above. Both Dask and in-memory (Pandas, via .read()) methods work as expected. The declaration in YAML, above, is slightly more verbose, but the amount of code is smaller. This demonstrates a tradeoff between flexibility and concision. If there were validation code to add for the arguments or input dataset, it would be less obvious where to put these things.
The previous two examples both did dataframe to dataframe transforms. However, totally arbitrary computations are possible. Consider the following:
barebones:
driver: intake.source.derived.GenericTransform
args:
targets:
- input_data
transform: builtins.len
transform_kwargs: {}
This applies len to the input dataframe. cat.barebones.describe() gives the output container type as "other", i.e., not specified. The result of read() on this gives the single number 50, the number of rows in the input data. This class, and DerivedSource, are included with the intent of being used as superclasses, and probably will not often be used directly.
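A quick sketch of exercising this entry (continuing the hypothetical catalog from above):

import intake

cat = intake.open_catalog('derived.yaml')
print(cat.barebones.describe())  # reports container "other"
print(cat.barebones.read())  # 50, the number of rows in the input data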
None of the above examples specified explicitly where the compute implied by the transformation will take place. However, most Intake drivers support in-memory containers and Dask; remember that the input dataset here is a dataframe. The behaviour is defined in the driver class itself, so it would be fine to write a driver which makes different assumptions. Suppose, for instance, that the original source is to be loaded from Spark (see the intake-spark package): the driver could explicitly call .to_spark on the original source and be assured that it has a Spark object to work with. It should, of course, explain in its documentation what assumptions are being made and that, presumably, the user is expected to also call .to_spark if they wish to manipulate the Spark object directly.
intake.source.derived.DerivedSource(*args, ...) | Base source deriving from another source in the same catalog |
intake.source.derived.GenericTransform(...) | |
intake.source.derived.DataFrameTransform(...) | Transform where the input and output are both Dask-compatible dataframes |
intake.source.derived.Columns(*args, **kwargs) | Simple dataframe transform to pick columns |
Target picking and parameter validation are performed here, but you probably want to subclass from one of the more specific classes like DataFrameTransform.
This derives from GenericTransform, and you must supply transform and any transform_kwargs.
Given as an example of how to make a specific dataframe transform. Note that you could use DataFrameTransform directly, by writing a function to choose the columns instead of a method as here.
Auto-generated reference
These are reference class and function definitions likely to be useful to everyone.
intake.open_catalog([uri]) | Create a Catalog object |
intake.registry | Dict of driver: DataSource class |
intake.register_driver(name, value[, ...]) | Add runtime driver definition |
intake.upload(data, path, **kwargs) | Given a concrete data object, store it at given location return Source |
intake.source.csv.CSVSource(*args, **kwargs) | Read CSV files into dataframes |
intake.source.textfiles.TextFilesSource(...) | Read textfiles as sequence of lines |
intake.source.jsonfiles.JSONFileSource(...) | Read JSON files as a single dictionary or list |
intake.source.jsonfiles.JSONLinesFileSource(...) | Read a JSONL (https://jsonlines.org/) file and return a list of objects, each being valid json object (e.g. |
intake.source.npy.NPySource(*args, **kwargs) | Read numpy binary files into an array |
intake.source.zarr.ZarrArraySource(*args, ...) | Read Zarr format files into an array |
intake.catalog.local.YAMLFileCatalog(*args, ...) | Catalog as described by a single YAML file |
intake.catalog.local.YAMLFilesCatalog(*args, ...) | Catalog as described by multiple YAML files |
intake.catalog.zarr.ZarrGroupCatalog(*args, ...) | A catalog of the members of a Zarr group. |
Can load YAML catalog files, connect to an intake server, or create any arbitrary Catalog subclass instance. In the general case, the user should supply driver= with a value from the plugins registry which has a container type of catalog. File locations can generally be remote, if specifying a URL protocol.
The default behaviour if not specifying the driver is as follows:
SEE ALSO:
Use this function to publicly share data which you have created in your python session. Intake will try each of the container types, to see if one of them can handle the input data, and write the data to the path given, in the format most appropriate for the data type, e.g., parquet for pandas or dask data-frames.
With the DataSource instance you get back, you can add this to a catalog, or just get the YAML representation for editing (.yaml()) and sharing.
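A short sketch of this workflow (the target path here is hypothetical):

import intake
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
source = intake.upload(df, '/tmp/shared_data')  # picks a container and writes the data
print(source.yaml())  # YAML block you can paste into a catalog of your own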
Prototype of sources reading dataframe data
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
Zarr is a numerical array storage format which works particularly well with remote and parallel access. For specifics of the format, see https://zarr.readthedocs.io/en/stable/
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
Prototype of sources reading sequential data.
Takes a set of files, and returns an iterator over the text in each of them. The files can be local or remote. Extra parameters for encoding, etc., go into storage_options.
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
The files can be local or remote. Extra parameters for encoding, etc., go into storage_options.
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
Prototype source showing example of working with arrays
Each file becomes one or more partitions, but partitioning within a file is only along the largest dimension, to ensure contiguous data.
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
This is a reference API class listing, useful mainly for developers.
intake.source.base.DataSourceBase(*args, ...) | An object which can produce data |
intake.source.base.DataSource(*args, **kwargs) | A Data Source with all optional functionality |
intake.source.base.PatternMixin() | Helper class to provide file-name parsing abilities to a driver class |
intake.container.base.RemoteSource(*args, ...) | Base class for all DataSources living on an Intake server |
intake.catalog.Catalog(*args, **kwargs) | Manages a hierarchy of data sources as a collective unit. |
intake.catalog.entry.CatalogEntry(*args, ...) | A single item appearing in a catalog |
intake.catalog.local.UserParameter(*args, ...) | A user-settable item that is passed to a DataSource upon instantiation. |
intake.auth.base.BaseAuth(*args, **kwargs) | Base class for authorization |
intake.source.cache.BaseCache(driver, spec) | Provides utilities for managing cached data files. |
intake.source.base.Schema(**kwargs) | Holds details of data description for any type of data-source |
intake.container.persist.PersistStore(*args, ...) | Specialised catalog for persisted data-sources |
When subclassed, child classes will have the base data source functionality, plus caching, plotting and persistence abilities.
A catalog is a set of available data sources for an individual entity (remote server, local file, or a local directory of files). This can be expanded to include a collection of subcatalogs, which are then managed as a single unit.
A catalog is created with a single URI or a group of URIs. A URI can either be a URL or a file path.
Each catalog in the hierarchy is responsible for caching the most recent refresh time to prevent overeager queries.
Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.
Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.
WARNING:
NOTE:
Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.
Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.
This relies on the _entries attribute being mutable, which it normally is. Note that if a catalog automatically reloads, any entry removed here may soon reappear
Note that this is not the same as .yaml(), which produces a YAML block referring to this catalog.
This is the base class, used by local entries (i.e., read from a YAML file) and by remote entries (read from a server).
Equivalent to calling the catalog entry like a function.
Note: entry(), entry.attr, entry[item] check for persisted sources, but directly calling .get() will always ignore the persisted store (equivalent to self._pmode=='never').
For string parameters, default may include special functions func(args), which may be expanded from environment variables or by executing a shell command.
Subclass this and override the methods to implement a new type of auth.
This basic class allows all access.
Returns the value if key match is found, otherwise default.
Providers of caching functionality should derive from this, and appear as entries in registry. The principle methods to override are _make_files() and _load() and _from_metadata().
This should always be pickleable, so that it can be sent from a server to a client, and contain all information needed to recreate a RemoteSource on the client.
Strings are assumed to already be a token; if source or entry, see if it is a persisted thing ("original_tok" is in its metadata), else generate its own token.
Will return True if the source is not in the store at all, if its TTL is set to None, or if more seconds have passed than the TTL.
intake.source.cache.FileCache(driver, spec) | Cache specific set of files |
intake.source.cache.DirCache(driver, spec[, ...]) | Cache a complete directory tree |
intake.source.cache.CompressedCache(driver, spec) | Cache files extracted from downloaded compressed source |
intake.source.cache.DATCache(driver, spec[, ...]) | Use the DAT protocol to replicate data |
intake.source.cache.CacheMetadata(*args, ...) | Utility class for managing persistent metadata stored in the Intake config directory. |
Input is a single file URL, URL with glob characters or list of URLs. Output is a specific set of local files.
Input is a directory root URL, plus a depth parameter for how many levels of subdirectories to search. All regular files will be copied. Output is the resultant local directory tree.
For one or more remote compressed files, downloads to local temporary dir and extracts all contained files to local cache. Input is URL(s) (including globs) pointing to remote compressed files, plus optional decomp, which is "infer" by default (guess from file extension) or one of the key strings in intake.source.decompress.decomp. Optional regex_filter parameter is used to load only the extracted files that match the pattern. Output is the list of extracted files.
For details of the protocol, see https://docs.datproject.org/. The executable dat must be available.
Since in this case, it is not possible to access the remote files directly, this cache mechanism takes no parameters. The expectation is that the url passed by the driver is of the form:
dat://<dat hash>/file_pattern
where the file pattern will typically be a glob string like "*.json".
intake.auth.secret.SecretAuth(*args, **kwargs) | A very simple auth mechanism using a shared secret |
intake.auth.secret.SecretClientAuth(secret) | Matching client auth plugin to SecretAuth |
intake.container.dataframe.RemoteDataFrame(...) | Dataframe on an Intake server |
intake.container.ndarray.RemoteArray(*args, ...) | nd-array on an Intake server |
intake.container.semistructured.RemoteSequenceSource(...) | Sequence-of-things source on an Intake server |
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
intake.cli.server.server.IntakeServer(catalog) | Main intake-server tornado application |
intake.cli.server.server.ServerInfoHandler(...) | Basic info about the server |
intake.cli.server.server.SourceCache() | Stores DataSources requested by some user |
intake.cli.server.server.ServerSourceHandler(...) | Open or stream data source |
The requests "action" field (open|read) specified what the request wants to do. Open caches the source and created an ID for it, read uses that ID to reference the source and read a partition.
This is for direct access to an entry by name for random access, which is useful to the client when the whole catalog has not first been listed and pulled locally (e.g., in the case of pagination).
Released on August 26, 2022.
Released on January 9, 2022.
The goal of the Intake plugin system is to make it very simple to implement a Driver for a new data source, without any special knowledge of Dask or the Intake catalog system.
Although Intake is very flexible about data, there are some basic assumptions that a driver must satisfy.
Intake currently supports 3 kinds of containers, representing the most common data models used in Python: dataframe, ndarray, and python (a sequence of arbitrary Python objects).
Although a driver can load any type of data into any container, and new container types can be added to the list above, it is reasonable to expect that the number of container types remains small. Declaring a container type is only informational for the user when read locally, but streaming of data from a server requires that the container type be known to both server and client.
A given driver must only return one kind of container. If a file format (such as HDF5) could reasonably be interpreted as two different data models depending on usage (such as a dataframe or an ndarray), then two different drivers need to be created with different names. If a driver returns the python container, it should document what Python objects will appear in the list.
The source of data should be essentially permanent and immutable. That is, loading the data should not destroy or modify the data, nor should closing the data source destroy the data either. When a data source is serialized and sent to another host, it will need to be reopened at the destination, which may cause queries to be re-executed and files to be reopened. Data sources that treat readers as "consumers" and remove data once read will cause erratic behavior, so Intake is not suitable for accessing things like FIFO message queues.
The schema of a data source is a detailed description of the data, which can be known by loading only metadata or by loading only some small representative portion of the data. It is information to present to the user about the data that they are considering loading, and may be important in the case of server-client communication. In the latter context, the contents of the schema must be serializable by msgpack (i.e., numbers, strings, lists and dictionaries only).
There may be unknown parts of the schema before the whole data is read. Drivers may require this missing information in the __init__() method (or the catalog spec), or do some kind of partial data inspection to determine the schema; or, more simply, unknown values may be given as None. Regardless of the method used, the time spent figuring out the schema ahead of time should be short and should not scale with the size of the data.
Typical fields in a schema dictionary are npartitions, dtype, shape, etc., which will be more appropriate for some drivers/data-types than others.
Data sources are assumed to be partitionable. A data partition is a randomly accessible fragment of the data. In the case of sequential and data-frame sources, partitions are numbered, starting from zero, and correspond to contiguous chunks of data divided along the first dimension of the data structure. In general, any partitioning scheme is conceivable, such as a tuple-of-ints to index the chunks of a large numerical array.
Not all data sources can be partitioned. For example, file formats without sufficient indexing often can only be read from beginning to end. In these cases, the DataSource object should report that there is only 1 partition. However, it often makes sense for a data source to be able to represent a directory of files, in which case each file will correspond to one partition.
Once opened, a DataSource object can have arbitrary metadata associated with it. The metadata for a data source should be a dictionary that can be serialized as JSON. This metadata comes from the following sources:
From the user perspective, all of the metadata should be loaded once the data source has loaded the rest of the schema (after discover(), read(), to_dask(), etc have been called).
Every Intake driver class should be a subclass of intake.source.base.DataSource. The class should have the following attributes to identify itself: name, version, container and partition_access (all shown in the example below).
The __init__() method should always accept a keyword argument metadata, a dictionary of metadata from the catalog to associate with the source. This dictionary must be serializable as JSON.
The DataSourceBase class has a small number of methods which should be overridden. Here is an example producing a data-frame:
class FooSource(intake.source.base.DataSource):
    container = 'dataframe'
    name = 'foo'
    version = '0.0.1'
    partition_access = True

    def __init__(self, a, b, metadata=None):
        # Do init here with a and b
        super(FooSource, self).__init__(
            metadata=metadata
        )

    def _get_schema(self):
        return intake.source.base.Schema(
            datashape=None,
            dtype={'x': "int64", 'y': "int64"},
            shape=(None, 2),
            npartitions=2,
            extra_metadata=dict(c=3, d=4)
        )

    def _get_partition(self, i):
        # Return the appropriate container of data here
        return pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

    def read(self):
        self._load_metadata()
        return pd.concat([self.read_partition(i) for i in range(self.npartitions)])

    def _close(self):
        # close any files, sockets, etc
        pass
Most of the work typically happens in the following methods:
The full set of user methods of interest are as follows:
Note that all of these methods typically call _get_schema, to make sure that the source has been initialised.
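A brief sketch of using such a class directly (the values for a and b are hypothetical):

src = FooSource(1, 2)
print(src.discover())  # triggers _get_schema and reports dtype, shape, npartitions
df = src.read()  # concatenates all partitions into a single DataFrame
part = src.read_partition(0)  # a single partition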
DataSource provides the same functionality as DataSourceBase, but has some additional mixin classes to provide some extras. A developer may choose to derive from DataSource to get all of these, or from DataSourceBase and make their own choice of mixins to support.
Intake discovers available drivers in three different ways, described below. After the discovery phase, Intake will automatically create open_[driver_name] convenience functions under the intake module namespace. Calling a function like open_csv() is equivalent to instantiating the corresponding data-source class.
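For example, a driver registered under the name "foo" (as in the sketch earlier; assuming it has actually been registered via one of the mechanisms below) would be reachable in either of these equivalent ways:

import intake

src = intake.open_foo(1, 2)  # generated convenience function
src = intake.registry['foo'](1, 2)  # equivalent: instantiate the class from the registry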
If you are packaging your driver into an installable package to be shared, you should add the following to the package's setup.py:
setup(
    ...
    entry_points={
        'intake.drivers': [
            'some_format_name = some_package.and_maybe_a_submodule:YourDriverClass',
            ...
        ]
    },
)
IMPORTANT:
Entry points are a way for Python packages to advertise objects with some common interface. When Intake is imported, it discovers all packages installed in the current environment that advertise 'intake.drivers' in this way.
Most packages that define intake drivers have a dependency on intake itself, for example in order to use intake's base classes. This can create a circular dependency: importing the package imports intake, which tries to discover and import packages that define drivers. To avoid this pitfall, just ensure that intake is imported first thing in your package's __init__.py. This ensures that the driver-discovery code runs first. Note that you are not required to make your package depend on intake. The rule is that if you import intake you must import it first thing. If you do not import intake, there is no circularity.
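A minimal sketch of such an __init__.py (the package, module and class names here are hypothetical):

# some_package/__init__.py
import intake  # import intake first so driver discovery runs before anything else
from .mysource import YourDriverClass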
The intake configuration file can be used to:
The commandline invocation
intake drivers enable some_format_name some_package.and_maybe_a_submodule.YourDriverClass
is equivalent to adding this to your intake configuration file:
drivers:
some_format_name: some_package.and_maybe_a_submodule.YourDriverClass
You can also disable a troublesome driver:
intake drivers disable some_format_name
which is equivalent to
drivers:
some_format_name: false
When Intake is imported, it will search the Python module path (by default this includes site-packages and other directories in your $PYTHONPATH) for packages starting with intake_ and discover DataSource subclasses inside those packages to register. Drivers will be registered based on the name attribute of the object. By convention, drivers should have names that are lowercase, valid Python identifiers that do not contain the word intake.
This approach is deprecated because it is limiting (requires the package to begin with "intake_") and because the package scan can be slow. Using entrypoints is strongly encouraged. The package scan may be disabled by default in some future release of intake. During the transition period, if a package named intake_* provides an entrypoint for a given name, that will take precedence over any drivers gleaned from the package scan having that name. If intake discovers any names from the package scan for which there are no entrypoints, it will issue a FutureWarning.
For drivers loading from files, the author should be aware that it is easy to implement loading from files stored in remote services. A simplistic case is demonstrated by the included CSV driver, which simply passes a URL to Dask, which in turn can interpret the URL as a remote data service, and use the storage_options as required (see the Dask documentation on remote data).
More advanced usage, where a Dask loader does not already exist, will likely rely on fsspec.open_files. Use this function to produce lazy OpenFile objects for local or remote data, based on a URL, which will have a protocol designation and possibly contain glob "*" characters. Additional parameters may be passed to open_files, which should, by convention, be supplied by a driver argument named storage_options (a dictionary).
To use an OpenFile object, make it concrete by using a context:
# at setup, to discover the number of files/partitions
set_of_open_files = fsspec.open_files(urlpath, mode='rb', **storage_options)

# when actually loading data; here we loop over all files,
# but maybe we just do one partition
for an_open_file in set_of_open_files:
    # `with` causes the object to become concrete until the end of the block
    with an_open_file as f:
        # do things with f, which is a file-like object
        f.seek(0)
        f.read()
The textfiles builtin driver implements this mechanism, as an example.
The CSV driver sets up an example of how to gather data which is encoded in file paths like ('data_{site}_.csv') and return that data in the output. Other drivers could also follow the same structure where data is being loaded from a set of filenames. Typically this would apply to data-frame output. This is possible as long as the driver has access to each of the file paths at some point in _get_schema. Once the file paths are known, the driver developer can use the helper functions defined in intake.source.utils to get the values for each field in the pattern for each file in the list. These values should then be added to the data, a process which normally would happen within the _get_schema method.
The PatternMixin defines driver properties such as urlpath, path_as_pattern, and pattern. The implementation might look something like this:
from intake.source.utils import reverse_formats

class FooSource(intake.source.base.DataSource, intake.source.base.PatternMixin):
    def __init__(self, a, b, path_as_pattern, urlpath, metadata=None):
        # Do init here with a and b
        self.path_as_pattern = path_as_pattern
        self.urlpath = urlpath
        super(FooSource, self).__init__(
            container='dataframe',
            metadata=metadata
        )

    def _get_schema(self):
        # read in the data
        values_by_field = reverse_formats(self.pattern, file_paths)
        # add these fields and map values to the data
        return data
Since Dask already has a specific method for including the file paths in the output dataframe, in the CSV driver we set include_path_column=True to get a dataframe where one of the columns contains all the file paths. In this case, "add these fields and map values to the data" is a mapping between the categorical file-paths column and the values_by_field.
In other drivers, where each file is read in independently, the driver developer can set the new fields on the data from each file before concatenating. That pattern looks more like:
from intake.source.utils import reverse_format

class FooSource(intake.source.base.DataSource):
    ...
    def _get_schema(self):
        # get list of file paths
        for path in file_paths:
            # read in the file
            values_by_field = reverse_format(self.pattern, path)
            # add these fields and values to the data
        # concatenate the datasets
        return data
To toggle this path-as-pattern behaviour on and off, the CSV and intake-xarray drivers use the boolean path_as_pattern keyword argument.
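A quick user-side sketch with the built-in CSV driver (the file pattern is hypothetical; files such as data_nyc.csv and data_sf.csv are assumed to exist):

import intake

ds = intake.open_csv('data_{site}.csv', path_as_pattern=True)
df = ds.read()
print(df['site'].unique())  # the values parsed out of the file names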
Authorization plugins are classes that can be used to customize access permissions to the Intake catalog server. The Intake server and client communicate over HTTP, so when security is a concern, the most important step to take is to put a TLS-enabled reverse proxy (like nginx) in front of the Intake server to encrypt all communication.
Whether or not the connection is encrypted, the Intake server by default allows all clients to list the full catalog, and open any of the entries. For many use cases, this is sufficient, but if the visibility of catalog entries needs to be limited based on some criteria, a server- (and/or client-) side authorization plugin can be used.
An Intake server can have exactly one server side plugin enabled at startup. The plugin is activated using the Intake configuration file, which lists the class name and the keyword arguments it takes. For example, the "shared secret" plugin would be configured this way:
auth:
cls: intake.auth.secret.SecretAuth
kwargs:
secret: A_SECRET_HASH
This plugin is very simplistic, and exists as a demonstration of how an auth plugin might function for more realistic scenarios.
For more information about configuring the Intake server, see Configuration.
The server auth plugin has two methods. The allow_connect() method decides whether to allow a client to make any request to the server at all, and the allow_access() method decides whether the client is allowed to see a particular catalog entry in the listing and whether they are allowed to open that data source. Note that for catalog entries which allow direct access to the data (via network or shared filesystem), the Intake authorization plugins have no impact on the visibility of the underlying data, only the entries in the catalog.
The actual implementation of a plugin is very short. Here is a simplified version of the shared secret auth plugin:
class SecretAuth(BaseAuth):
def __init__(self, secret, key='intake-secret'):
self.secret = secret
self.key = key
def allow_connect(self, header):
try:
return self.get_case_insensitive(header, self.key, '') \
== self.secret
except:
return False
def allow_access(self, header, source, catalog):
try:
return self.get_case_insensitive(header, self.key, '') \
== self.secret
except:
return False
The header argument is a dictionary of HTTP headers that were present in the client request. In this case, the plugin is looking for a special intake-secret header which contains the shared secret token. Because HTTP header names are not case sensitive, the BaseAuth class provides a helper method get_case_insensitive(), which will match dictionary keys in a case-insensitive way.
The allow_access method also takes two additional arguments. The source argument is the instance of LocalCatalogEntry for the data source being checked. Most commonly auth plugins will want to inspect the _metadata dictionary for information used to make the authorization decision. Note that it is entirely up to the plugin author to decide what sections they want to require in the metadata section. The catalog argument is the instance of Catalog that contains the catalog entry. Typically, plugins will want to use information from the catalog.metadata dictionary to control global defaults, although this is also up to the plugin.
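As a sketch of such a metadata-driven decision (the 'visibility' metadata field here is purely hypothetical, chosen by the plugin author):

from intake.auth.base import BaseAuth

class MetadataAuth(BaseAuth):
    def allow_connect(self, header):
        return True

    def allow_access(self, header, source, catalog):
        # only expose entries whose metadata marks them as public
        meta = source._metadata or {}
        return meta.get('visibility', 'public') == 'public'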
Although server side auth plugins can function entirely independently, some authorization schemes will require the client to add special HTTP headers for the server to look for. To facilitate this, the Catalog constructor accepts an optional auth parameter with an instance of a client auth plugin class.
The corresponding client plugin for the shared secret use case described above looks like:
class SecretClientAuth(BaseClientAuth):
def __init__(self, secret, key='intake-secret'):
self.secret = secret
self.key = key
def get_headers(self):
return {self.key: self.secret}
It defines a single method, get_headers(), which is called to get a dictionary of additional headers to add to the HTTP request to the catalog server. To use this plugin, we would do the following:
import intake
from intake.auth.secret import SecretClientAuth

auth = SecretClientAuth('A_SECRET_HASH')
cat = intake.Catalog('http://example.com:5000', auth=auth)
Now all requests made to the remote catalog will contain the intake-secret header.
Intake can be used to create Data packages, so that you can easily distribute your catalogs - others can simply "install data". Since you may also want to distribute custom catalogues, perhaps with visualisations and driver code, packaging these things together is a great convenience. Indeed, packaging gives you the opportunity to version-tag your distribution and to declare the requirements needed to use the data. This is a common pattern for distributing code for Python and other languages, but not commonly seen for data artifacts.
The current version of Intake allows making data packages using standard python tools (to be installed, for example, using pip). The previous, now deprecated, technique is still described below, under Pure conda solution and is specific to the conda packaging system.
Intake allows you to register data artifacts (catalogs and data sources) in the metadata of a python package. This means that when you install that package, Intake will automatically know of the registered items, and they will appear within the "builtin" catalog intake.cat.
Here we assume that you understand what is meant by a python package (i.e., a folder containing __init__.py and other code, config and data files). Furthermore, you should familiarise yourself with what is required to bundle such a package into a distributable package (one with a setup.py) by reading the official packaging documentation.
The intake-examples repository contains a full tutorial for packaging and distributing Intake data and/or catalogs for pip and conda; see the directory "data_package/".
Intake uses the concept of entry points to discover the catalog entries that a given package provides. Entry points provide a mechanism to register metadata about a package at install time, so that it can easily be found by other packages such as Intake. Entry points were originally provided by a separate package, but the functionality is included in the standard library as of Python 3.8 (you will not need to install it separately, as Intake already requires it).
All you need to do to register an entry in intake.cat is to declare it in the "intake.catalogs" entry point group, for example:
entry_points={
'intake.catalogs': [
'sea_cat = intake_example_package:cat',
'sea_data = intake_example_package:data'
]
}
Here only the lines with "sea_cat" and "sea_data" are specific to the example package; the rest is required boilerplate. Each of those two lines defines a name for the data entry (before the "=" sign) and the location to load from, in module:object format.
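For context, the entry_points block above is passed to the setup() call in the package's setup.py. A minimal sketch might look like the following (the distribution name and version are hypothetical; the entry_points argument repeats the block above):

# setup.py
from setuptools import setup, find_packages

setup(
    name='intake-example-package',
    version='0.0.1',
    packages=find_packages(),
    install_requires=['intake'],
    entry_points={
        'intake.catalogs': [
            'sea_cat = intake_example_package:cat',
            'sea_data = intake_example_package:data',
        ]
    },
)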
When Intake is imported, it investigates all registered entry points with the "intake.catalogs" group. It will go through and assign each name to the given location of the final object. In the above example, intake.cat.sea_cat would be associated with the cat object in the intake_example_package package, and so on.
Note that Intake does not immediately import the given package or module, because imports can sometimes be expensive, and if you have a lot of data packages, it might cause a slow-down every time that Intake is imported. Instead, a placeholder entry is created, and whenever the entry is accessed, that's when the particular package will be imported.
In [1]: import intake

In [2]: intake.cat.sea_cat  # does not import yet
Out[2]: <Entry containing Catalog named sea_cat>

In [3]: cat = intake.cat.sea_cat()  # imports now

In [4]: cat  # this data source happens to be a catalog
Out[4]: <Intake catalog: sea>
(Note the parentheses here: they explicitly initialise the source, which you do not normally need to do.)
This packaging method is deprecated, but still available.
Combined with the Conda Package Manager, Intake makes it possible to create Data packages which can be installed and upgraded just like software packages. This offers several advantages:
In this tutorial, we give a walk-through to enable you to distribute any Catalogs to others, so that they can access the data using Intake without worrying about where it resides or how it should be loaded.
The function intake.catalog.default.load_combo_catalog searches for YAML catalog files in a number of places at import time. All entries in these catalogs are flattened and placed in the "builtin" intake.cat.
The places searched are:
The steps involved in creating a data package are:
Data packages are standard conda packages that install an Intake catalog file into the user's conda environment ($CONDA_PREFIX/share/intake). A data package does not necessarily imply there are data files inside the package. A data package could describe remote data sources (such as files in S3) and take up very little space on disk.
These packages are considered noarch packages, so that one package can be installed on any platform, with any version of Python (or no Python at all). The easiest way to create such a package is using a conda build recipe.
Conda-build recipes are stored in a directory that contains files such as a meta.yaml and a build.sh (both shown below).
An example that packages up data from a Github repository would look like this:
# meta.yaml
package:
  version: '1.0.0'
  name: 'data-us-states'

source:
  git_rev: v1.0.0
  git_url: https://github.com/CivilServiceUSA/us-states

build:
  number: 0
  noarch: generic

requirements:
  run:
    - intake
  build: []

about:
  description: Data about US states from CivilServices (https://civil.services/)
  license: MIT
  license_family: MIT
  summary: Data about US states from CivilServices
The key part of a data package recipe (the piece that differs from typical conda recipes) is the build section:
build:
number: 0
noarch: generic
This will create a package that can be installed on any platform, regardless of the platform where the package is built. If you need to rebuild a package, the build number can be incremented to ensure users get the latest version when they conda update.
The corresponding build.sh file in the recipe looks like this:
#!/bin/bash

# Install the data file and the catalog into the package's share/intake directory
mkdir -p $PREFIX/share/intake/civilservices
cp $SRC_DIR/data/states.csv $PREFIX/share/intake/civilservices
cp $RECIPE_DIR/us_states.yaml $PREFIX/share/intake/
The $SRC_DIR variable refers to any source tree checked out (from Github or other service), and the $RECIPE_DIR refers to the directory where the meta.yaml is located.
Finishing out this example, the catalog file for this data source looks like this:
sources:
states:
description: US state information from [CivilServices](https://civil.services/)
driver: csv
args:
urlpath: '{{ CATALOG_DIR }}/civilservices/states.csv'
metadata:
origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'
The {{ CATALOG_DIR }} Jinja2 variable is used to construct a path relative to where the catalog file was installed.
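For example, given the build.sh above, us_states.yaml ends up in $CONDA_PREFIX/share/intake/ in the user's environment, so {{ CATALOG_DIR }} renders to that directory and the urlpath resolves to:

$CONDA_PREFIX/share/intake/civilservices/states.csv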
To build the package, you must have conda-build installed:
conda install conda-build
Building the package requires no special arguments:
conda build my_recipe_dir
Conda-build will display the path of the built package, which you will need in order to upload it.
If you want your data package to be publicly available on Anaconda Cloud, you can install the anaconda-client utility:
conda install anaconda-client
Then you can register your Anaconda Cloud credentials and upload the package:
anaconda login
anaconda upload /Users/intake_user/anaconda/conda-bld/noarch/data-us-states-1.0.0-0.tar.bz2
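Once the package is uploaded, a user can install it with conda (for example conda install -c <your-channel> data-us-states, where the channel name depends on your Anaconda Cloud account). The catalog file then lands in their environment's share/intake directory and its entries are picked up by the builtin catalog; a sketch of the user side:

>>> import intake
>>> list(intake.cat)                # 'states' should now appear among the entries
>>> df = intake.cat.states.read()   # load the packaged CSV as a DataFrame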
As noted above, entries will appear in the users' builtin catalog as intake.cat.*. In the case that the catalog has multiple entries, it may be desirable to put the entries below a namespace as intake.cat.data_package.*. This can be achieved by having one catalog containing the (several) data sources, with only a single top-level entry pointing to it. This catalog could be defined in a YAML file, created using any other catalog driver, or constructed in the code, e.g.:
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry as Entry

# my_input_list is an iterable of (name, url) pairs; the description string
# here is only a placeholder
cat = Catalog()
cat._entries = {name: Entry(name, 'description', driver='package.module.driver',
                            args={"urlpath": url})
                for name, url in my_input_list}
If your package contains many sources of different types, you may even nest the catalogs, i.e., have a top-level catalog whose entries are themselves catalogs.
e = Entry('first_cat', 'sample', driver='catalog')
e._default_source = cat

top_level = Catalog()
top_level._entries = {'first_cat': e, ...}
where your entry point might look something like: "my_cat = my_package:top_level". You could achieve the same with multiple YAML files.
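For instance, a minimal sketch of the YAML route (file names here are hypothetical, and it assumes the yaml_file_cat driver, which loads a catalog from another YAML file) would be a top-level catalog whose single entry points at a second catalog shipped alongside it:

# top_level.yaml
sources:
  first_cat:
    description: sample
    driver: yaml_file_cat
    args:
      path: '{{ CATALOG_DIR }}/nested_cat.yaml'

Your package could then expose, as its entry point object, a catalog opened from top_level.yaml (for example with intake.open_catalog).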
Some high-level work that we expect to be achieved on the time-scale of months. This list is not exhaustive, but rather aims to whet the appetite for what Intake can be in the future.
Since Intake aims to be a community of data-oriented pythoneers, nothing written here is laid in stone, and users and devs are encouraged to make their opinions known!
Data-type drivers are easy to write, but still require some effort, and therefore reasonable impetus to get the work done. Conversations over the coming months can help determine the drivers that should be created by the Intake team, and those that might be contributed by the community.
The next type that we would specifically like to consider is machine learning model artifacts. EDIT see https://github.com/AlbertDeFusco/intake-sklearn , and hopefully more to come.
Many data sources are inherently time-sensitive and event-wise. These are not covered well by existing Python tools, but the streamz library may present a nice way to model them. From the Intake point of view, the task would be to develop a streaming type, and at least one data driver that uses it.
The most obvious place to start would be reading a file: every time a new line appears in the file, an event is emitted. This is appropriate, for instance, for watching the log files of a web server, and indeed could be extended to read from an arbitrary socket.
EDIT see: https://github.com/intake/intake-streamz
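As a rough illustration of the behaviour such a driver could expose, the streamz library can already turn a growing text file into a stream of events. The following standalone sketch (not an Intake driver; the file name and polling interval are arbitrary) watches a log file and prints each newly appended line:

from streamz import Stream

# Emit every new line appended to 'web.log' as an event
source = Stream.from_textfile('web.log', poll_interval=0.5)
source.map(str.strip).sink(print)

source.start()  # begin polling; events flow until the stream is stopped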
We would like to add API endpoints to the server, so that a user (with sufficient privilege) can post data specifications to a running server, optionally saving the specs to a catalog on the server side. Furthermore, we will consider the possibility of being able to upload and/or transform data (rather than only refer to it in a third-party location), so that you would have a one-line "publish" ability from the client.
The server, in general, could do with a lot of work to become more than the current demonstration/prototype. In particular, it should be able to be performant and scalable, meaning that the server implementation ought to keep as little local state as possible.
We would like to make it easier to write Intake drivers which don't need any persist or GUI functionality, and to make it possible to install Intake's core functionality (driver registry, data loading and catalog traversal) without needing many other packages at all.
EDIT this has been partly done, you can derive from DataSourceBase and not have to use the full set of Intake's features for simplicity. We have also gone some distance to separate out dependencies for parts of the package, so that you can install Intake and only use some of the subpackages/modules - imports don't happen until those parts of the code are used. We have not yet split the intake conda package into, for example, intake-base, intake-server, intake-gui...
This is for those who wish to provide Intake's data source API, and make data sources available to Intake cataloguing, but don't wish to take Intake as a direct dependency. The actual API of DataSources is rather simple:
Of these, only the first three are really necessary for a minimal interface, so Intake might do well to publish this protocol specification, so that new drivers can be written that can be used by Intake but do not need Intake, and so help adoption.
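The full list of methods is not reproduced here, but as a sketch of the idea, a third-party class could expose a small read-oriented surface without importing Intake at all; the method names below (discover, read, close) are an assumption of this example rather than a published protocol:

import pandas as pd

class PlainCSVSource:
    """Intake-like data source with no Intake dependency (illustrative only)."""
    container = 'dataframe'

    def __init__(self, urlpath):
        self.urlpath = urlpath

    def discover(self):
        # Describe the data cheaply, without loading it all
        sample = pd.read_csv(self.urlpath, nrows=10)
        return {'dtype': sample.dtypes.to_dict(), 'npartitions': 1}

    def read(self):
        # Load the whole dataset into memory
        return pd.read_csv(self.urlpath)

    def close(self):
        pass  # nothing to release for a local file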
Intake is used and developed by individuals at a variety of institutions. It is open source (license) and sits within the broader Python numeric ecosystem commonly referred to as PyData or SciPy.
Conversation happens in the following places:
We welcome usage questions and bug reports from all users, even those who are new to using the project. There are a few things you can do to improve the likelihood of quickly getting a good answer.
If you have a general question about how something should work, or want best practices, then use Stack Overflow. If you think you have found a bug, then use GitHub.
Anaconda
2022, Anaconda
September 22, 2022 | 0.6.6 |