biothings.hub.datainspect
biothings.hub.datainspect.inspector
- class biothings.hub.datainspect.inspector.InspectorManager(upload_manager, build_manager, *args, **kwargs)
Bases: BaseManager
- clean_stale_status()
During startup, search for actions in progress that would have been interrupted and change their state to “canceled”. For example, some downloading processes could have been interrupted; at startup, their “downloading” status should be changed to “canceled” to reflect the actual state of these datasources. This must be overridden in subclasses.
- inspect(data_provider, mode='type', batch_size=10000, limit=None, sample=None, **kwargs)
Inspect the given data provider, which can be:
  - a backend definition (see bt.hub.databuild.create_backend for supported formats), e.g. “merged_collection” or (“src”, “clinvar”)
  - a callable yielding documents
mode:
  - “type”: inspect and report the type map found in the data (internal, non-standard format)
  - “mapping”: inspect and return a map compatible with later Elasticsearch mapping generation (see bt.utils.es.generate_es_mapping)
  - “stats”: inspect and report types plus the different counts found in the data, giving a detailed overview of the volume of each field and sub-field
  - “jsonschema”: same as “type”, but the result is formatted according to the JSON Schema standard
limit: when set to an integer, inspect only that many documents.
sample: combined with limit; for each document, the document is inspected only if random.random() <= sample (a float). This option allows inspecting only a sample of the data.
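The interaction between a callable data provider, limit and sample can be illustrated with a minimal sketch. This is not the library's implementation: make_provider and select_documents are hypothetical helpers written for illustration only, and the sketch assumes that limit caps the number of documents read from the provider before the sample test is applied to each of them.

```python
import random
from itertools import islice


def make_provider(n):
    """Hypothetical data provider: a callable yielding documents."""
    def provider():
        for i in range(n):
            yield {"_id": str(i), "value": i}
    return provider


def select_documents(provider, limit=None, sample=None, rng=random):
    """Yield the documents that would be inspected.

    Assumption (not confirmed by the documentation): limit caps the number
    of documents read from the provider, and sample is then tested against
    each of those documents.
    """
    docs = provider()
    if limit is not None:
        docs = islice(docs, limit)  # read at most `limit` documents
    for doc in docs:
        # Keep the document if no sampling is requested, or if the random
        # draw falls at or below the sampling ratio.
        if sample is None or rng.random() <= sample:
            yield doc


if __name__ == "__main__":
    provider = make_provider(1000)
    picked = list(select_documents(provider, limit=100, sample=0.1))
    print(f"{len(picked)} of the first 100 documents selected")
```

Under these assumptions, a sample of 0.1 with a limit of 100 selects roughly ten documents on average; the exact number varies from run to run unless a seeded random generator is passed in.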