biothings.hub.datatransform¶
biothings.hub.datatransform.ciidstruct¶
CIIDStruct - case-insensitive id matching data structure
- class biothings.hub.datatransform.ciidstruct.CIIDStruct(field=None, doc_lst=None)[source]¶
Bases:
IDStruct
CIIDStruct - id structure for use with the DataTransform classes. The basic idea is to provide a structure that provides a list of (original_id, current_id) pairs.
This is a case-insensitive version of IDStruct.
Initialize the structure.
- Parameters:
field – field for documents to use as an initial id (optional)
doc_lst – list of documents to use when building an initial list (optional)
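The idea of a case-insensitive list of (original_id, current_id) pairs can be sketched roughly as follows. This is not the library's implementation: the class name `CaseInsensitivePairs` and its `add` and `find` methods are hypothetical, and only the `field`, `doc_lst`, and `id_lst` names are taken from this page.

```python
class CaseInsensitivePairs:
    """Hypothetical stand-in for a case-insensitive id structure."""

    def __init__(self, field=None, doc_lst=None):
        self.pairs = []  # list of (original_id, current_id) tuples
        if field and doc_lst:
            for doc in doc_lst:
                if field in doc:
                    # Initially the current id equals the original id.
                    self.add(doc[field], doc[field])

    @staticmethod
    def _norm(value):
        # Only strings are lower-cased; other id types are compared as-is.
        return value.lower() if isinstance(value, str) else value

    def add(self, original_id, current_id):
        self.pairs.append((original_id, current_id))

    def find(self, current_id):
        """Return original ids whose current id matches, ignoring case."""
        target = self._norm(current_id)
        return [orig for orig, cur in self.pairs if self._norm(cur) == target]

    @property
    def id_lst(self):
        """Current ids in insertion order, de-duplicated ignoring case."""
        seen, out = set(), []
        for _, cur in self.pairs:
            key = self._norm(cur)
            if key not in seen:
                seen.add(key)
                out.append(cur)
        return out


# Made-up documents for illustration only.
docs = [{"_id": "ABC1"}, {"_id": "abc1"}, {"_id": "xyz9"}]
ids = CaseInsensitivePairs(field="_id", doc_lst=docs)
```

With these documents, `ids.find("Abc1")` matches both the `"ABC1"` and `"abc1"` originals, which is the behaviour a case-sensitive structure would not give.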
biothings.hub.datatransform.datatransform_api¶
DataTransformAPI - classes around API based key lookup.
- class biothings.hub.datatransform.datatransform_api.BiothingsAPIEdge(lookup, fields, weight=1, label=None, url=None)[source]¶
Bases:
DataTransformEdge
APIEdge - IDLookupEdge object for API calls
Initialize the class.
- Parameters:
label – A label can be used for debugging purposes.
- property client¶
property getter for client
- client_name = None¶
- class biothings.hub.datatransform.datatransform_api.DataTransformAPI(input_types, output_types, *args, **kwargs)[source]¶
Bases:
DataTransform
Perform key lookup or key conversion from one key type to another using an API endpoint as a data source.
This class uses BioThings APIs to convert from one key type to another. Subclasses are used with the decorator syntax shown below:
    @IDLookupMyChemInfo(input_types, output_types)
    def load_document(doc_lst):
        for d in doc_lst:
            yield d
Lookup fields are configured in the ‘lookup_fields’ object, examples of which can be found in ‘IDLookupMyGeneInfo’ and ‘IDLookupMyChemInfo’.
- Required Options:
- input_types
‘type’
(‘type’, ‘nested_source_field’)
[(‘type1’, ‘nested.source_field1’), (‘type2’, ‘nested.source_field2’), …]
- output_types:
‘type’
[‘type1’, ‘type2’]
Additional Options: see DataTransform class
Initialize the IDLookupAPI object.
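The three accepted shapes of input_types listed above can be reduced to one uniform list of (type, source_field) pairs. The sketch below shows one way to do that; the function name `normalize_input_types` is hypothetical, and the assumption that a bare type falls back to the documented default_source of '_id' is ours, not confirmed by this page.

```python
DEFAULT_SOURCE = "_id"  # mirrors the documented default_source attribute


def normalize_input_types(input_types):
    """Return a list of (type, source_field) tuples.

    Accepts 'type', ('type', 'field'), or a list mixing both forms.
    """
    if isinstance(input_types, str):
        # Bare type: assume the id is read from the default source field.
        return [(input_types, DEFAULT_SOURCE)]
    if isinstance(input_types, tuple):
        return [(input_types[0], input_types[1])]
    if isinstance(input_types, list):
        result = []
        for item in input_types:
            result.extend(normalize_input_types(item))
        return result
    raise ValueError(f"Unsupported input_types: {input_types!r}")
```

For example, `normalize_input_types([("type1", "nested.source_field1"), "type2"])` yields one pair per entry, with `"type2"` paired with `"_id"`.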
- batch_size = 10¶
- default_source = '_id'¶
- key_lookup_batch(batchiter)[source]¶
Look up all keys for the ids given in the batch iterator (1 block).
- Parameters:
batchiter – 1 block of records to look up keys for
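To illustrate what a "block" of records might look like, the sketch below splits any iterable of documents into lists of at most batch_size items. How the library actually forms its batches is not described here, so the helper `iter_batches` is a hypothetical illustration that only borrows the documented batch_size of 10.

```python
from itertools import islice


def iter_batches(docs, batch_size=10):
    """Yield lists of at most batch_size documents from any iterable."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the input is exhausted
        yield batch


# 25 made-up documents split into blocks of 10, 10 and 5.
documents = [{"_id": str(i)} for i in range(25)]
batch_sizes = [len(batch) for batch in iter_batches(documents, batch_size=10)]
```

Batching keeps the number of API requests proportional to the number of blocks rather than the number of documents.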
- lookup_fields = {}¶
- class biothings.hub.datatransform.datatransform_api.DataTransformMyChemInfo(input_types, output_types=None, skip_on_failure=False, skip_w_regex=None)[source]¶
Bases:
DataTransformAPI
Single key lookup for MyChemInfo
Initialize the class by setting up the client object.
- lookup_fields = {'chebi': 'chebi.chebi_id', 'chembl': 'chembl.molecule_chembl_id', 'drugbank': 'drugbank.drugbank_id', 'drugname': ['drugbank.name', 'unii.preferred_term', 'chebi.chebi_name', 'chembl.pref_name'], 'inchi': ['drugbank.inchi', 'chembl.inchi', 'pubchem.inchi'], 'inchikey': ['drugbank.inchi_key', 'chembl.inchi_key', 'pubchem.inchi_key'], 'pubchem': 'pubchem.cid', 'rxnorm': ['unii.rxcui'], 'unii': 'unii.unii'}¶
- output_types = ['inchikey', 'unii', 'rxnorm', 'drugbank', 'chebi', 'chembl', 'pubchem', 'drugname']¶
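Entries in lookup_fields are either a single dotted field name or a list of them. A small helper can hide that difference from calling code. The sketch below copies a subset of the mapping shown above; the helper name `fields_for` is hypothetical and says nothing about how the library itself consumes the mapping.

```python
# Subset of the documented DataTransformMyChemInfo.lookup_fields mapping.
LOOKUP_FIELDS = {
    "chebi": "chebi.chebi_id",
    "inchikey": ["drugbank.inchi_key", "chembl.inchi_key", "pubchem.inchi_key"],
    "rxnorm": ["unii.rxcui"],
}


def fields_for(key_type, lookup_fields=LOOKUP_FIELDS):
    """Return the API fields for key_type as a list, whatever the stored shape."""
    value = lookup_fields[key_type]
    return [value] if isinstance(value, str) else list(value)
```

`fields_for("chebi")` returns a one-element list, while `fields_for("inchikey")` returns all three candidate fields.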
- class biothings.hub.datatransform.datatransform_api.DataTransformMyGeneInfo(input_types, output_types=None, skip_on_failure=False, skip_w_regex=None)[source]¶
Bases:
DataTransformAPI
deprecated
Initialize the class by setting up the client object.
- lookup_fields = {'ensembl': 'ensembl.gene', 'entrezgene': 'entrezgene', 'symbol': 'symbol', 'uniprot': 'uniprot.Swiss-Prot'}¶
- class biothings.hub.datatransform.datatransform_api.MyChemInfoEdge(lookup, field, weight=1, label=None, url=None)[source]¶
Bases:
BiothingsAPIEdge
The MyChemInfoEdge uses the MyChem.info API to convert identifiers.
- Parameters:
lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- client_name = 'drug'¶
- class biothings.hub.datatransform.datatransform_api.MyGeneInfoEdge(lookup, field, weight=1, label=None, url=None)[source]¶
Bases:
BiothingsAPIEdge
The MyGeneInfoEdge uses the MyGene.info API to convert identifiers.
- Parameters:
lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- client_name = 'gene'¶
biothings.hub.datatransform.datatransform_mdb¶
DataTransform MDB module - class for performing key lookup using conversions described in a networkx graph.
- class biothings.hub.datatransform.datatransform_mdb.CIMongoDBEdge(collection_name, lookup, field, weight=1, label=None)[source]¶
Bases:
MongoDBEdge
Case-insensitive MongoDBEdge
- Parameters:
collection_name (str) – The name of the MongoDB collection.
lookup (str) – The field that will match the input identifier in the collection.
field (str) – The output identifier field that will be read out of matching documents.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- class biothings.hub.datatransform.datatransform_mdb.DataTransformMDB(graph, *args, **kwargs)[source]¶
Bases:
DataTransform
Convert document identifiers from one type to another.
The DataTransformNetworkX module was written as a decorator class, which should be applied to the load_data function of a BioThings uploader. The load_data function yields documents, which are then post-processed by __call__, and the ‘id’ key conversion is performed.
- Parameters:
graph – nx.DiGraph (networkx 2.1) configuration graph
input_types – A list of input types of the form (identifier, field), where identifier matches a node and field is an optional dotstring field indicating where the identifier should be read from (the default is ‘_id’).
output_types (list(str)) – A priority list of identifiers to convert to. These identifiers should match nodes in the graph.
id_priority_list (list(str)) – A priority list of identifiers to sort input and output types by.
skip_on_failure (bool) – If True, documents where identifier conversion fails will be skipped in the final document list.
skip_w_regex (str) – Do not perform conversion if the identifier matches the regular expression provided to this argument. By default, this option is disabled.
skip_on_success (bool) – If True, documents where identifier conversion succeeds will be skipped in the final document list.
idstruct_class (class) – Override an internal data structure used by this module (advanced usage)
copy_from_doc (bool) – If True, an identifier is copied from the input source document regardless of whether it matches an edge or not. (advanced usage)
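The edge weights described on this page prefer the path with the lowest weight. As a rough illustration, the sketch below finds such a path with Dijkstra's algorithm over a plain dictionary. The library itself is configured with an nx.DiGraph; the dictionary representation, the function name `lowest_weight_path`, the example weights, and the assumption that a path's cost is the sum of its edge weights are all ours.

```python
import heapq


def lowest_weight_path(graph, source, target):
    """Return (total_weight, path) for the cheapest path, or None.

    graph maps each node to a dict of {neighbor: edge_weight}.
    """
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None  # target is not reachable from source


# Hypothetical configuration: a direct but expensive edge competes
# with a cheaper two-step route through 'drugbank'.
graph = {
    "pubchem": {"inchikey": 5, "drugbank": 1},
    "drugbank": {"inchikey": 1},
}
best = lowest_weight_path(graph, "pubchem", "inchikey")
```

Here the two-step route costs 2 and is chosen over the direct edge that costs 5.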
- batch_size = 1000¶
- default_source = '_id'¶
- class biothings.hub.datatransform.datatransform_mdb.MongoDBEdge(collection_name, lookup, field, weight=1, label=None, check_index=True)[source]¶
Bases:
DataTransformEdge
The MongoDBEdge uses data within a MongoDB collection to convert one identifier to another. The input identifier is used to search a collection. The output identifier values are read out of that collection.
- Parameters:
collection_name (str) – The name of the MongoDB collection.
lookup (str) – The field that will match the input identifier in the collection.
field (str) – The output identifier field that will be read out of matching documents.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- property collection¶
getter for the collection member variable
- collection_find(id_lst, lookup, field)[source]¶
Abstract out (as one line) the call to collection.find
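The lookup and field parameters can be pictured with an in-memory stand-in for a collection. The sketch below does not talk to MongoDB: a list of dictionaries plays the role of the collection, and the helper names `get_dotted` and `convert_ids` as well as the sample documents are hypothetical.

```python
def get_dotted(doc, dotted_field):
    """Read a value such as 'pubchem.cid' from a nested dict, or None."""
    value = doc
    for part in dotted_field.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def convert_ids(collection, id_lst, lookup, field):
    """Return (input_id, output_id) pairs for documents matching on lookup."""
    wanted = set(id_lst)
    pairs = []
    for doc in collection:
        key = get_dotted(doc, lookup)   # value compared with the input ids
        out = get_dotted(doc, field)    # value read out as the new id
        if isinstance(key, (str, int)) and key in wanted and out is not None:
            pairs.append((key, out))
    return pairs


# Made-up documents standing in for a MongoDB collection.
collection = [
    {"_id": "doc1", "pubchem": {"cid": "100"}, "inchikey": "KEY-A"},
    {"_id": "doc2", "pubchem": {"cid": "200"}, "inchikey": "KEY-B"},
]
converted = convert_ids(collection, ["100", "300"], "pubchem.cid", "inchikey")
```

Only the id `"100"` has a matching document, so `"300"` produces no pair and would count as a failed conversion.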
biothings.hub.datatransform.datatransform¶
DataTransform module: IDStruct and DataTransform (superclass)
- class biothings.hub.datatransform.datatransform.DataTransform(input_types, output_types, id_priority_list=None, skip_on_failure=False, skip_w_regex=None, skip_on_success=False, idstruct_class=<class 'biothings.hub.datatransform.datatransform.IDStruct'>, copy_from_doc=False, debug=False)[source]¶
Bases:
object
DataTransform class. This class is the public interface for the DataTransform module. Much of the core logic is in the subclasses.
Initialize the keylookup object and precompute paths from the start key to all target keys.
The decorator is intended to be applied to the load_data function of an uploader. The load_data function yields documents, which are then post-processed by __call__, and the ‘id’ key conversion is performed.
- Parameters:
G – nx.DiGraph (networkx 2.1) configuration graph
collections – list of mongodb collection names
input_type – key type to start key lookup from
output_types – list of all output types to convert to
id_priority_list (list(str)) – A priority list of identifiers to sort input and output types by.
idstruct_class – IDStruct class used to manage/fetch IDs from docs
copy_from_doc – If the transform fails using the graph, try to get the value from the document itself when output_type == input_type. No check is performed; it is a straight copy. If checks are needed (e.g. to verify that an ID referenced in the doc actually exists in another collection), nodes with self-loops can be used, so that ID resolution is forced to go through these loops to ensure the data exists.
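The copy_from_doc fallback described above can be sketched as a small decision function. This is an illustration of the documented rule only: the function name `resolve_id`, its arguments, and the convention that a failed graph conversion is represented by None are assumptions.

```python
def resolve_id(doc, input_type, output_types, graph_result,
               source_field="_id", copy_from_doc=False):
    """Return the output id for doc, or None if the conversion failed."""
    if graph_result is not None:
        return graph_result  # the graph traversal succeeded
    if copy_from_doc and input_type in output_types:
        # Straight copy from the document, with no validation at all.
        return doc.get(source_field)
    return None


doc = {"_id": "ABC123"}  # made-up document
copied = resolve_id(doc, "inchikey", ["inchikey", "unii"], None, copy_from_doc=True)
failed = resolve_id(doc, "inchikey", ["inchikey", "unii"], None, copy_from_doc=False)
```

Because the copy is unchecked, a self-loop edge on the node is the documented way to force a lookup that confirms the id really exists.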
- DEFAULT_WEIGHT = 1¶
- batch_size = 1000¶
- debug = False¶
- default_source = '_id'¶
- property id_priority_list¶
Property method for getting id_priority_list
- key_lookup_batch(batchiter)[source]¶
Core method for looking up all keys in a batch (iterator).
- Parameters:
batchiter – the batch (iterator) to look up keys for
- lookup_one(doc)[source]¶
Perform key lookup on a single document. This method is invoked as a regular function call rather than as a decorator on a document iterator.
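The difference between the decorator form and a direct lookup_one call can be shown with a toy class. The sketch below is not a DataTransform subclass: the class `UppercaseIds` is hypothetical, its "conversion" is simply upper-casing the `_id` value, and the return value of its `lookup_one` is an assumption made for illustration.

```python
import functools


class UppercaseIds:
    """Hypothetical transform whose conversion is upper-casing '_id'."""

    def lookup_one(self, doc):
        """Transform a single document through a plain function call."""
        converted = dict(doc)  # leave the caller's document untouched
        converted["_id"] = str(doc["_id"]).upper()
        return converted

    def __call__(self, func):
        """Decorator form: wrap a generator function that yields documents."""
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            for doc in func(*args, **kwargs):
                yield self.lookup_one(doc)
        return wrapped


transform = UppercaseIds()


@transform
def load_data():
    # Made-up documents standing in for an uploader's output.
    yield {"_id": "abc"}
    yield {"_id": "def"}


from_decorator = list(load_data())
from_call = transform.lookup_one({"_id": "xyz"})
```

The decorator form suits a whole load_data generator, while the function-call form is convenient when only one document is at hand.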
- class biothings.hub.datatransform.datatransform.DataTransformEdge(label=None)[source]¶
Bases:
object
DataTransformEdge. This class contains information needed to transform one key to another.
Initialize the class.
- Parameters:
label – A label can be used for debugging purposes.
- edge_lookup(keylookup_obj, id_strct, debug=False)[source]¶
Virtual method for edge lookup. Each edge class is responsible for its own lookup procedures given a keylookup_obj and an id_strct.
- Parameters:
keylookup_obj – key lookup object
id_strct – list of tuples (orig_id, current_id)
- property logger¶
getter for the logger property
- class biothings.hub.datatransform.datatransform.IDStruct(field=None, doc_lst=None)[source]¶
Bases:
object
IDStruct - id structure for use with the DataTransform classes. The basic idea is to provide a structure that provides a list of (original_id, current_id) pairs.
Initialize the structure.
- Parameters:
field – field for documents to use as an initial id (optional)
doc_lst – list of documents to use when building an initial list (optional)
- property id_lst¶
Build up a list of current ids
- class biothings.hub.datatransform.datatransform.RegExEdge(from_regex, to_regex, weight=1, label=None)[source]¶
Bases:
DataTransformEdge
The RegExEdge allows an identifier to be transformed using a regular expression. POSIX regular expressions are supported.
- Parameters:
from_regex (str) – The first parameter of the regular expression substitution.
to_regex (str) – The second parameter of the regular expression substitution.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
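A regular expression substitution over (orig_id, current_id) pairs can be sketched with Python's re module. The function name `regex_edge` and the sample identifiers are hypothetical, and leaving non-matching ids unchanged is an assumption about behaviour that this page does not specify. Note that Python's `re` syntax is used here, which is not identical to POSIX syntax.

```python
import re


def regex_edge(id_pairs, from_regex, to_regex):
    """Apply a substitution to the current id of each (orig_id, current_id) pair."""
    pattern = re.compile(from_regex)
    # from_regex and to_regex are the first and second arguments of the substitution.
    return [(orig, pattern.sub(to_regex, cur)) for orig, cur in id_pairs]


# Made-up identifiers: strip a 'CHEBI:' prefix where present.
pairs = [("CHEBI:15377", "CHEBI:15377"), ("15378", "15378")]
stripped = regex_edge(pairs, r"^CHEBI:", "")
```

The original id is kept in the first slot of each pair, so the converted value can still be traced back to the document it came from.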
biothings.hub.datatransform.histogram¶
DataTransform Histogram class - track keylookup statistics