biothings.hub.dataload¶
biothings.hub.dataload.dumper¶
- class biothings.hub.dataload.dumper.APIDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
Dump data from APIs
This will run API calls in a clean process and write its results in one or more NDJSON documents.
Populate the static methods get_document and get_release in your subclass, along with other necessary bits common to all dumpers.
For details on specific parts, read the docstring for individual methods.
An example subclass implementation can be found in the unii data source for MyGene.info.
- property client¶
- create_todump_list(force=False, **kwargs)[source]¶
This gets called by method dump, to populate self.to_dump
- download(remotefile, localfile)[source]¶
Runs helper function in new process to download data
This is run in a new process by the do_dump coroutine of the parent class. It then spawns another process that actually does all the work. This method mostly sets up the environment: configuring the process pool executor to correctly use spawn, using concurrent.futures to run tasks in the new process, and periodically checking the status of the task.
Explanation: because this is actually running inside a process forked from a threaded process, the internal state is more or less corrupt/broken; see man 2 fork for details. More discussion can be found in Slack, from some time in 2021, on why it has to be forked and why it is broken.
Caveats: the existing job manager will not know how much memory the actual worker process is using.
- static get_document() Generator[Tuple[str, Any], None, None] [source]¶
Get document from API source
Populate this method to yield documents to be stored on disk. Every time you want to save something to disk, do this:
>>> yield 'name_of_file.ndjson', {'stuff': 'you want saved'}
While the type definition says Any is accepted, the value has to be JSON serializable, so essentially Python dictionaries/lists with strings and numbers as the most basic elements.
Later on in your uploader, you can treat the files as NDJSON documents, i.e. one JSON document per line.
It is recommended that you only do the minimal necessary processing in this step.
A default HTTP client is not provided so you get the flexibility of choosing your most favorite tool.
This MUST be a static method or it cannot be properly serialized to run in a separate process.
This method is expected to be blocking (synchronous), but be sure to properly SET TIMEOUTS. You open resources in this function, so you are responsible for checking and closing them; forcefully stopping this method from the outside would leave a mess behind, which is why the invoker does not do that.
You can set a 5 second timeout using the popular requests package like this:
>>> import requests
>>> r = requests.get('https://example.org', timeout=5.0)
You can catch the exception or set up retries. If you cannot handle the situation, just raise the exception or leave it uncaught; APIDumper will handle it properly: documents are only saved when the entire method completes successfully.
- static get_release() str [source]¶
Get the string for the release information.
This is run in the main process and thread, so it must return quickly. It must be populated in your subclass.
- Returns:
string representing the release.
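The yield contract of get_document, and how the resulting NDJSON files read back, can be sketched as follows. The records and filename below are hypothetical; a real implementation would fetch them from an API (with proper timeouts, as noted above):

```python
import json
from typing import Any, Generator, Tuple

def get_document() -> Generator[Tuple[str, Any], None, None]:
    # Hypothetical records; a real dumper would fetch these from an API,
    # e.g. with requests.get(url, timeout=5.0), handling retries itself.
    records = [{"_id": "a", "value": 1}, {"_id": "b", "value": 2}]
    for record in records:
        yield "data.ndjson", record  # one JSON-serializable doc per yield

# APIDumper writes each yielded doc as one NDJSON line; an uploader can
# later read the file back line by line:
ndjson_lines = [json.dumps(doc) for _, doc in get_document()]
docs = [json.loads(line) for line in ndjson_lines]
```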
- class biothings.hub.dataload.dumper.BaseDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
object
- ARCHIVE = True¶
- AUTO_UPLOAD = True¶
- MAX_PARALLEL_DUMP = None¶
- SCHEDULE = None¶
- SLEEP_BETWEEN_DOWNLOAD = 0.0¶
- SRC_NAME = None¶
- SRC_ROOT_FOLDER = None¶
- SUFFIX_ATTR = 'release'¶
- property client¶
- create_todump_list(force=False, **kwargs)[source]¶
Fill the self.to_dump list with dict("remote": remote_path, "local": local_path) elements. This is the todo list for the dumper, and a good place to check whether each file needs to be downloaded. If 'force' is True though, all files will be considered for download.
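A rough sketch of what create_todump_list is expected to do. All names here (the stand-in class, folder, and file list) are hypothetical; a real dumper would list files from its remote source:

```python
import os

class ExampleDumper:
    """Minimal stand-in for a BaseDumper subclass (illustration only)."""

    def __init__(self):
        self.new_data_folder = "/tmp/example_src/2024-01-01"
        self.to_dump = []

    def remote_is_better(self, remotefile, localfile):
        return True  # pretend the remote copy is always worth fetching

    def create_todump_list(self, force=False, **kwargs):
        remote_files = ["pub/release/genes.tsv"]  # hypothetical remote listing
        for remote in remote_files:
            local = os.path.join(self.new_data_folder, os.path.basename(remote))
            # download if forced, missing locally, or remote is better
            if force or not os.path.exists(local) or self.remote_is_better(remote, local):
                self.to_dump.append({"remote": remote, "local": local})
```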
- property current_data_folder¶
- property current_release¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about the remote file (depends on the actual client).
- async dump(steps=None, force=False, job_manager=None, check_only=False, **kwargs)[source]¶
Dump (i.e. download) the resource as needed. This should be called after instance creation. The 'force' argument will force the dump; it is passed to the create_todump_list() method.
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- get_predicates()[source]¶
Return a list of predicates (functions returning true/false, as in math logic) which instructs/dictates if job manager should start a job (process/thread)
- property logger¶
- mark_success(dry_run=True)[source]¶
Mark the datasource as successfully dumped. This is useful when the datasource is unstable and needs to be downloaded manually.
- property new_data_folder¶
Generate a new data folder path using src_root_folder and the specified suffix attribute. Also sync the current (aka previous) data folder previously registered in the database. This method typically has to be called in create_todump_list() once the dumper actually knows some information about the resource, like the actual release.
- post_download(remotefile, localfile)[source]¶
Placeholder for adding a custom process once a file is downloaded. This is a good place to check the file's integrity. Optional.
- post_dump(*args, **kwargs)[source]¶
Placeholder to add a custom process once the whole resource has been dumped. Optional.
- post_dump_delete_files()[source]¶
Delete files after dump
Invoke this method in post_dump to synchronously delete the list of paths stored in self.to_delete, in order.
Non-recursive. If directories need to be removed, build the list such that files residing in the directory are removed first and then the directory. (Hint: see os.walk(dir, topdown=False))
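The os.walk hint above can be turned into a small helper that orders paths files-first, so that each directory is empty by the time its turn comes. The helper name is made up for illustration:

```python
import os

def files_then_dirs(root):
    # Walk bottom-up (os.walk with topdown=False) so each file precedes
    # its containing directory: deleting paths in this order empties
    # every directory before the directory itself is removed.
    paths = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
        paths.append(dirpath)
    return paths
```

A list built this way can be assigned to self.to_delete before calling post_dump_delete_files.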
- release_client()[source]¶
Do whatever necessary (like closing network connection) to “release” the client
- remote_is_better(remotefile, localfile)[source]¶
Check whether the remote file is worth downloading compared to the local file (e.g. because it is bigger or newer).
- property src_doc¶
- property src_dump¶
- to_delete: List[str | bytes | PathLike]¶
Populate with the list of relative paths of files to delete.
- class biothings.hub.dataload.dumper.DockerContainerDumper(*args, **kwargs)[source]¶
Bases:
BaseDumper
Start a Docker container (typically running on a different server) to prepare the data file inside the remote container, then download that file to the local data source folder. This dumper performs the following steps: - Boot up a container from the provided parameters: image, tag, container_name. - Override the container entrypoint with the long-running command "tail -f /dev/null". - When both container_name and image are provided, try to run the container named container_name.
If no container with container_name exists, start a new container from the image param and set its name to container_name.
- Run the dump_command inside this container. This command MUST block the dumper until the data file is completely prepared,
which guarantees that the remote file is ready for downloading.
Run the get_version_cmd inside this container, if provided. Its output is set as self.release.
Download the remote file via the Docker API and extract the downloaded .tar file.
- When the download is complete:
if keep_container=false: remove the container.
if keep_container=true: leave the container running.
If any exception occurs while dumping data, the remote container is not removed, which helps with troubleshooting.
- These are supported connection types from the Hub server to the remote Docker host server:
ssh: Prerequisite: SSH key-based authentication is configured
unix: Local connection
http: Use an insecure HTTP connection over a TCP socket
- https: Use a secured HTTPS connection using TLS. Prerequisites:
The Docker API on the remote server MUST BE secured with TLS
A TLS key pair is generated on the Hub server and placed inside the same data plugin folder or the data source folder
All info about the Docker client connection MUST BE defined in the config.py file, under the DOCKER_CONFIG key. An optional DOCKER_HOST can be used to override the Docker connection in any Docker dumper, regardless of the value of src_url. For example, you can set DOCKER_HOST="localhost" for local testing:
- DOCKER_CONFIG = {
      "CONNECTION_NAME_1": {
          "tls_cert_path": "/path/to/cert.pem",
          "tls_key_path": "/path/to/key.pem",
          "client_url": "https://remote-docker-host:port"
      },
      "CONNECTION_NAME_2": {
          "client_url": "ssh://user@remote-docker-host"
      },
      "localhost": {
          "client_url": "unix://var/run/docker.sock"
      }
  }
  DOCKER_HOST = "localhost"
- The data_url should match the following format:
docker://CONNECTION_NAME?image=DOCKER_IMAGE&tag=TAG&path=/path/to/remote_file&dump_command="this is custom command"&container_name=CONTAINER_NAME&keep_container=true&get_version_cmd="cmd"
- Supported params:
image: (Optional) the Docker image name
tag: (Optional) the image tag
container_name: (Optional) If this param is provided, the image param will be discarded when the dumper runs.
path: (Required) path to the remote file inside the Docker container.
dump_command: (Required) This command will be run inside the Docker container in order to create the remote file.
- keep_container: (Optional) accepted values: true/false, default: false.
If keep_container=true, the remote container will be persisted.
If keep_container=false, the remote container will be removed at the end of the dump step.
- get_version_cmd: (Optional) A custom command for checking the release version of the local and remote file. Note that:
This command must be runnable in both the local Hub (for checking the local file) and the remote container (for checking the remote file).
- "{}" MUST exist in the command; it will be replaced by the data file path when the dumper runs.
For example, get_version_cmd="md5sum {} | awk '{ print $1 }'" will be run as md5sum /path/to/remote_file | awk '{ print $1 }' (and likewise for /path/to/local_file).
- Ex:
docker://CONNECTION_NAME?image=IMAGE_NAME&tag=IMAGE_TAG&path=/path/to/remote_file(inside the container)&dump_command="run something with output written to /path/to/remote_file (inside the container)"
docker://CONNECTION_NAME?container_name=CONTAINER_NAME&path=/path/to/remote_file(inside the container)&dump_command="run something with output written to /path/to/remote_file (inside the container)"&keep_container=true&get_version_cmd="md5sum {} | awk '{ print $1 }'"
docker://localhost?image=dockstore_dumper&path=/data/dockstore_crawled/data.ndjson&dump_command="/home/biothings/run-dockstore.sh"&keep_container=1
docker://localhost?image=dockstore_dumper&tag=latest&path=/data/dockstore_crawled/data.ndjson&dump_command="/home/biothings/run-dockstore.sh"&keep_container=True
docker://localhost?image=praqma/network-multitool&tag=latest&path=/tmp/annotations.zip&dump_command="/usr/bin/wget https://s3.pgkb.org/data/annotations.zip -O /tmp/annotations.zip"&keep_container=false&get_version_cmd="md5sum {} | awk '{ print $1 }'"
docker://localhost?container_name=<YOUR CONTAINER NAME>&path=/tmp/annotations.zip&dump_command="/usr/bin/wget https://s3.pgkb.org/data/annotations.zip -O /tmp/annotations.zip"&keep_container=true&get_version_cmd="md5sum {} | awk '{ print $1 }'"
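The "{}" substitution described for get_version_cmd can be illustrated with a small sketch. The helper name is made up; it simply performs the replacement the docstring describes:

```python
def build_version_cmd(template, file_path):
    # "{}" in get_version_cmd stands for the data file path; plain
    # str.replace is used here because awk's literal braces would
    # break str.format. Hypothetical helper, for illustration only.
    return template.replace("{}", file_path)
```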
Container metadata: - All of the above data_url params can be pre-configured in the Dockerfile by adding LABELs. This config is used as the fallback for the data_url params:
The dumper looks for those params in both the data_url and the container metadata. If a param does not exist in the data_url, the dumper uses its value from the container metadata (if it exists, of course).
- For example, in the Dockerfile:
LABEL "path"="/tmp/annotations.zip"
LABEL "dump_command"="/usr/bin/wget https://s3.pgkb.org/data/annotations.zip -O /tmp/annotations.zip"
LABEL keep_container="true"
LABEL desc=test
LABEL container_name=mydocker
- CONTAINER_NAME = None¶
- DATA_PATH = None¶
- DOCKER_CLIENT_URL = None¶
- DOCKER_IMAGE = None¶
- DUMP_COMMAND = None¶
- GET_VERSION_CMD = None¶
- KEEP_CONTAINER = False¶
- MAX_PARALLEL_DUMP = 1¶
- ORIGINAL_CONTAINER_STATUS = None¶
- TIMEOUT = 300¶
- async create_todump_list(force=False, job_manager=None, **kwargs)[source]¶
Create the list of files to dump, called in the dump method. This method executes dump_command to generate the remote file in the Docker container, so it is defined as async to make it non-blocking.
- delete_or_restore_container()[source]¶
Delete the container if it’s created by the dumper, or restore it to its original status if it’s pre-existing.
- download(remote_file, local_file)[source]¶
Download "remote_file" to the local location defined by "local_file". Return relevant information about the remote file (depends on the actual client).
- generate_remote_file()[source]¶
Execute dump_command to generate the remote file, called in create_todump_list method
- get_remote_file()[source]¶
Return the remote file path within the container. In most cases, dump_command should either generate this file, or check that it's ready if another automated pipeline generates it.
- get_remote_lastmodified(remote_file)[source]¶
Get the last modified time of the remote file within the container, using the stat command.
- post_dump(*args, **kwargs)[source]¶
Delete container or restore the container status if necessary, called in the dump method after the dump is done (during the “post” step)
- prepare_dumper_params()[source]¶
Read all Docker dumper parameters from either the data plugin manifest or the Docker image/container metadata. Of course, at least one of the docker_image or container_name parameters must be defined in the data plugin manifest first. If a parameter is not defined in the data plugin manifest, we will try to read it from the Docker image metadata.
- prepare_local_folders(localfile)[source]¶
Prepare the local folder for the localfile, called in the download method.
- prepare_remote_container()[source]¶
Prepare the remote container and set self.container, called in the create_todump_list method.
- remote_is_better(remote_file, local_file)[source]¶
Compared to local file, check if remote file is worth downloading. (like either bigger or newer for instance)
- set_release()[source]¶
Call get_version_cmd to get the release, called in the create_todump_list method. If get_version_cmd is not defined, use a timestamp as the release.
This is currently a blocking method, assuming get_version_cmd is a quick command; if necessary, it can be made async in the future.
- property source_config¶
- class biothings.hub.dataload.dumper.DummyDumper(*args, **kwargs)[source]¶
Bases:
BaseDumper
DummyDumper does nothing… (useful for datasources that can't be downloaded anymore but still need to be integrated, i.e. to fill src_dump, etc.)
- class biothings.hub.dataload.dumper.DumperManager(job_manager, datasource_path='dataload.sources', *args, **kwargs)[source]¶
Bases:
BaseSourceManager
- SOURCE_CLASS¶
alias of
BaseDumper
- call(src, method_name, *args, **kwargs)[source]¶
Create a dumper for datasource "src" and call method "method_name" on it, with the given arguments. Used to make arbitrary calls on a dumper. "method_name" within the dumper definition must be a coroutine.
- clean_stale_status()[source]¶
During startup, search for actions in progress that would have been interrupted, and change their state to "canceled". For example, some downloading processes could have been interrupted; at startup, their "downloading" status should be changed to "canceled" to reflect the actual state of these datasources. This must be overridden in subclasses.
- get_schedule(dumper_name)[source]¶
Return the corresponding schedule for dumper_name. Example result:
{
    "cron": "0 9 * * *",
    "strdelta": "15h:20m:33s",
}
- class biothings.hub.dataload.dumper.FTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
- BLOCK_SIZE: int | None = None¶
- CWD_DIR = ''¶
- FTP_HOST = ''¶
- FTP_PASSWD = ''¶
- FTP_TIMEOUT = 600.0¶
- FTP_USER = ''¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about the remote file (depends on the actual client).
- class biothings.hub.dataload.dumper.FilesystemDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
This dumper works locally and copies (or moves) files to the datasource folder.
- FS_OP = 'cp'¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about the remote file (depends on the actual client).
- class biothings.hub.dataload.dumper.GitDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
The Git dumper gets data from a git repo. The repo is stored in SRC_ROOT_FOLDER (without versioning), and versions/releases are then fetched in SRC_ROOT_FOLDER/&lt;release&gt;.
- DEFAULT_BRANCH = None¶
- GIT_REPO_URL = None¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about the remote file (depends on the actual client).
- async dump(release='HEAD', force=False, job_manager=None, **kwargs)[source]¶
Dump (i.e. download) the resource as needed. This should be called after instance creation. The 'force' argument will force the dump; it is passed to the create_todump_list() method.
- property new_data_folder¶
Generate a new data folder path using src_root_folder and the specified suffix attribute. Also sync the current (aka previous) data folder previously registered in the database. This method typically has to be called in create_todump_list() once the dumper actually knows some information about the resource, like the actual release.
- class biothings.hub.dataload.dumper.GoogleDriveDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
HTTPDumper
- download(remoteurl, localfile)[source]¶
- remoteurl is a Google Drive link containing a document ID, such as:
https://drive.google.com/open?id=<1234567890ABCDEF>
https://drive.google.com/file/d/<1234567890ABCDEF>/view
It can also be just a document ID
- class biothings.hub.dataload.dumper.HTTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
Dumper using the HTTP protocol and the "requests" library.
- IGNORE_HTTP_CODE = []¶
- RESOLVE_FILENAME = False¶
- VERIFY_CERT = True¶
- download(remoteurl, localfile, headers={})[source]¶
Download "remoteurl" to the local location defined by "localfile". Return relevant information about the remote file (depends on the actual client).
- class biothings.hub.dataload.dumper.LastModifiedBaseDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
Use SRC_URLS as a list of URLs to download, and implement create_todump_list() according to that list. Should be used in conjunction with a dumper implementing the actual underlying protocol.
- SRC_URLS = []¶
- class biothings.hub.dataload.dumper.LastModifiedFTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
LastModifiedBaseDumper
SRC_URLS contains a list of URLs pointing to files to download. Use FTP's MDTM command to check whether files should be downloaded. The release is generated from the last file's MDTM in SRC_URLS, and formatted according to RELEASE_FORMAT. See also LastModifiedHTTPDumper, which works the same way but for the HTTP protocol. Note: this dumper is a wrapper over FTPDumper; one URL will give one FTPDumper instance.
- RELEASE_FORMAT = '%Y-%m-%d'¶
- download(urlremotefile, localfile, headers={})[source]¶
Download "urlremotefile" to the local location defined by "localfile". Return relevant information about the remote file (depends on the actual client).
- release_client()[source]¶
Do whatever necessary (like closing network connection) to “release” the client
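The release derivation from FTP MDTM replies can be sketched like this. The helper name is hypothetical; ftplib's FTP.sendcmd("MDTM &lt;file&gt;") does return a reply string of the shape shown:

```python
from datetime import datetime

RELEASE_FORMAT = "%Y-%m-%d"  # matches the class attribute above

def release_from_mdtm(mdtm_reply):
    # An FTP MDTM reply looks like "213 20240115103000"; parse its
    # timestamp and format it according to RELEASE_FORMAT.
    code, _, timestamp = mdtm_reply.partition(" ")
    if code != "213":
        raise ValueError("unexpected MDTM reply: %s" % mdtm_reply)
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime(RELEASE_FORMAT)
```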
- class biothings.hub.dataload.dumper.LastModifiedHTTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
HTTPDumper
,LastModifiedBaseDumper
Given a list of URLs, check the Last-Modified header to see whether each file should be downloaded. Subclasses should only have to declare SRC_URLS. Optionally, another header name can be used instead of Last-Modified, but the date format must follow RFC 2616. If that header doesn't exist, the data will always be downloaded (bypass). The release is generated from the last file's Last-Modified in SRC_URLS, and formatted according to RELEASE_FORMAT.
- ETAG = 'ETag'¶
- LAST_MODIFIED = 'Last-Modified'¶
- RELEASE_FORMAT = '%Y-%m-%d'¶
- RESOLVE_FILENAME = True¶
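The release derivation from a Last-Modified header (an RFC 2616 date) can be sketched as follows; the helper name is made up, and the stdlib parser shown handles that date format:

```python
import email.utils

RELEASE_FORMAT = "%Y-%m-%d"  # matches the class attribute above

def release_from_header(last_modified_value):
    # Parse an RFC 2616 date such as "Wed, 21 Oct 2015 07:28:00 GMT"
    # and format it according to RELEASE_FORMAT.
    dt = email.utils.parsedate_to_datetime(last_modified_value)
    return dt.strftime(RELEASE_FORMAT)
```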
- class biothings.hub.dataload.dumper.ManualDumper(*args, **kwargs)[source]¶
Bases:
BaseDumper
This dumper assists the user in dumping a resource. It usually expects the files to be downloaded first (sometimes there's no easy way to automate this process). Once downloaded, a call to dump() will make sure everything is fine in terms of files and metadata.
- async dump(path, release=None, force=False, job_manager=None, **kwargs)[source]¶
Dump (i.e. download) the resource as needed. This should be called after instance creation. The 'force' argument will force the dump; it is passed to the create_todump_list() method.
- property new_data_folder¶
Generate a new data folder path using src_root_folder and the specified suffix attribute. Also sync the current (aka previous) data folder previously registered in the database. This method typically has to be called in create_todump_list() once the dumper actually knows some information about the resource, like the actual release.
- class biothings.hub.dataload.dumper.WgetDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
- create_todump_list(force=False, **kwargs)[source]¶
Fill the self.to_dump list with dict("remote": remote_path, "local": local_path) elements. This is the todo list for the dumper, and a good place to check whether each file needs to be downloaded. If 'force' is True though, all files will be considered for download.
- download(remoteurl, localfile)[source]¶
Download "remoteurl" to the local location defined by "localfile". Return relevant information about the remote file (depends on the actual client).
biothings.hub.dataload.source¶
- class biothings.hub.dataload.source.SourceManager(source_list, dump_manager, upload_manager, data_plugin_manager)[source]¶
Bases:
BaseSourceManager
Helper class to get information about a datasource, whether it has a dumper and/or uploaders associated.
- reset(name, key='upload', subkey=None)[source]¶
Reset, i.e. delete, internal data (the src_dump document) for the given source name, key and subkey. This method is useful to clean outdated information in the Hub's internal database.
- Ex: key=upload, name=mysource, subkey=mysubsource will delete the entry in the corresponding
src_dump doc (_id=mysource), under key "upload", for the sub-source named "mysubsource".
"key" can be either "download", "upload" or "inspect". Because there's no notion of subkey for dumpers (i.e. "download"), subkey is optional.
biothings.hub.dataload.storage¶
biothings.hub.dataload.sync¶
Deprecated. This module is no longer used.
biothings.hub.dataload.uploader¶
- class biothings.hub.dataload.uploader.BaseSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
object
Default datasource uploader. Database storage can be done in batch or line by line. Duplicated records are not allowed.
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- config = &lt;ConfigurationWrapper over &lt;module 'config' from 'biothings/hub/default_config.py'&gt;&gt;¶
- classmethod create(db_conn_info, *args, **kwargs)[source]¶
Factory-like method that just returns an instance of this uploader (used by SourceManager). It may be overridden in a subclass to generate more than one instance per class, like a true factory. This is useful when a resource is split across different collections but the data structure doesn't change (it's really just data split across multiple collections, usually for parallelization purposes). Instead of writing an actual class for each split collection, the factory will generate them on-the-fly.
- property fullname¶
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- get_predicates()[source]¶
Return a list of predicates (functions returning true/false, as in math logic) which instructs/dictates if job manager should start a job (process/thread)
- keep_archive = 10¶
- async load(steps=('data', 'post', 'master', 'clean'), force=False, batch_size=10000, job_manager=None, **kwargs)[source]¶
Main resource load process; reads data from doc_c using chunks sized by batch_size. steps defines the different processes used to load the resource: - "data": will store actual data into single collections - "post": will perform post data load operations - "master": will register the master document in src_master
- load_data(data_path)[source]¶
Parse data from data_path and return a structure ready to be inserted in the database. In general, data_path is a folder path, but in parallel mode (using the parallelizer option), data_path is a file path. :param data_path: a folder path or a file path :return: structure ready to be inserted in the database
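For a dumper that wrote NDJSON files (such as APIDumper), a load_data implementation might look like this sketch; the file pattern is an assumption, not a requirement of the API:

```python
import glob
import json
import os

def load_data(data_path):
    # Read every NDJSON file in the dumped folder and yield one
    # document per non-empty line, ready for batched insertion.
    for filename in sorted(glob.glob(os.path.join(data_path, "*.ndjson"))):
        with open(filename) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)
```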
- main_source = None¶
- make_temp_collection()[source]¶
Create a temp collection for dataloading, e.g., entrez_geneinfo_INEMO.
- name = None¶
- post_update_data(steps, force, batch_size, job_manager, **kwargs)[source]¶
Override as needed to perform operations after data has been uploaded
- prepare_src_dump()[source]¶
Sync collection information (src_doc) with the src_dump collection. Return the src_dump collection.
- regex_name = None¶
- register_status(status, subkey='upload', **extra)[source]¶
Register step status, i.e. the status for a sub-resource.
- storage_class¶
alias of
BasicStorage
- switch_collection()[source]¶
After a successful load, rename temp_collection to the regular collection name, renaming the existing collection to a temp name for archiving purposes.
- unprepare()[source]¶
Reset anything that's not picklable (so self can be pickled). Return what's been reset as a dict, so self can be restored once unpickled.
- class biothings.hub.dataload.uploader.DummySourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
Dummy uploader; won't upload any data, assuming data is already there, but makes sure every other bit of information is there for the overall process (useful when online data isn't available anymore).
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- class biothings.hub.dataload.uploader.IgnoreDuplicatedSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
Same as the default uploader, but will store records and ignore any duplicated-record errors that occur (use with caution…). Storage is done using batched, unordered bulk operations.
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
IgnoreDuplicatedStorage
- class biothings.hub.dataload.uploader.MergerSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
MergerStorage
- class biothings.hub.dataload.uploader.NoBatchIgnoreDuplicatedSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
Same as the default uploader, but will store records and ignore any duplicated-record errors that occur (use with caution…). Storage is done line by line (slow, not using a batch) but preserves the order of data in the input file.
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
NoBatchIgnoreDuplicatedStorage
- class biothings.hub.dataload.uploader.NoDataSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
This uploader won't upload any data and won't even assume there's actual data (different from DummySourceUploader on this point). It's useful for instance when a mapping needs to be stored (get_mapping()) but the data doesn't come from an actual upload (i.e. it's generated).
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
NoStorage
- class biothings.hub.dataload.uploader.ParallelizedSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- class biothings.hub.dataload.uploader.UploaderManager(poll_schedule=None, *args, **kwargs)[source]¶
Bases:
BaseSourceManager
After registering datasources, manager will orchestrate source uploading.
- SOURCE_CLASS¶
alias of
BaseSourceUploader
- clean_stale_status()[source]¶
During startup, search for actions in progress that would have been interrupted, and change their state to "canceled". For example, some downloading processes could have been interrupted; at startup, their "downloading" status should be changed to "canceled" to reflect the actual state of these datasources. This must be overridden in subclasses.
- filter_class(klass)[source]¶
Gives a subclass the opportunity to check a given class and decide whether to keep it in the discovery process. Returning None means "skip it".
- poll(state, func)[source]¶
Search for sources in collection 'col' with a pending flag list containing 'state', and call 'func' for each document found (with the doc as the only param).
- register_classes(klasses)[source]¶
Register each class in the self.register dict. The key will be used to retrieve the source class, create an instance and run methods on it. This must be implemented in subclasses, as each manager may need to access its sources differently, based on different keys.