BCBIO_NEXTGEN(1) | bcbio-nextgen | BCBIO_NEXTGEN(1) |
bcbio_nextgen - bcbio_nextgen Documentation
A python toolkit providing best-practice pipelines for fully automated high throughput sequencing analysis. You write a high level configuration file specifying your inputs and analysis parameters. This input drives a parallel pipeline that handles distributed execution, idempotent processing restarts and safe transactional steps. The goal is to provide a shared community resource that handles the data processing component of sequencing analysis, providing researchers with more time to focus on the downstream biology.
Contents:
bcbio-nextgen provides best-practice pipelines for automated analysis of high throughput sequencing data with the goal of being:
A sample of institutions using bcbio-nextgen for solving biological problems. Please submit your story if you're using the pipeline in your own research.
We provide an automated script that installs third party analysis tools, required genome data and python library dependencies for running human variant and RNA-seq analysis, bundled into an isolated directory or virtual environment:
wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
  --genomes GRCh37 --aligners bwa --aligners bowtie2
bcbio should install cleanly on Linux systems. For Mac OSX, we suggest trying bcbio-vm, which runs bcbio on docs-cloud or isolates all the third party tools inside a Docker container. bcbio-vm is still a work in progress, but we recommend it because not all of the dependencies bcbio uses install cleanly on OSX.
With the command line above, indexes and associated data files go in /usr/local/share/bcbio-nextgen and tools are in /usr/local. If you don't have write permissions to install into the /usr/local directories you can install in a user directory like ~/local or use sudo chmod to give your standard user permissions. Please don't run the installer with sudo or as the root user. Do not use directories with : in the name; this is not POSIX compliant and will cause installation failures.
The installation is highly customizable, and you can install additional software and data later using bcbio_nextgen.py upgrade. Run python bcbio_nextgen_install.py with no arguments to see options for configuring the installation process. Some useful arguments are:
export PATH=/path_to_bcbio/bin:$PATH
The machine will need to have some basic requirements for installing and running bcbio:
Optional tool specific requirements:
The bcbio-nextgen Dockerfile contains the packages needed to install on bare Ubuntu systems.
The automated installer creates a fully integrated environment that allows simultaneous updates of the framework, third party tools and biological data. This offers the advantage over manual installation of being able to manage and evolve a consistent analysis environment as algorithms continue to evolve and improve. Installing this way is as isolated and self-contained as possible without virtual machines or lightweight system containers like Docker. The Upgrade section has additional documentation on including additional genome data for supported bcbio genomes. For genome builds not included in the defaults, see the documentation on config-custom. Following installation, you should edit the pre-created system configuration file in /usr/local/share/bcbio-nextgen/galaxy/bcbio_system.yaml to match your local system or cluster configuration (see tuning-cores).
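As a sketch, the relevant part of bcbio_system.yaml is a resources section with per-program core and memory settings; the values below are placeholders to adapt to your machine, and the default keys mirror those shown in the Resources section later in this documentation:

resources:
  default:
    cores: 16
    memory: 3G
    jvm_opts: ["-Xms750m", "-Xmx3500m"]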
We use the same automated installation process for performing upgrades of tools, software and data in place. Since there are multiple targets and we want to avoid upgrading anything unexpectedly, we have specific arguments for each. Generally, you'd want to upgrade the code, tools and data together with:
bcbio_nextgen.py upgrade -u stable --tools --data
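For example, to pull only the latest development version of the bcbio code without touching tools or data, a command along these lines can be used (assuming the -u flag also accepts a development target):

bcbio_nextgen.py upgrade -u development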
Tune the upgrade with these options:
bcbio installs associated data files for sequence processing, and you're able to customize this to install larger files or change the defaults. Use the --datatarget flag (potentially multiple times) to customize or add new targets.
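As a sketch, adding extra data targets to an existing installation might look like the following (gemini is referenced later under tumor-only calling; vep as a target name is an assumption to check against your version):

bcbio_nextgen.py upgrade --data --datatarget vep --datatarget gemini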
By default, bcbio will install data files for variation, rnaseq and smallrna but you can sub-select a single one of these if you don't require other analyses. The available targets are:
For somatic analyses, bcbio includes COSMIC v68 for hg19 and GRCh37 only. Due to license restrictions, we cannot include updated versions of this dataset and hg38 support with the installer. To prepare these datasets yourself you can use a utility script shipped with cloudbiolinux that downloads, sorts and merges the VCFs, then copies into your bcbio installation:
export COSMIC_USER="your@registered.email.edu"
export COSMIC_PASS="cosmic_password"
bcbio_python prepare_cosmic.py 83 /path/to/bcbio
We're not able to automatically install some useful tools due to licensing restrictions, so we provide a mechanism to manually download and add these to bcbio-nextgen during an upgrade with the --toolplus command line.
bcbio includes an installation of GATK4, which is freely available for all uses. This is the default runner for HaplotypeCaller or MuTect2. If you want to use an older version of GATK, it requires manual installation. This is freely available for academic users, but requires a license for commercial use. It is not freely redistributable so requires a manual download from the GATK download site. You also need to include tools_off: [gatk4] in your configuration for runs: see config-changing-defaults.
To install GATK3, register with the pre-installed gatk bioconda wrapper:
gatk-register /path/to/GenomeAnalysisTK.tar.bz2
If you're not using the most recent post-3.6 version of GATK, or using a nightly build, you can add --noversioncheck to the command line to skip comparisons to the GATK version.
MuTect2 is distributed with GATK in versions 3.5 and later.
To install versions of GATK < 3.6, download and unzip the latest version from the GATK distribution. Then make this jar available to bcbio-nextgen with:
bcbio_nextgen.py upgrade --tools --toolplus gatk=/path/to/gatk/GenomeAnalysisTK.jar
This will copy the jar and update your bcbio_system.yaml and manifest files to reflect the new version.
MuTect also has similar licensing terms and requires a license for commercial use. After downloading the MuTect jar, make it available to bcbio:
bcbio_nextgen.py upgrade --tools --toolplus mutect=/path/to/mutect/mutect-1.1.7.jar
Note that MuTect does not provide an easy way to query for the current version, so your input jar needs to include the version in the name.
bcbio-nextgen provides a wrapper around external tools and data, so the actual tools used drive the system requirements. For small projects, it should install on workstations or laptops with a couple Gb of memory, and then scale as needed on clusters or multicore machines.
Disk space requirements for the tools, including all system packages, are under 4Gb. Biological data requirements will depend on the genomes and aligner indices used, but a suggested install with GRCh37 and bowtie2/bwa indexes uses approximately 35Gb of storage during preparation and ~25Gb after:
$ du -shc genomes/Hsapiens/GRCh37/*
3.8G    bowtie2
5.1G    bwa
3.0G    rnaseq-2014-05-02
3.0G    seq
340M    snpeff
4.2G    variation
4.4G    vep
23.5G   total
Some steps retrieve third party tools from GitHub, which can run into issues if you're behind a proxy or block git ports. To instruct git to use https:// globally instead of git://:
$ git config --global url.https://github.com/.insteadOf git://github.com/
Most software tools used by bcbio require Java 1.8. bcbio distributes an OpenJDK Java build and uses it so you don't need to install anything. Older versions of GATK (< 3.6) and MuTect require a locally installed Java 1.7. If you have version incompatibilities, you'll see errors like:
Unsupported major.minor version 51.0
Fixing this requires either installing Java 1.7 for old GATK and MuTect or avoiding pointing to an incorrect java (unset JAVA_HOME). You can also tweak the java used by bcbio, described in the Automated installation section.
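To check what is currently being picked up, a couple of quick shell checks (plain shell, nothing bcbio-specific) are:

# show which java is first on the PATH and its version
which java && java -version
# ensure an external JAVA_HOME is not overriding the Java bundled with bcbio
unset JAVA_HOME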
Import errors with tracebacks containing Python libraries outside of the bcbio distribution (/path/to/bcbio/anaconda) are often due to other conflicting Python installations. bcbio tries to isolate itself as much as possible but external libraries can get included during installation due to the PYTHONHOME or PYTHONPATH environmental variables or local site libraries. These commands will temporarily unset those to get bcbio installed, after which it should ignore them automatically:
$ unset PYTHONHOME
$ unset PYTHONPATH
$ export PYTHONNOUSERSITE=1
Finally, having a .pydistutils.cfg file in your home directory can mess with where the libraries get installed. If you have this file in your home directory, temporarily renaming it to something else may fix your installation issue.
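For example, one way to do this is:

# temporarily move the config aside, then re-run the installer
mv ~/.pydistutils.cfg ~/.pydistutils.cfg.disabled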
The manual process does not allow the in-place updates and management of third party tools that the automated installer makes possible. It's a more error-prone and labor intensive process. If you find you can't use the installer we'd love to hear why to make it more amenable to your system. If you'd like to develop against a bcbio installation, see the documentation on setting up a code-devel-infrastructure.
The code drives a number of next-generation sequencing analysis tools that you need to install on any machines involved in the processing. The CloudBioLinux toolkit provides automated scripts to help with installation for both software and associated data files:
fab -f cloudbiolinux/fabfile.py -H localhost install_biolinux:flavor=ngs_pipeline_minimal
You can also install them manually, adjusting locations in the resources section of your bcbio_system.yaml configuration file as needed. The CloudBioLinux infrastructure provides a full list of third party software installed with bcbio-nextgen in packages-conda.yaml, which lists all third party tools installed through Bioconda.
In addition to existing bioinformatics software the pipeline requires associated data files for reference genomes, including pre-built indexes for aligners. The CloudBioLinux toolkit again provides an automated way to download and prepare these reference genomes:
fab -f data_fabfile.py -H localhost -c your_fabricrc.txt install_data_s3:your_biodata.yaml
The biodata.yaml file contains information about what genomes to download. The fabricrc.txt describes where to install the genomes by adjusting the data_files variable. This creates a tree structure that includes a set of Galaxy-style location files to describe locations of indexes:
├── galaxy
│   ├── tool-data
│   │   ├── alignseq.loc
│   │   ├── bowtie_indices.loc
│   │   ├── bwa_index.loc
│   │   ├── sam_fa_indices.loc
│   │   └── twobit.loc
│   └── tool_data_table_conf.xml
├── genomes
│   ├── Hsapiens
│   │   ├── GRCh37
│   │   └── hg19
│   └── phiX174
│       └── phix
└── liftOver
Individual genome directories contain indexes for aligners in individual sub-directories prefixed by the aligner name. This structured scheme helps manage aligners that don't have native Galaxy .loc files. The automated installer will download and set this up automatically:
`-- phix
|-- bowtie
| |-- phix.1.ebwt
| |-- phix.2.ebwt
| |-- phix.3.ebwt
| |-- phix.4.ebwt
| |-- phix.rev.1.ebwt
| `-- phix.rev.2.ebwt
|-- bowtie2
| |-- phix.1.bt2
| |-- phix.2.bt2
| |-- phix.3.bt2
| |-- phix.4.bt2
| |-- phix.rev.1.bt2
| `-- phix.rev.2.bt2
|-- bwa
| |-- phix.fa.amb
| |-- phix.fa.ann
| |-- phix.fa.bwt
| |-- phix.fa.pac
| |-- phix.fa.rbwt
| |-- phix.fa.rpac
| |-- phix.fa.rsa
| `-- phix.fa.sa
|-- novoalign
| `-- phix
|-- seq
| |-- phix.dict
| |-- phix.fa
| `-- phix.fa.fai
`-- ucsc
`-- phix.2bit
bcbio implements configurable SNP, indel and structural variant calling for germline populations. We include whole genome and exome evaluations against reference calls from the Genome in a Bottle consortium and Illumina Platinum Genomes project, enabling continuous assessment of new alignment and variant calling algorithms. We regularly report on these comparisons and continue to improve approaches as the community makes new tools available. Here is some of the research that contributes to the current implementation:
bcbio automates post-variant calling annotation to make the outputs easier to feed directly into your biological analysis. We annotate variant effects using snpEff or Variant Effect Predictor (VEP), and prepare a GEMINI database that associates variants with multiple external annotations in a SQL-based query interface. GEMINI databases have the most associated external information for human samples (GRCh37/hg19 and hg38) but are available for any organism with the database populated using the VCF INFO column and predicted effects.
The best approach to build a bcbio docs-config for germline calling is to use the automated-sample-config with one of the default templates:
You may also want to enable Structural variant calling for detection of larger events, which works with either caller. Another good source of inspiration are the configuration files from the example-pipelines, which may help identify other configuration variables of interest. A more complex setup with multiple callers and resolution of ensemble calls is generally only useful with a small population where you are especially concerned about sensitivity. Single caller detection with FreeBayes or GATK HaplotypeCaller provides good resolution of events.
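As a starting point, a minimal germline sample entry might look like this sketch (the caller and svcaller choices are illustrative, not a recommendation):

- description: Sample1
  analysis: variant2
  genome_build: GRCh37
  algorithm:
    variantcaller: gatk-haplotype
    svcaller: [lumpy, manta]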
When calling multiple samples, we recommend calling together to provide improved sensitivity and a fully squared off final callset. To associate samples together in a population add a metadata batch to the sample-configuration:
- description: Sample1
  metadata:
    batch: Batch1
- description: Sample2
  metadata:
    batch: Batch1
Batching samples results in output VCFs and GEMINI databases containing all merged sample calls. bcbio has two methods to call samples together:
- description: Sample1
  algorithm:
    variantcaller: gatk-haplotype
    jointcaller: gatk-haplotype-joint
  metadata:
    batch: Batch1
- description: Sample2
  algorithm:
    variantcaller: gatk-haplotype
    jointcaller: gatk-haplotype-joint
  metadata:
    batch: Batch1
bcbio supports somatic cancer calling with tumor and optionally matched normal pairs using multiple SNP, indel and structural variant callers. A full evaluation of cancer calling validates callers against synthetic dataset 3 from the ICGC-TCGA DREAM challenge. bcbio uses a majority voting ensemble approach to combining calls from multiple SNP and indel callers, and also flattens structural variant calls into a combined representation.
The example configuration for the example-cancer validation is a good starting point for setting up a tumor/normal run on your own dataset. The configuration works similarly to population based calling. Supply a consistent batch for tumor/normal pairs and mark them with the phenotype:
- description: your-tumor
  algorithm:
    variantcaller: [vardict, strelka2, mutect2]
  metadata:
    batch: batch1
    phenotype: tumor
- description: your-normal
  algorithm:
    variantcaller: [vardict, strelka2, mutect2]
  metadata:
    batch: batch1
    phenotype: normal
Other config-cancer configuration options allow tweaking of the processing parameters. For pairs you want to analyze together, specify a consistent set of variantcaller options for both samples.
Cancer calling handles both tumor-normal paired calls and tumor-only calling. To specify a tumor-only sample, provide a single sample labeled with phenotype: tumor. Otherwise the configuration and setup is the same as with paired analyses. For tumor-only samples, bcbio will try to remove likely germline variants present in the public databases like 1000 genomes and ExAC, and not in COSMIC. This runs as long as you have a local GEMINI data installation (--datatarget gemini) and marks likely germline variants with a LowPriority filter. This post has more details on the approach and validation.
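A tumor-only setup is therefore just a single-sample version of the paired configuration above; a sketch (caller choices are illustrative):

- description: your-tumor
  algorithm:
    variantcaller: [vardict, mutect2]
  metadata:
    batch: batch1
    phenotype: tumor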
The standard variant outputs (sample-caller.vcf.gz) for tumor calling emphasize somatic differences, those likely variants unique to the cancer. If you have a tumor-only sample and GEMINI data installed, it will also output sample-caller-germline.vcf.gz, which tries to identify germline background mutations based on presence in public databases. If you have tumor/normal data and would like to also call likely germline mutations see the documentation on specifying a germline caller: Somatic with germline variants.
We're actively working on improving calling to better account for the heterogeneity and structural variability that define cancer genomes.
For tumor/normal somatic samples, bcbio can call both somatic (tumor-specific) and germline (pre-existing) variants. The typical outputs of Cancer variant calling are likely somatic variants acquired by the cancer, but pre-existing germline risk variants are often also diagnostic.
For tumor-only cases we suggest running standard Cancer variant calling. Tumor-only inputs mix somatic and germline variants, making it difficult to separate events. For small variants (SNPs and indels) bcbio will attempt to distinguish somatic and germline mutations using the presence of variants in population databases.
To obtain somatic and germline calls for your tumor/normal inputs, specify which callers to use for each step in the variant-config configuration:
description: your-normal
variantcaller:
  somatic: vardict
  germline: freebayes
bcbio does a single alignment for the normal sample, then splits at the variant calling steps using this normal sample to do germline calling. In this example, the output files are:
Germline calling supports multiple callers, and other configuration options like ensemble and structural variant calling inherit from the rest of the configuration. For example, to use 3 callers for somatic and germline calling, create ensemble calls for both and include germline and somatic events from two structural variant callers:
variantcaller:
  somatic: [vardict, strelka2, mutect2]
  germline: [freebayes, gatk-haplotype, strelka2]
ensemble:
  numpass: 2
svcaller: [manta, cnvkit]
In addition to the somatic and germline outputs attached to the tumor and normal sample outputs as described above, you'll get:
bcbio can detect larger structural variants like deletions, insertions, inversions and copy number changes for both germline population and cancer variant calling, based on validation against existing truth sets:
To enable structural variant calling, specify svcaller options in the algorithm section of your configuration:
- description: Sample
algorithm:
svcaller: [lumpy, manta, cnvkit]
The best supported callers are Lumpy and Manta, for paired end and split read calling, CNVkit for read-depth based CNV calling, and WHAM for association testing. We also support DELLY, another excellent paired end and split read caller, although it is slow on large whole genome datasets.
In addition to results from individual callers, bcbio can create a summarized ensemble callset using MetaSV. We're actively working on improved structural variant reporting to highlight potential variants of interest.
bcbio can also be used to analyze RNA-seq data. It includes steps for quality control, adapter trimming, alignment, variant calling, transcriptome reconstruction and post-alignment quantitation at the level of the gene and isoform.
We recommend using the STAR aligner for all genomes where there are no alt alleles. For genomes such as hg38 that have alt alleles, hisat2 should be used as it handles the alts correctly and STAR does not yet. Use Tophat2 only if you do not have enough RAM available to run STAR (about 30 GB).
Our current recommendation is to run adapter trimming only if using the Tophat2 aligner. Adapter trimming is very slow, and aligners that soft clip the ends of reads such as STAR and hisat2, or algorithms using pseudoalignments like Sailfish handle contaminant sequences at the ends properly. This makes trimming unnecessary. Tophat2 does not perform soft clipping so if using Tophat2, trimming must still be done.
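Putting these recommendations together, an RNA-seq algorithm section using STAR without trimming might look like the following sketch (using trim_reads: false to disable trimming is an assumption; strandedness is included only for illustration):

algorithm:
  aligner: star
  trim_reads: false
  strandedness: unstranded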
Salmon, which is an extremely fast alignment-free method of quantitation, is run for all experiments. Salmon can accurately quantitate the expression of genes, even ones which are hard to quantitate with other methods (see this paper on Sailfish, which performs similarly to Salmon, for an example). Salmon can also quantitate at the transcript level, which can help gene-level analyses (see this paper for an example). We recommend using the Salmon quantitation rather than the counts from featureCounts to perform downstream quantification.
Although we do not recommend using the featureCounts based counts, the alignments are still useful because they give you many more quality metrics than the quasi-alignments from Salmon.
After a bcbio RNA-seq run there will be in the upload directory a directory for each sample which contains a BAM file of the aligned and unaligned reads, a sailfish directory with the output of Salmon, including TPM values, and a qc directory with plots from FastQC and qualimap.
In addition to directories for each sample, the upload directory contains a project directory which has a YAML file describing some summary statistics for each sample and some provenance data about the bcbio run. That directory also has a combined.counts file with the featureCounts derived counts per sample.
This mode of bcbio-nextgen quantitates transcript expression using Salmon and does nothing else. It is an order of magnitude faster or more than running the full RNA-seq analysis. The cost of the increased speed is that you will have much less information about your samples at the end of the run, which can make troubleshooting trickier. Invoke with analysis: fastrna-seq.
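A minimal sample entry for this mode might look like the following sketch (only the analysis value is taken from this section; the rest mirrors the RNA-seq examples below):

- description: Sample1
  analysis: fastrna-seq
  files: [/full/path/to/sample1_1.fastq, /full/path/to/sample1_2.fastq]
  genome_build: GRCh37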
bcbio-nextgen supports unique molecular identifier (UMI) based single-cell RNA-seq analyses. If your single-cell prep does not use UMIs, you can most likely just run the standard RNA-seq pipeline and use the results from that. UMIs are used to discard reads which are likely PCR duplicates, which is very helpful for removing some of the PCR duplicate noise that can dominate single-cell experiments.
Unlike the standard RNA-seq pipeline, the single-cell pipeline expects the FASTQ input files to not be separated by cellular barcode, so each file is a mix of cells identified by a cellular barcode (CB), with unique reads from a transcript identified by a UMI. bcbio-nextgen inspects each read, identifies the cellular barcode and UMI and puts them in the read name. The reads are then aligned to the transcriptome with RapMap and the number of reads aligning to each transcript is counted for each cellular barcode. The output is a table of counts with transcripts as the rows and the cellular barcodes as the columns, for each input FASTQ file.
Optionally the reads can be quantitated with kallisto to output transcript compatibility counts rather than counts per gene (TCC paper).
To extract the UMI and cellular barcodes from the read, bcbio-nextgen needs to know where the UMI and the cellular barcode are expected to be in the read. Currently there is support for two schemes, the inDrop system from the Harvard single-cell core facility and CEL-seq. If bcbio-nextgen does not support your UMI and barcoding scheme, please open up an issue and we will help implement support for it.
Most of the heavy lifting for this part of bcbio-nextgen is implemented in the umis repository.
bcbio-nextgen also implements a configurable best-practices pipeline for smallRNA-seq quality controls, adapter trimming, miRNA/isomiR quantification and other small RNA detection.
The pipeline generates an RMD template file inside the report folder that can be rendered with knitr. An example of the report is available here. The count table (counts_mirna.tsv) from mirbase miRNAs will be inside the mirbase or final project folder. Input files for the isomiRs package for isomiR analysis will be inside each sample in the mirbase folder. If mirdeep2 can run, a count table (counts_mirna_novel.tsv) for novel miRNAs will be inside the mirdeep2 or final project folder. tdrmapper results will be inside each sample inside the tdrmapper or final project folder.
The bcbio-nextgen implementation of ChIP-seq aligns, removes multimapping reads, calls peaks with a paired input file using MACS2 and outputs a set of greylist regions for filtering possible false peaks in regions of high depth in the input file.
This pipeline implements alignment and QC tools. Furthermore, it will run qsignature to detect possible duplicated or mislabeled samples. It uses a SNP signature to create a distance matrix that makes it easy to group samples. The project YAML file will show the number of total samples analyzed, the number of very similar samples, and samples that could be duplicates.
We will assume that you installed bcbio-nextgen with the automated installer, and so your default bcbio_system.yaml file is configured correctly with all of the tools pointing to the right places. If that is the case, to run bcbio-nextgen on a set of samples you just need to set up a YAML file that describes your samples and what you would like to do to them. Let's say that you have a single paired-end control lane, prepared with the Illumina TruSeq Kit from a human. Here is what a well-formed sample YAML file for that RNA-seq experiment would look like:
fc_date: '070113'
fc_name: control_experiment
upload:
  dir: final
details:
- files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq]
description: 'Control_rep1'
genome_build: GRCh37
analysis: RNA-seq
algorithm:
aligner: tophat2
quality_format: Standard
trim_reads: read_through
adapters: [truseq, polya]
strandedness: unstranded
fc_date and fc_name will be combined to form a prefix to name intermediate files, and can be set to whatever you like. upload is explained pretty well in the configuration documentation and the above will direct bcbio-nextgen to put the output files from the pipeline into the final directory. Under details is a list of sections each describing a sample to process. You can set many parameters under each section but most of the time just setting a few like the above is all that is necessary. analysis tells bcbio-nextgen to run the best-practice RNA-seq pipeline on this sample.
In the above, since there are two files, control_1.fastq and control_2.fastq will be automatically run as paired-end data. If you have single end data you can just supply one file and it will run as single-end. The description field will be used to eventually rename the files, so make it very evocative since you will be looking at it a lot later. genome_build is self-explanatory.
Sometimes you need a little more flexibility than the standard pipeline provides, and the algorithm section has many options to fine-tune its behavior. quality_format tells bcbio-nextgen what quality format your FASTQ inputs are using; if your samples were sequenced any time past 2009 or so, you probably want to set it to Standard. Adapter read-through is a problem in RNA-seq libraries, so we want to trim off possible adapter sequences on the ends of reads; setting trim_reads to read_through does this and also trims off poor quality ends. Since your library is an RNA-seq library prepared with the TruSeq kit, the sequences to trim off are the TruSeq adapters and possible polyA tails, so adapters is set to both of those. strandedness can be set if your library was prepared in a strand-specific manner and can be set to firststrand, secondstrand or unstranded (the default).
Let's say you have a set of mouse samples to analyze and each sample is a single lane of single-end RNA-seq reads prepared using the Nextera kit. There are two case and two control samples. Here is a sample configuration file for that analysis:
fc_date: '070113'
fc_name: mouse_analysis
upload:
  dir: final
details:
- files: [/full/path/to/control_rep1.fastq]
description: 'Control_rep1'
genome_build: mm10
analysis: RNA-seq
algorithm:
aligner: tophat2
quality_format: Standard
trim_reads: read_through
adapters: [nextera, polya]
- files: [/full/path/to/control_rep2.fastq]
description: 'Control_rep2'
genome_build: mm10
analysis: RNA-seq
algorithm:
aligner: tophat2
quality_format: Standard
trim_reads: read_through
adapters: [nextera, polya]
- files: [/full/path/to/case_rep1.fastq]
description: 'Case_rep1'
genome_build: mm10
analysis: RNA-seq
algorithm:
aligner: tophat2
quality_format: Standard
trim_reads: read_through
adapters: [nextera, polya]
- files: [/full/path/to/case_rep2.fastq]
description: 'Case_rep2'
genome_build: mm10
analysis: RNA-seq
algorithm:
aligner: tophat2
quality_format: Standard
trim_reads: read_through
adapters: [nextera, polya]
More samples are added just by adding more entries under the details section. This is tedious and error prone to do by hand, so there is an automated template system for common experiments. You could set up the previous experiment by making a mouse version of the illumina-rnaseq template file and saving it to a local file such as illumina-mouse-rnaseq.yaml. Then you can set up the sample file using the templating system:
bcbio_nextgen.py -w template illumina-mouse-rnaseq.yaml mouse_analysis /full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq /full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq
If you have paired-end samples instead of single-end samples, you can still use the template system as long as the forward and reverse read filenames are the same apart from a _1 and _2. For example: control_1.fastq and control_2.fastq will be detected as paired and combined in the YAML file output by the templating system.
bcbio_nextgen.py -w template gatk-variant project1 sample1.bam sample2_1.fq sample2_2.fq
This uses a standard template (GATK best practice variant calling) to automate creation of a full configuration for all samples. See automated-sample-config for more details on running the script, and manually edit the base template or final output file to incorporate project specific configuration. The example pipelines provide a good starting point and the sample-configuration documentation has full details on available options.
bcbio_nextgen.py bcbio_sample.yaml -n 8
bcbio encourages a project structure like:
my-project/
├── config
├── final
└── work
with the input configuration in the config directory, the outputs of the pipeline in the final directory, and the actual processing done in the work directory. Run the bcbio_nextgen.py script from inside the work directory to keep all intermediates there. The final directory, relative to the parent directory of the work directory, is the default location specified in the example configuration files and gets created during processing. The final directory has all of the finished outputs and you can remove the work intermediates to cleanup disk space after confirming the results. All of these locations are configurable and this project structure is only a recommendation.
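Assuming that layout, a typical session looks something like the following (file names are placeholders):

mkdir -p my-project/config my-project/work
# put your generated sample configuration (e.g. bcbio_sample.yaml) in my-project/config
cd my-project/work
bcbio_nextgen.py ../config/bcbio_sample.yaml -n 8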
There are 3 logging files in the log directory within your working folder:
We supply example input configuration files for validation and to help in understanding the pipeline.
This input configuration runs whole genome variant calling using bwa, GATK HaplotypeCaller and FreeBayes. It uses a father/mother/child trio from the CEPH NA12878 family: NA12891, NA12892, NA12878. Illumina's Platinum genomes project has 50X whole genome sequencing of the three members. The analysis compares results against a reference NA12878 callset from NIST's Genome in a Bottle initiative.
To run the analysis do:
mkdir -p NA12878-trio-eval/config NA12878-trio-eval/input NA12878-trio-eval/work
cd NA12878-trio-eval/config
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate.yaml
cd ../input
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate-getdata.sh
bash NA12878-trio-wgs-validate-getdata.sh
cd ../work
bcbio_nextgen.py ../config/NA12878-trio-wgs-validate.yaml -n 16
This is a large whole genome analysis and meant to test both pipeline scaling and validation across the entire genome. It can take multiple days to run depending on available cores. It requires 300Gb for the input files and 1.3Tb for the work directory. Smaller examples below exercise the pipeline with less disk and computational requirements.
We also have a more extensive evaluation that includes 2 additional variant callers, Platypus and samtools, and 3 different methods of calling variants: single sample, pooled, and incremental joint calling. This uses the same input data as above but a different input configuration file:
mkdir -p NA12878-trio-eval/work_joint
cd NA12878-trio-eval/config
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-joint.yaml
cd ../work_joint
bcbio_nextgen.py ../config/NA12878-trio-wgs-joint.yaml -n 16
This example calls variants on NA12878 exomes from EdgeBio's clinical sequencing pipeline, and compares them against reference materials from NIST's Genome in a Bottle initiative. This supplies a full regression pipeline to ensure consistency of calling between releases and updates of third party software. The pipeline performs alignment with bwa mem and variant calling with FreeBayes, GATK UnifiedGenotyper and GATK HaplotypeCaller. Finally it integrates all 3 variant calling approaches into a combined ensemble callset.
This is a large full exome example with multiple variant callers, so can take more than 24 hours on machines using multiple cores.
First get the input configuration file, fastq reads, reference materials and analysis regions:
mkdir -p NA12878-exome-eval
cd NA12878-exome-eval
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-exome-methodcmp-getdata.sh
bash NA12878-exome-methodcmp-getdata.sh
Finally run the analysis, distributed on 8 local cores, with:
cd work
bcbio_nextgen.py ../config/NA12878-exome-methodcmp.yaml -n 8
The grading-summary.csv contains detailed comparisons of the results to the NIST reference materials, enabling rapid comparisons of methods.
This example calls variants using multiple approaches in a paired tumor/normal cancer sample from the ICGC-TCGA DREAM challenge. It uses synthetic dataset 3 which has multiple subclones, enabling detection of lower frequency variants. Since the dataset is freely available and has a truth set, this allows us to do a full evaluation of variant callers.
To get the data:
mkdir -p cancer-dream-syn3/config cancer-dream-syn3/input cancer-dream-syn3/work
cd cancer-dream-syn3/config
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-dream-syn3.yaml
cd ../input
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-dream-syn3-getdata.sh
bash cancer-dream-syn3-getdata.sh
Run with:
cd ../work
bcbio_nextgen.py ../config/cancer-dream-syn3.yaml -n 8
The configuration and data file has downloads for exome only and whole genome analyses. It enables exome by default, but you can use the larger whole genome evaluation by uncommenting the relevant parts of the configuration and retrieval script.
This example simulates somatic cancer calling using a mixture of two Genome in a Bottle samples, NA12878 as the "tumor" mixed with NA24385 as the background. The Hartwig Medical Foundation and Utrecht Medical Center generated this "tumor/normal" pair by physical mixing of samples prior to sequencing. The GiaB FTP directory has more details on the design and truth sets. The sample has variants at 15% and 30%, providing the ability to look at lower frequency mutations.
To get the data:
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-giab-na12878-na24385-getdata.sh
bash cancer-giab-na12878-na24385-getdata.sh
Then run the analysis with:
cd work
bcbio_nextgen.py ../config/cancer-giab-na12878-na24385.yaml -n 16
This example runs structural variant calling with multiple callers (Lumpy, Manta and CNVkit), providing a combined output summary file and validation metrics against NA12878 deletions. It uses the same NA12878 input as the whole genome trio example.
To run the analysis do:
mkdir -p NA12878-sv-eval
cd NA12878-sv-eval
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-sv-getdata.sh
bash NA12878-sv-getdata.sh
cd work
bcbio_nextgen.py ../config/NA12878-sv.yaml -n 16
This is a large whole genome analysis and the timing and disk space requirements for the NA12878 trio analysis above apply here as well.
This example aligns and creates count files for use with downstream analyses using a subset of the SEQC data from the FDA's Sequencing Quality Control project.
Get the setup script and run it; this will download six samples from the SEQC project, three from the HBRR panel and three from the UHRR panel. This will require about 100GB of disk space for the input files. It will also set up a configuration file for the run, using the templating system:
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/rnaseq-seqc-getdata.sh
bash rnaseq-seqc-getdata.sh
Now go into the work directory and run the analysis:
cd seqc/work
bcbio_nextgen.py ../config/seqc.yaml -n 8
This will run a full scale RNAseq experiment using Tophat2 as the aligner and will take a long time to finish on a single machine. At the end it will output counts, Cufflinks quantitation and a set of QC results about each lane. If you have a cluster you can parallelize it to speed it up considerably.
A nice looking standalone report of the bcbio-nextgen run can be generated using bcbio.rnaseq. Check that repository for details.
Validate variant calling on human genome build 38, using two different builds (with and without alternative alleles) and three different validation datasets (Genome in a Bottle prepared with two methods and Illumina platinum genomes). To run:
mkdir -p NA12878-hg38-val
cd NA12878-hg38-val
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-hg38-validate-getdata.sh
bash NA12878-hg38-validate-getdata.sh
mkdir -p work
cd work
bcbio_nextgen.py ../config/NA12878-hg38-validate.yaml -n 16
An input configuration for running whole genome variant calling with bwa and GATK, using Illumina's Platinum genomes project (NA12878-illumina.yaml). See this blog post on whole genome scaling for expected run times and more information about the pipeline. To run the analysis:
├── config
│   └── NA12878-illumina.yaml
├── input
└── work
cd input
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_2.fastq.gz
cd config
wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-illumina.yaml
cd work
bcbio_nextgen.py ../config/NA12878-illumina.yaml -n 16
The test suite exercises the scripts driving the analysis, so it is a good starting point to ensure correct installation. Tests use the pytest framework. The tests are available in the bcbio source code:
$ git clone https://github.com/bcbio/bcbio-nextgen.git
There is a small wrapper script that finds the py.test and other dependencies pre-installed with bcbio, which you can use to run tests:
$ cd tests
$ ./run_tests.sh
You can use this to run specific test targets:
$ ./run_tests.sh cancer
$ ./run_tests.sh rnaseq
$ ./run_tests.sh devel
$ ./run_tests.sh docker
Optionally, you can run pytest directly from the bcbio install to tweak more options. It will be in /path/to/bcbio/anaconda/bin/py.test. Pass -s to py.test to see the stdout log, and -v to make py.test output more verbose. The tests are marked with labels which you can use to run a specific subset of the tests using the -m argument:
$ py.test -m rnaseq
To run unit tests:
$ py.test tests/unit
To run integration pipeline tests:
$ py.test tests/integration
To run tests which use bcbio_vm:
$ py.test tests/bcbio_vm
To see the test coverage, add the --cov=bcbio argument to py.test.
By default the test suite will use your installed system configuration for running tests, substituting the test genome information instead of using full genomes. If you need a specific testing environment, copy tests/data/automated/post_process-sample.yaml to tests/data/automated/post_process.yaml to provide a test-only configuration.
Two configuration files, in easy to write YAML format, specify details about your system and samples to run:
Commented system and sample example files are available in the config directory. The example-pipelines section contains additional examples of ready to run sample files.
bcbio-nextgen provides a utility to create configuration files for multiple sample inputs using a base template. Start with one of the best-practice templates, or define your own, then apply to multiple samples using the template workflow command:
bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq
samplename,description,batch,phenotype,sex,variant_regions
sample1,ERR256785,batch1,normal,female,/path/to/regions.bed
sample2,ERR256786,batch1,tumor,,/path/to/regions.bed
The first column links the metadata to a specific input file. The template command tries to identify the samplename from read group information in a BAM file, or uses the base filename if no read group information is present. For BAM files, this would be the filename without the extension and path (/path/to/yourfile.bam => yourfile). For FASTQ files, the template functionality will identify pairs using standard conventions (_1 and _2, including Illumina extensions like _R1), so use the base filename without these (/path/to/yourfile_R1.fastq => yourfile). Note that paired-end samples sequentially numbered without leading zeros (e.g., sample_1_1.fastq, sample_1_2.fastq, sample_2_1.fastq, sample_2_2.fastq, etc.) will likely not be parsed correctly; see #1919 for more info. In addition, . characters can be problematic, so it's better to avoid them and use a . only as the separator for the file extension.
samplename,description,phenotype,batch,sex,ethnicity,maternal_id,paternal_id,family_id
NA12878.bam,NA12878,-9,CEPH,female,-9,NA12892,NA12891,NA12878FAM
Individual column items can contain booleans (true or false), integers, or lists (separated by semi-colons). These get converted into the expected type in the output YAML file. For instance, to specify a sample that should go into multiple batches:
samplename,description,phenotype,batch
normal.bam,two_normal,normal,Batch1;Batch2
For dictionary inputs like somatic-w-germline-variants setups, you can separate items in a dictionary with colons and double colons, and also use semicolons for lists:
samplename,description,phenotype,variantcaller
tumor.bam,sample1,tumor,germline:freebayes;gatk-haplotype::somatic:vardict;freebayes
The name of the metadata file, minus the .csv extension, is a short name identifying the current project. The script creates a project1 directory containing the sample configuration in project1/config/project1.yaml.
To make it easier to define your own project specific template, an optional first step is to download and edit a local template. First retrieve a standard template:
bcbio_nextgen.py -w template freebayes-variant project1
This pulls the current GATK best practice variant calling template into your project directory in project1/config/project1-template.yaml. Manually edit this file to define your options, then run the full template creation for your samples, pointing to this custom configuration file:
bcbio_nextgen.py -w template project1/config/project1-template.yaml project1.csv folder/*
If your sample folder contains additional BAM or FASTQ files you do not wish to include in the sample YAML configuration, you can restrict the output to only include samples in the metadata CSV with --only-metadata. The output will print warnings about samples not present in the metadata file, then leave these out of the final output YAML:
bcbio_nextgen.py -w template --only-metadata project1/config/project1-template.yaml project1.csv folder/*
In case you have multiple FASTQ or BAM files for each sample you can use bcbio_prepare_samples.py. The main parameters are:
samplename,description,batch,phenotype,sex,variant_regions
file1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file1.fastq,sample2,batch1,tumor,,/path/to/regions.bed
An example of usage is:
bcbio_prepare_samples.py --out merged --csv project1.csv
The script will create sample1.fastq and sample2.fastq in the merged folder, along with a new CSV file in the same folder as the input CSV: project1-merged.csv. This can then be used with bcbio:
bcbio_nextgen.py -w template project1/config/project1-template.yaml project1-merged.csv merged/*fastq
The new CSV file will look like:
samplename,description,batch,phenotype,sex,variant_regions
sample1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
sample2.fastq,sample2,batch1,tumor,,/path/to/regions.bed
It supports parallelization the same way bcbio_nextgen.py does:
python $BCBIO_PATH/scripts/utils/bcbio_prepare_samples.py --out merged --csv project1.csv -t ipython -q queue_name -s lsf -n 1
See more examples at parallelize pipeline.
In case of paired reads, the CSV file should contain all files:
samplename,description,batch,phenotype,sex,variant_regions
file1_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file1_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
The script will try to guess the paired files the same way that bcbio_nextgen.py -w template does. It will detect paired files if the only difference between two files is _R1/_R2, -1/-2, _1/_2 or .1/.2.
The output CSV will look like the following and is compatible with bcbio:
samplename,description,batch,phenotype,sex,variant_regions
sample1,sample1,batch1,normal,female,/path/to/regions.bed
The sample configuration file defines details of each sample to process:
details:
- analysis: variant2
lane: 1
description: Example1
files: [in_pair_1.fq, in_pair_2.fq]
genome_build: hg19
algorithm:
platform: illumina
metadata:
batch: Batch1
sex: female
platform_unit: flowcell-barcode.lane
library: library_type
The upload section of the sample configuration file describes where to put the final output files of the pipeline. At its simplest, you can configure bcbio-nextgen to upload results to a local directory, for example a folder shared amongst collaborators or a Dropbox account. You can also configure it to upload results automatically to a Galaxy instance, to Amazon S3 or to iRODS. Here is the simplest configuration, uploading to a local directory:
upload:
dir: /local/filesystem/directory
General parameters, always required:
Galaxy parameters:
Here is an example configuration for uploading to a Galaxy instance. This assumes you have a shared mounted filesystem that your Galaxy instance can also access:
upload:
method: galaxy
dir: /path/to/shared/galaxy/filesystem/folder
galaxy_url: http://url-to-galaxy-instance
galaxy_api_key: YOURAPIKEY
galaxy_library: data_library_to_upload_to
Your Galaxy universe_wsgi.ini configuration needs to have allow_library_path_paste = True set to enable uploads.
S3 parameters:
For S3 access credentials, set the standard environmental variables, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION or use IAM access roles with an instance profile on EC2 to give your instances permission to create temporary S3 access.
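A sketch of an S3 upload block (the bucket and folder values are placeholders, and the parameter names are assumptions to check against your bcbio version):

upload:
  method: s3
  bucket: your-results-bucket
  folder: project-name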
iRODS parameters:
example configuration
Uploads to iRODS depend on a valid installation of the iCommands CLI, and a preconfigured connection through the iinit command.
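A sketch of what an iRODS upload block could look like (the folder and resource parameter names here are assumptions; adjust paths to your iRODS zone):

upload:
  method: irods
  dir: ../final
  folder: /irodsZone/projects/your_project
  resource: yourIrodsResource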
You can define files used multiple times in the algorithm section of your configuration in a top level globals dictionary. This saves copying and pasting across the configuration and makes it easier to manually adjust the configuration if inputs change:
globals:
  my_custom_locations: /path/to/file.bed
details:
- description: sample1
algorithm:
variant_regions: my_custom_locations
- description: sample2
algorithm:
variant_regions: my_custom_locations
The YAML configuration file provides a number of hooks to customize analysis in the sample configuration file. Place these under the algorithm keyword.
atropos (https://github.com/jdidion/atropos)
These BED files define the regions of the genome to analyze and report on. variant_regions adjusts regions for small variant (SNP and indel) calling. sv_regions defines regions for structural variant calling if different than variant_regions. For coverage-based quality control metrics, we first use coverage if specified, then sv_regions if specified, then variant_regions. See the section on Input file preparation for tips on ensuring chromosome naming in these files match your reference genome. bcbio pre-installs some standard BED files for human analyses. Reference these using the naming schemes described in the reference data repository.
ploidy:
default: 2
mitochondrial: 1
female: 2
male: 1
dbscSNV
bcbio installs pre-prepared configuration files in genomes/build/config/vcfanno or you can specify the full path to a /path/your/anns.conf and optionally an equivalently named /path/your/anns.lua file. This value can be a list for multiple inputs.
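For example, to combine a pre-installed configuration with a custom one (a sketch; somatic as a pre-installed configuration name is an assumption):

algorithm:
  vcfanno: [somatic, /path/your/anns.conf]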
bcbio pre-installs standard truth sets for performing validation, and also allows use of custom local files for assessing reliability of your runs:
Each option can be either the path to a local file, or a partial path to a file in the pre-installed truth sets. For instance, to validate an NA12878 run against the Genome in a Bottle truth set:
validate: giab-NA12878/truth_small_variants.vcf.gz
validate_regions: giab-NA12878/truth_regions.bed
svvalidate:
DEL: giab-NA12878/truth_DEL.bed
Truth sets follow the same naming schemes for small variants, regions and different structural variant types. bcbio has the following validation materials for germline validations:
and for cancer validations:
For more information on the hg38 truth set preparation see the work on validation on build 38 and conversion of human build 37 truth sets to build 38. The installation recipes contain provenance details about the origins of the installed files.
Unique molecular identifiers (UMIs) are short random barcodes used to tag individual molecules and avoid amplification bias. Both single cell RNA-seq and variant calling support UMIs. For variant calling, fgbio collapses sequencing duplicates for each UMI into a single consensus read prior to running re-alignment and variant calling. This requires mark_duplicates: true (the default) since it uses position based duplicates and UMI tags for collapsing duplicate reads into consensus sequences.
To help with preparing fastq files with UMIs bcbio provides a script bcbio_fastq_umi_prep.py. This handles two kinds of UMI barcodes:
In both cases, these get converted into paired reads with UMIs in the fastq names, allowing specification of umi_type: fastq_name in your bcbio YAML configuration. The script runs on a single set of files or autopairs an entire directory of fastq files. To convert a directory with separate UMI files:
bcbio_fastq_umi_prep.py autopair -c <cores_to_use> <list> <of> <fastq> <files>
To convert duplex barcodes present on the ends of read 1 and read 2:
bcbio_fastq_umi_prep.py autopair -c <cores_to_use> --tag1 5 --tag2 5 <list> <of> <fastq> <files>
Configuration options for UMIs:
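As a sketch, a minimal algorithm section for UMI-aware variant calling with fastq name based UMIs (both values are taken from this section) might be:

algorithm:
  umi_type: fastq_name
  mark_duplicates: true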
You can adjust the fgbio default options by adjusting Resources. The most common change is modifying the minimum number of reads as input to consensus sequences. This defaults to 1 to avoid losing reads, but you can set it to a larger value for high depth panels:
resources:
fgbio:
options: [--min-reads, 2]
bcbiornaseq:
organism: homo sapiens
interesting_groups: [treatment, genotype, etc, etc]
You will also need to turn on bcbiornaseq via tools_on: [bcbiornaseq].
You can pass different parameters for macs2 adding to Resources:
resources:
macs2:
options: ["--broad"]
resources:
preseq:
extrap_fraction: 5
steps: 500
seg_len: 5000
And you can also set extrap and step parameters directly, as well as provide any other command line option via options:
resources:
preseq:
extrap: 10000000
step: 30000
options: ["-D"]
resources:
multiqc:
options: ["--cl_config", "'read_count_multiplier: 1'"]
or as thousands:
resources:
multiqc:
options: ["--cl_config", "'{read_count_multiplier: 0.001, read_count_prefix: K}'"]
bcbio provides some hints to change default behavior by either turning specific defaults on or off, with tools_on and tools_off. Both can be lists with multiple options:
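For example (the two entries below are independent illustrations taken from elsewhere in this documentation, not a combination you would normally use together):

algorithm:
  tools_off: [gatk4]
  tools_on: [bcbiornaseq]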
In addition to single method variant calling, we support calling with multiple calling methods and consolidating into a final Ensemble callset.
The recommended method to do this uses a simple majority rule ensemble classifier that builds a final callset based on the intersection of calls. It selects variants represented in at least a specified number of callers:
variantcaller: [mutect2, varscan, freebayes, vardict]
ensemble:
numpass: 2
use_filtered: false
This example selects variants present in 2 out of the 4 callers and does not use filtered calls (the default behavior). Because of the difficulties of producing a unified FORMAT/genotype field across callers, the ensemble output contains a mix of outputs from the different callers. It picks a representative call in the order of the specified callers, so the example above would have a MuTect2 call if present, otherwise a VarScan call if present, otherwise a FreeBayes call. This may require custom normalization scripts during post-processing when using these calls. bcbio.variation.recall implements this approach, which handles speed and file sorting limitations in the bcbio.variation approach.
This older approach uses the bcbio.variation toolkit to perform the consolidation. An example configuration in the algorithm section is:
variantcaller: [gatk, freebayes, samtools, gatk-haplotype, varscan]
ensemble:
format-filters: [DP < 4]
classifier-params:
type: svm
classifiers:
balance: [AD, FS, Entropy]
calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
trusted-pct: 0.65
The ensemble set of parameters configure how to combine calls from the multiple methods:
The resources section allows customization of locations of programs and memory and compute resources to devote to them:
resources:
bwa:
cores: 12
cmd: /an/alternative/path/to/bwa
samtools:
cores: 16
memory: 2G
gatk:
jvm_opts: ["-Xms2g", "-Xmx4g"]
resources:
sentieon:
keyfile:
SENTIEON_LICENSE_SERVER: 100.100.100.100:8888
SENTIEON_AUTH_MECH: XXX
SENTIEON_AUTH_DATA: signature
You also use the resource section to specify system specific parameters like global temporary directories:
resources:
tmp:
dir: /scratch
This is useful on cluster systems with large attached local storage, where you can avoid some shared filesystem IO by writing temporary files to the local disk. When setting this keep in mind that the global temporary disk must have enough space to handle intermediates. The space differs between steps but generally you'd need to have 2 times the largest input file per sample and account for samples running simultaneously on multiple core machines.
To handle clusters that specify local scratch space with an environmental variable, bcbio will resolve environmental variables like:
resources:
tmp:
dir: $YOUR_SCRATCH_LOCATION
To override any of the global resource settings in a sample specific manner, you write a resource section within your sample YAML configuration. For example, to create a sample specific temporary directory and pass a command line option to novoalign, write a sample resource specification like:
- description: Example
analysis: variant2
resources:
novoalign:
options: ["-o", "FullNW", "--rOQ"]
tmp:
dir: tmp/sampletmpdir
To adjust resources for an entire run, you can add this resources specification at the top level of your sample YAML:
details:
- description: Example
  resources:
default:
cores: 16
By default, bcbio writes the logging-output directory to log in the main directory of the run. You can configure this to a different location in your bcbio_system.yaml with:
log_dir: /path/to/logs
Input files for supplementing analysis, like variant_regions need to match the specified reference genome. A common cause of confusion is the two chromosome naming schemes for human genome build 37: UCSC-style in hg19 (chr1, chr2) and Ensembl/NCBI style in GRCh37 (1, 2). To help avoid some of this confusion, in build 38 we only support the commonly agreed on chr1, chr2 style.
It's important to ensure that the chromosome naming in your input files match those in the reference genome selected. bcbio will try to detect this and provide helpful errors if you miss it.
To convert chromosome names, you can use Devon Ryan's collection of chromosome mappings as an input to sed. For instance, to convert hg19 chr-style coordinates to GRCh37:
wget --no-check-certificate -qO- http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh37_UCSC2ensembl.txt \
  | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
sed -f remap.sed original.bed > final.bed
Each genome build has an associated buildname-resources.yaml configuration file which contains organism specific naming and resource files. bcbio-nextgen expects a resource file present next to the genome FASTA file. Example genome configuration files are available, and automatically installed for natively supported genomes. Create these by hand to support additional organisms or builds.
The major sections of the file are:
By default, we place the buildname-resources.yaml files next to the genome FASTA files in the reference directory. For custom setups, you can specify an alternative directory in the config-resources section of your bcbio_system.yaml file:
resources:
  genome:
    dir: /path/to/resources/files
The pipeline requires access to reference genomes, including the raw FASTA sequence and pre-built indexes for aligners. The automated installer will install reference files and indexes for commonly used genomes (see the upgrade-install documentation for command line options).
For human genomes, we recommend using build 38 (hg38). This is fully supported and validated in bcbio, and corrects a lot of issues in the previous build 37. We use the 1000 genomes distribution which includes HLAs and decoy sequences. For human build 37, GRCh37 and hg19, we use the 1000 genomes references provided in the GATK resource bundle. These differ in chromosome naming: hg19 uses chr1, chr2, chr3 style contigs while GRCh37 uses 1, 2, 3. They also differ slightly in content: GRCh37 has masked pseudoautosomal regions on chromosome Y, allowing alignment to these regions on chromosome X.
You can use pre-existing data and reference indexes by pointing bcbio-nextgen at these resources. We use the Galaxy .loc files approach to describing the location of the sequence and index data, as described in data-requirements. This does not require a Galaxy installation since the installer sets up a minimal set of .loc files. It finds these by identifying the root galaxy directory, in which it expects a tool-data sub-directory with the .loc files. It can do this in two ways:
To manually make genomes available to bcbio-nextgen, edit the individual .loc files with locations to your reference and index genomes. You need to edit sam_fa_indices.loc to point at the FASTA files and then any genome indexes corresponding to aligners you'd like to use (for example: bwa_index.loc for bwa and bowtie2_indices.loc for bowtie2). The database key names used (like GRCh37 and mm10) should match those used in the genome_build of your sample input configuration file.
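As a hedged illustration, a sam_fa_indices.loc entry is a tab-separated line pointing a database key at a FASTA file; the column layout shown here is an assumption, so copy the pattern of an existing entry in your installation rather than trusting it verbatim:
printf 'index\tGRCh37\t/path/to/genomes/Hsapiens/GRCh37/seq/GRCh37.fa\n' \
    >> /path/to/bcbio/galaxy/tool-data/sam_fa_indices.loc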
bcbio_setup_genome.py will help you install a custom genome and apply all changes needed to the configuration files. It needs the genome in FASTA format and the annotation file in GTF or GFF3 format. It can create indexes for all aligners used by bcbio. It will also create the rnaseq folder so you can run the RNA-seq pipeline without further configuration.
bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135
If you want to add smallRNA-seq data files, you will need to add the three-letter miRBase code for your genome (e.g. hsa for human) and a GTF file annotating the small RNAs. You can use the same file as the transcriptome annotation if no other is available.
bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135 --species cel --srna_gtf another_annotation.gtf
To use that genome, you just need to configure your YAML files with:
genome_build: WBcel135
To perform variant calling and predict effects in a custom genome you'd have to manually download and link this into your installation. First find the snpEff genome build:
$ snpEff databases | grep Lactobacillus | grep pentosus
Lactobacillus_pentosus_dsm_20314    Lactobacillus_pentosus_dsm_20314    ENSEMBL_BFMPP_32_179    http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
Lactobacillus_pentosus_kca1         Lactobacillus_pentosus_kca1         ENSEMBL_BFMPP_32_179    http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
then download to the appropriate location:
$ cd /path/to/bcbio/genomes/Lacto/Lactobacillus_pentosus
$ mkdir snpEff
$ cd snpEff
$ wget http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ unzip snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ find . -name "Lactobacillus_pentosus_dsm_20314"
./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314
$ mv ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314 .
finally add to your genome configuration file (seq/Lactobacillus_pentosus-resources.yaml):
aliases:
  snpeff: Lactobacillus_pentosus_dsm_20314
For adding an organism not present in snpEff, please see this mailing list discussion.
The pipeline runs in parallel in two different ways:
bcbio has two ways to specify core usage, helping provide options for parallelizing different types of processes:
resources:
  default:
    memory: 2G
    cores: 16
    jvm_opts: ["-Xms750m", "-Xmx2000m"]
bcbio uses these settings, along with memory requests, to determine how to partition jobs. For example, if you had -n 32 and cores: 16 for a run on a single 32 core machine, this would run two simultaneous bwa mapping jobs using 16 cores each.
Memory specifications (both in memory and jvm_opts) are per-core. bcbio takes care of adjusting this memory to match the cores used. In the example above, if bcbio was running a 16 core java process, it would use 32Gb of memory for the JVM, adjusting Xmx and Xms to match cores used. Internally bcbio looks at the memory and CPU usage on a machine and matches your configuration options to the available system resources. It will scale down core requests if memory is limiting, avoiding over-scheduling resources during the run. You ideally want to set both memory and jvm_opts to match the average memory per core on the run machine and adjust upwards if this does not provide enough memory for some processes during the run.
For single machine runs with a small number of samples, you generally want to set cores close to or equal the number of total cores you're allocating to the job with -n. This will allow individual samples to process as fast as possible and take advantage of multicore software.
For distributed jobs, you want to set cores to match the available cores on a single node in your cluster, then use -n as a multiple of this to determine how many nodes to spin up. For example, cores: 16 and -n 64 would try to make four 16 core machines available for analysis.
Running using multiple cores only requires setting the -n command line flag:
bcbio_nextgen.py bcbio_sample.yaml -t local -n 12
IPython parallel provides a distributed framework for performing parallel computation in standard cluster environments. The bcbio-nextgen setup script installs both IPython and pyzmq, which provides Python bindings for the ZeroMQ messaging library. The only additional requirement is that the work directory where you run the analysis is accessible to all processing nodes. This is typically accomplished with a distributed file system like NFS, Gluster or Lustre.
Run an analysis using ipython for parallel execution:
bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 12 -s lsf -q queue
The -s flag specifies a type of scheduler to use (lsf, sge, torque, slurm, pbspro).
The -q flag specifies the queue to submit jobs to.
The -n flag defines the total number of cores to use on the cluster during processing. The framework will select the appropriate number of cores and type of cluster (single core versus multi-core) to use based on the pipeline stage (see the internals-parallel section in the internals documentation for more details). For multiple core steps, the number of cores to use for programs like bwa, novoalign and gatk comes from the config-resources section of the configuration. Ensure the cores specification matches the physical cores available on machines in your cluster, and the pipeline will divide the total cores specified by -n into the appropriate number of multicore jobs to run.
The pipeline default parameters assume a system with minimal time to obtain processing cores and consistent file system accessibility. These defaults allow the system to fail fast in the case of cluster issues which need diagnosis. For running on shared systems with high resource usage and potential failures due to intermittent cluster issues, there are tuning parameters that increase resiliency. The --timeout flag specifies the number of minutes to wait for a cluster to start up before timing out. This defaults to 15 minutes. The --retries flag specifies the number of times to retry a job on failure. In systems with transient distributed file system hiccups like lock errors or disk availability, this provides recoverability at the cost of resubmitting jobs that may have failed for reproducible reasons.
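For example, a more resilient IPython run on a busy SLURM cluster might look like the following; the queue name and retry/timeout values are illustrative:
bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 64 -s slurm -q general --retries 3 --timeout 30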
Finally, the -r resources flag specifies resource options to pass along to the underlying queue scheduler. This currently supports SGE's -l parameter, Torque's -l parameter and LSF and SLURM native flags. This allows specification of resources to the scheduler (see the qsub man page). You may specify multiple resources, so -r mem=4g -r ct=01:40:00 translates to -l mem=4g -l ct=01:40:00 when passed to qsub, or -r "account=a2010002" -r "timelimit=04:00:00" when using SLURM, for instance. SLURM and Torque support specification of an account parameter with -r account=your_name, which IPython transfers into -A.
SGE supports special parameters passed using resources to help handle the heterogeneity of possible setups.
Specify an SGE parallel environment that supports using multiple cores on a single node with -r pename=your_pe. Since this setup is system specific, it is hard to write general code for finding a suitable environment. Specifically, when there are multiple usable parallel environments, bcbio will select the first one, which may not be correct. Manually specifying it with a pename= flag to resources ensures selection of the right environment. If you're administering a grid engine cluster and are not sure how to set this up, you'd typically want an smp queue using allocation_rule: $pe_slots like in this example pename configuration or smp template.
SGE has other specific flags you may want to tune, depending on your setup. To specify an advanced reservation with the -ar flag, use -r ar=ar_id. To specify an alternative memory management model instead of mem_free, use -r memtype=approach. It is further recommended to configure mem_free (or any other chosen memory management model) as a consumable, requestable resource in SGE to prevent overfilling hosts that do not have sufficient memory per slot. This can be done in two steps. First, launch qmon as an admin, select Complex Configuration, click on mem_free, set the Consumable column to JOB (instead of YES or NO) and click Modify for the changes to take effect. Second, for each host in the queue, configure mem_free as a complex value. If a host called myngshost has 128GB of RAM, the corresponding command would be qconf -mattr exechost complex_values mem_free=128G myngshost.
There are also special -r resources parameters to support pipeline configuration:
Parallel jobs can often terminate with rather generic failures like any of the following:
These errors unfortunately don't help diagnose the problem, and you'll likely see the actual error triggering this generic exception earlier in the run. This error can often be hard to find due to parallelization.
If you run into a confusing failure like this, the best approach is to re-run with a single core:
bcbio_nextgen.py your_input.yaml -n 1
which should produce a more helpful debug message right above the failure.
It's also worth re-trying the failed command line outside of bcbio to look for errors. You can find the failing command by cross-referencing the error message with command lines in log/bcbio-nextgen-commands.log. You may have to change temporary directories (tx/tmp*) in some of the job outputs. Reproducing the error outside of bcbio is a good first step to diagnosing and fixing the underlying issue.
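For example, to pull candidate command lines for a failing step out of the commands log (the program name to search for depends on where your run failed):
grep gatk log/bcbio-nextgen-commands.log | less -S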
This may occur if the current execution is a re-run of a previous project:
Networking problems on clusters can prevent the IPython parallelization framework from working properly. Be sure that the compute nodes on your cluster are aware of IP addresses that they can use to communicate with each other (usually these will be local IP addresses). Running:
python -c 'import socket; print(socket.gethostbyname(socket.gethostname()))'
should return such an IP address (as opposed to localhost). This can be fixed by adding an entry to the hosts file.
The line:
host-ip hostname
where host-ip is replaced by the actual IP address of the machine and hostname by the machine's own hostname, should be added to /etc/hosts on each compute node. This will probably involve contacting your local cluster administrator.
The memory information specified in the system configuration config-resources enables scheduling of memory intensive processes. The values are specified on a memory-per-core basis and thus bcbio-nextgen handles memory scheduling by:
As a result of these calculations, the cores used during processing will not always correspond to the maximum cores provided in the input -n parameter. The goal is rather to intelligently maximize cores and memory while staying within system resources. Note that memory specifications are for a single core, and the pipeline takes care of adjusting this to actual cores used during processing.
bcbio automatically tries to determine the total available memory and cores per machine for balancing resource usage. For multicore runs, it retrieves total memory from the current machine. For parallel runs, it spawns a job on the queue and extracts the system information from that machine. This expects a homogeneous set of machines within a cluster queue. You can see the determined cores and total memory in provenance/system-ipython-queue.yaml.
For heterogeneous clusters or other cases where bcbio does not correctly identify available system resources, you can manually set the machine cores and total memory in the resource section of either a sample YAML file (sample-resources) or bcbio_system.yaml:
resources:
  machine:
    memory: 48.0
    cores: 16
The memory value is the total available on the machine in Gb, so this example specifies that individual machines have 48Gb of total memory and 16 cores.
bcbio-nextgen scales out on clusters including hundreds of cores and is stress tested on systems with 1000 simultaneous processes. Scaling up often requires system specific tuning to handle simultaneous processes. This section collects useful tips and tricks for managing scaling issues.
A common failure mode is having too many open file handles. This error can come from the IPython infrastructure logs as ZeroMQ attempts to open sockets, or from the processing logs as third party software opens file handles. You can check your available file handles with ulimit -a | grep open. Setting open file handle limits is operating system and cluster specific; below are tips for specific setups.
In addition to open file handle limits (ulimit -n), large processes may also run into issues with the available maximum user processes (ulimit -u). Some systems set a low soft limit (ulimit -Su) like 1024 but a higher hard limit (ulimit -Hu), allowing adjustment without root privileges. The IPython controllers and engines do this automatically, but the main bcbio_nextgen.py driver process cannot. If the scheduler puts this process on the same node as worker processes, you may run into open file handle limits due to work happening on the workers. To fix this, manually set ulimit -u a_high_number as part of the submission process for the main process.
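A minimal sketch of a SLURM submission script that raises these limits before launching the main driver; the scheduler directives, limit values and queue names are illustrative assumptions, not bcbio defaults:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --cpus-per-task=1
# raise soft limits toward the hard limits before starting the driver
ulimit -u 100000
ulimit -n 16384
bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 64 -s slurm -q general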
For a Ubuntu system, edit /etc/security/limits.conf to set the soft and hard nofile descriptors, and edit /etc/pam.d/common-session to add pam_limits.so. See this blog post for more details.
For CentOS/RedHat systems, edit /etc/security/limits.conf and /etc/security/limits.d/90-nproc.conf to increase maximum open files and user limits.
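As an illustrative sketch of the kind of entries involved (the limit values and wildcard scope are assumptions; adjust them to your site policies):
echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* soft nproc 65536" | sudo tee -a /etc/security/limits.d/90-nproc.conf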
SGE needs configuration at the qmaster level. Invoke qconf -mconf from a host with admin privileges, and edit execd_params:
execd_params S_DESCRIPTORS=20000
bcbio-nextgen makes use of distributed network file systems to manage sharing large files between compute nodes. While we strive to minimize disk-based processing by making use of pipes, the pipeline still has a major IO component. To help manage IO and network bottlenecks, this section contains pointers on deployments and benchmarking. Please contribute your tips and thoughts.
bcbio runs with Common Workflow Language (CWL) compatible parallelization software. bcbio generates a CWL workflow from a standard bcbio sample YAML description file and any tool that supports CWL input can run the workflow. CWL-based tools do the work of managing files and workflows, and bcbio performs the biological analysis using either a Docker container or a local installation.
bcbio creates CWL for alignment, small variant calls (SNPs and indels), coverage assessment, HLA typing, quality control and structural variant calling. It generates a CWL v1.0.2 compatible workflow. The actual biological code execution during runs works with either a bcbio docker container or a local installation of bcbio.
The implementation includes bcbio's approaches to splitting and batching analyses. At the top level workflow, we parallelize by samples. Using sub-workflows, we split fastq inputs into sections for parallel alignment over multiple machines, followed by merging. We also use sub-workflows, along with CWL records, to batch multiple samples and run in parallel. This enables pooled and tumor/normal cancer calling with parallelization by chromosome regions based on coverage calculations.
bcbio supports these CWL-compatible tools:
We plan to continue to expand CWL support to include more components of bcbio, and also need to evaluate the workflow on larger, real life analyses. This includes supporting additional CWL runners. We're working on evaluating Galaxy/Planemo for integration with the Galaxy community.
bcbio-vm installs all dependencies required to generate CWL and run bcbio, along with supported CWL runners. There are two install choices, depending on your usage of bcbio: running CWL with an existing local bcbio install, or running with containers.
To run bcbio without using containers, first install bcbio and make it available in your path. You'll need both the bcbio code and tools. To only run the tests and bcbio validations, you don't need a full data installation, so you can install with --nodata.
To then install bcbio-vm, add the --cwl flag to the install:
bcbio_nextgen.py upgrade --cwl
Adding this to any future upgrades will also update the bcbio-vm wrapper code and tools.
When you begin running your own analysis and need the data available, pre-prepare your bcbio data directory with bcbio_nextgen.py upgrade --data --cwl.
If you don't have an existing local bcbio installation and want to run with CWL using the tools and data embedded in containers, you can do a stand-alone install of just bcbio-vm. To install using Miniconda and bioconda packages on Linux:
wget http://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
export TARGETDIR=~/install/bcbio-vm/anaconda
export BINDIR=/usr/local
bash Miniconda2-latest-Linux-x86_64.sh -b -p $TARGETDIR
$TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda bcbio-nextgen
$TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda bcbio-nextgen-vm
ln -s $TARGETDIR/bin/bcbio_vm.py $BINDIR/bcbio_vm.py
ln -s $TARGETDIR/bin/conda $BINDIR/bcbiovm_conda
ln -s $TARGETDIR/bin/python $BINDIR/bcbiovm_python
In the above commands, the bcbio-vm install goes in $TARGETDIR. The example uses your home directory; set it to anywhere you have space. Also, as an alternative to symbolic linking into a $BINDIR, you can add the install bin directory to your PATH:
export PATH=$TARGETDIR/bin:$PATH
This install includes the bcbio-nextgen libraries, used in generating CWL and orchestrating runs, but is not a full bcbio installation. It requires Docker on your system; that is all you need to get started running examples, since the CWL runners will pull in Docker containers with the bcbio tools.
To make it easy to get started, we have pre-built CWL descriptions that use test data. These run in under 5 minutes on a local machine and don't require a bcbio installation if you have Docker available on your machine:
wget -O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz
tar -xzvpf test_bcbio_cwl.tar.gz
cd test_bcbio_cwl-master/somatic
bash run_cromwell.sh
bash run_bunny.sh
bash run_toil.sh
Or you can run directly using the bcbio_vm.py wrappers:
bcbio_vm.py cwlrun cromwell somatic-workflow
bcbio_vm.py cwlrun toil somatic-workflow
bcbio_vm.py cwlrun bunny somatic-workflow
These wrappers automatically handle temporary directories, permissions, logging and re-starts. If running without Docker, using a local installation of bcbio, add --no-container to the commands in the shell scripts.
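For example, to run the somatic test workflow with Cromwell against a local bcbio installation instead of the Docker container:
bcbio_vm.py cwlrun cromwell somatic-workflow --no-container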
The first step in running your analysis project in bcbio is to generate CWL. If you're already familiar with bcbio, the process of preparing information about your sample inputs and analysis is almost identical:
In addition to resources specifications, the bcbio system file now also includes paths to the reference biodata and optionally input file directories if you want to avoid specifying full paths to your inputs in the bcbio_vm.py template command. bcbio will recursively look up file locations within those inputs, and this has the advantage of working identically for non-local file locations. Here is an example for a 16 core machine with 3.5Gb of memory per core:
local:
  ref: /path/to/bcbio/genomes/Hsapiens
  inputs:
    - /path/to/input/files
resources:
  default:
    cores: 16
    memory: 3500M
    jvm_opts: [-Xms1g, -Xmx3500m]
Generate CWL with:
bcbio_vm.py template --systemconfig bcbio_system.yaml template.yaml samples.csv [optional list of fastq or BAM inputs]
bcbio_vm.py cwl --systemconfig bcbio_system.yaml samples/config/samples.yaml
producing a sample-workflow output directory with the CWL.
On a first CWL generation run with a new genome, this process will run for a longer time as it needs to make your reference compatible with CWL. This includes creating single tar.gz files from some reference directories so they can get passed to CWL steps where they'll get unpacked. This process only happens a single time and keeps unpacked versions so your reference setup is compatible with both old bcbio IPython and new CWL runs.
You can now run this with any CWL-compatible runner; the bcbio_vm.py cwlrun wrappers standardize running across multiple tools in different environments.
The Cromwell workflow management system runs bcbio either locally on a single machine or distributed on a cluster using a scheduler like SLURM, SGE or PBSPro.
To run a bcbio CWL workflow locally using Docker:
bcbio_vm.py cwlrun cromwell sample-workflow
If you want to run from a locally installed bcbio, add --no-container to the command line.
To run distributed on a SLURM cluster:
bcbio_vm.py cwlrun cromwell sample-workflow --no-container -q your_queue -s slurm -r timelimit=0-12:00
Tweak scheduler parameters using the same options as the older bcbio IPython approach.
To control the resources used by Cromwell, set --joblimit to the allowed jobs allocated concurrently. This isn't total cores used, but rather the number of jobs either locally or remotely scheduled concurrently. Since CWL steps are heterogeneous and use only the cores necessary for that job, the total cores used will max out at joblimit times the maximum cores for an individual process. Setting this helps avoid over-committing jobs to a shared scheduler during highly parallel processes like variant calling.
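As an illustrative sketch, capping Cromwell at 10 concurrently scheduled jobs might look like the following; the value is an arbitrary example:
bcbio_vm.py cwlrun cromwell sample-workflow --joblimit 10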
Cromwell can also run directly on cloud resources: docs-cloud-gcp.
The Toil pipeline management system runs CWL workflows in parallel on a local machine, on a cluster or at AWS.
To run a bcbio CWL workflow locally with Toil using Docker:
bcbio_vm.py cwlrun toil sample-workflow
If you want to run from a locally installed bcbio, add --no-container to the command line.
To run distributed on a Slurm cluster:
bcbio_vm.py cwlrun toil sample-workflow -- --batchSystem slurm
bcbio generated CWL workflows run on Arvados, and these instructions detail how to run on the Arvados public instance. The Arvados cwl-runner comes pre-installed with bcbio-vm. We have a publicly accessible project, called bcbio_resources, that contains the latest Docker images, test data and genome references you can use for runs.
Retrieve API keys from the Arvados public instance. Login, then go to 'User Icon -> Personal Token'. Copy and paste the commands given there into your shell. You'll specifically need to set ARVADOS_API_HOST and ARVADOS_API_TOKEN.
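The copied commands amount to two exports along these lines; the host and token values below are placeholders, so use the ones shown in your Arvados account:
export ARVADOS_API_HOST=qr1hi.arvadosapi.com
export ARVADOS_API_TOKEN=your_token_here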
To run an analysis:
arv-put --name testdata_genomes --project-uuid $PROJECT_ID testdata/genomes/hg19
arv-put --name testdata_inputs --project-uuid $PROJECT_ID testdata/100326_FC6107FAAXX testdata/automated testdata/reference_material
arvados:
  reference: qr1hi-4zz18-kuz1izsj3wkfisq
  input: [qr1hi-j7d0g-h691y6104tlg8b4]
resources:
  default: {cores: 4, memory: 2G, jvm_opts: [-Xms750m, -Xmx2500m]}
bcbio_vm.py template --systemconfig bcbio_system_arvados.yaml testcwl_template.yaml testcwl.csv
To generate the CWL from the system and sample configuration files:
bcbio_vm.py cwl --systemconfig bcbio_system_arvados.yaml testcwl/config/testcwl.yaml
arv-copy $JOBS_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
arv-copy $BCBIO_VC_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
or import local Docker images to your Arvados project:
docker pull arvados/jobs:1.0.20180216164101
arv-keepdocker --project $PROJECT_ID -- arvados/jobs 1.0.20180216164101
docker pull quay.io/bcbio/bcbio-vc
arv-keepdocker --project $PROJECT_ID -- quay.io/bcbio/bcbio-vc latest
bcbio_vm.py cwlrun arvados arvados_testcwl-workflow -- --project-uuid $PROJECT_ID
bcbio runs on the DNAnexus platform by converting bcbio generated CWL into DNAnexus workflows and apps using dx-cwl. This describes the process using the bcbio workflow app (bcbio-run-workflow) and bcbio workflow applet (bcbio_resources:/applets/bcbio-run-workflow) in the public bcbio_resources project. Both are regularly updated and maintained on the DNAnexus platform. Secondarily, we also show how to install and create workflows locally for additional control and debugging.
dx new project $PNAME
dx select $PNAME
dx upload -p --path /data/input *.bam
dnanexus:
  project: PNAME
  ref:
    project: bcbio_resources
    folder: /reference_genomes
  inputs:
    - /data/input
    - /data/input/regions
resources:
  default: {cores: 8, memory: 3000M, jvm_opts: [-Xms1g, -Xmx3000m]}
samplename,description,batch,phenotype
file1.bam,sample1,b1,tumor
file2.bam,sample2,b1,normal
file3.bam,sample3,b2,tumor
file4.bam,sample4,b2,normal
details:
  - algorithm:
      aligner: bwa
      variantcaller: gatk-haplotype
    analysis: variant2
    genome_build: hg38
TEMPLATE=germline
APP_VERSION=0.0.2
FOLDER=/bcbio/$PNAME
dx select "$PROJECT"
dx mkdir -p $FOLDER
for F in $TEMPLATE-template.yaml $PNAME.csv bcbio_system-dnanexus.yaml
do
    dx rm -a /$FOLDER/$F || true
    dx upload --path /$FOLDER/ $F
done
dx ls $FOLDER
dx rm -a -r /$FOLDER/dx-cwl-run || true
dx run bcbio-run-workflow/$APP_VERSION -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run
Alternatively if you want the latest bcbio code, change the final command to use the applet. Everything else in the script is identical:
dx run bcbio_resources:/applets/bcbio-run-workflow -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run
The app will look up all files, prepare a bcbio CWL workflow, convert it into a DNAnexus workflow, and submit it to the platform. The workflow runs as a standard DNAnexus workflow and you can monitor it through the command line (with dx find executions --root job-YOURJOBID and dx watch) or the web interface (Monitor tab).
If you prefer not to use the DNAnexus app, you can also submit jobs locally by installing bcbio-vm on your local machine. This can also be useful to test generation of CWL and manually ensure identification of all your samples and associated files on the DNAnexus platform.
TEMPLATE=germline
rm -rf $PNAME $PNAME-workflow
bcbio_vm.py template --systemconfig bcbio_system-dnanexus.yaml $TEMPLATE-template.yaml $PNAME.csv
bcbio_vm.py cwl --systemconfig bcbio_system-dnanexus.yaml $PNAME/config/$PNAME.yaml
dx env
dx-cwl compile-workflow $PNAME-workflow/main-$PNAME.cwl \
--project PROJECT_ID --token $DX_AUTH_TOKEN \
--rootdir $FOLDER/dx-cwl-run
FOLDER=/bcbio/$PNAME
dx mkdir -p $DX_PROJECT_ID:$FOLDER/$PNAME-workflow
dx upload -p --path $DX_PROJECT_ID:$FOLDER/$PNAME-workflow $PNAME-workflow/main-$PNAME-samples.json
dx-cwl run-workflow $FOLDER/dx-cwl-run/main-$PNAME/main-$PNAME \
    $FOLDER/$PNAME-workflow/main-$PNAME-samples.json \
    --project PROJECT_ID --token $DX_AUTH_TOKEN \
    --rootdir $FOLDER/dx-cwl-run
bcbio generates a common workflow language description. Internally, bcbio represents the files and information related to processing as a comprehensive dictionary. This world object describes the state of a run and associated files, and new processing steps update or add information to it. The world object is roughly equivalent to CWL's JSON-based input object, but CWL enforces additional annotations to identify files and models new inputs/outputs at each step. The work in bcbio is to move from our laissez-faire approach to the more structured CWL model.
The generated CWL workflow is in run_info-cwl-workflow:
To help with defining the outputs at each step, there is a WorldWatcher object that can output changed files and world dictionary objects between steps in the pipeline when running bcbio in the standard way. The variant pipeline has examples using it. This is useful when preparing the CWL definitions of inputs and outputs for new steps in the bcbio CWL step definitions.
bcbio has two approaches to running on cloud providers like Amazon Web Services (AWS), Google Cloud (GCP) and Microsoft Azure. For smaller projects we use a simplified ansible based approach which automates spinning up single multicore machines for running either traditional or docs-cwl bcbio runs.
For larger distributed projects, we're actively working on using docs-cwl support with runners like Cromwell that directly interface and run on cloud services. We'll document these approaches here as they're tested and available.
For getting started, the CWL docs-cwl-installation documentation describes how to install bcbio-vm, which provides a wrapper around bcbio that automates interaction with cloud providers and Docker. bcbio_vm.py also cleans up the command line usage to make it more intuitive and provides a superset of functionality available in bcbio_nextgen.py.
Cromwell runs bcbio CWL pipelines on Google Cloud using the Google Pipelines API.
To setup a Google Compute environment, you'll make use of the Web based console and gcloud and gsutil from the Google Cloud SDK, which provide command line interfaces for managing data in Google Storage and Google Compute instances. You can install them with:
bcbio_conda install -c conda-forge -c bioconda google-cloud-sdk
For authentication, you want to set up a Google Cloud Platform service account. The environmental variable GOOGLE_APPLICATION_CREDENTIALS identifies a JSON file of credentials, which bcbio passes to Cromwell for authentication:
gcloud auth login
gcloud projects create your-project
gcloud iam service-accounts create your-service-account
gcloud projects add-iam-policy-binding your-project --member \
    "serviceAccount:your-service-account@your-project.iam.gserviceaccount.com" --role "roles/owner"
gcloud iam service-accounts keys create ~/.config/gcloud/your-service-account.json \
    --iam-account your-service-account@your-project.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/your-service-account.json
You'll need a project for your run along with a Google Storage bucket for your data and run intermediates:
gcloud config set project your-project
gsutil mb gs://your-project
Additional documentation for Cromwell: Google Pipelines API and Google authentication.
Cromwell can localize data present in Google Storage buckets as part of the run process, and bcbio will translate the data present in these storage buckets into references for the CWL run inputs.
Upload your data with gsutil:
gsutil cp your_data.bam gs://your-project/inputs/
Create a bcbio_system-gcp.yaml input file for docs-cwl-generate:
gs:
  ref: gs://bcbiodata/collections
  inputs:
    - gs://your-project/inputs
resources:
  default: {cores: 2, memory: 3G, jvm_opts: [-Xms750m, -Xmx3000m]}
Generate a Common Workflow Language representation:
bcbio_vm.py template --systemconfig bcbio_system-gcp.yaml ${TEMPLATE}-template.yaml $PNAME.csv
bcbio_vm.py cwl --systemconfig bcbio_system-gcp.yaml $PNAME/config/$PNAME.yaml
Run the CWL using Cromwell by specifying the project and root Google Storage bucket for intermediates:
bcbio_vm.py cwlrun cromwell $PNAME --cloud-project your-project \
--cloud-root gs://your-project/work_cromwell
Amazon Web Services (AWS) provides a flexible cloud based environment for running analyses. Cloud approaches offer the ability to perform analyses at scale with no investment in local hardware. They also offer full programmatic control over the environment, allowing bcbio to automate the entire setup, run and teardown process.
bcbio-vm uses Elasticluster to build a cluster on AWS with an encrypted NFS mounted drive and an optional Lustre shared filesystem. We're phasing out this approach to cloud support in bcbio and will be actively moving to Common Workflow Language based approaches.
The easiest way to organize AWS projects is using an analysis folder inside an S3 bucket. Create a bucket and folder for your analysis and upload fastq, BAM and, optionally, a region BED file. Bucket names should include only lowercase letters, numbers and hyphens (-) to conform to S3 bucket naming restrictions and avoid issues with resolution of SSL keys. You can create buckets and upload files using the AWS S3 web console, the AWS cli client or specialized tools like gof3r.
You will also need a template file describing the type of run to do and a CSV file mapping samples in the bucket to names and any other metadata. See the automated-sample-config docs for more details about these files. Also upload both of these files to S3.
With that in place, prepare and upload the final configuration to S3 with:
bcbio_vm.py template s3://your-project/your-analysis/template.yaml s3://your-project/your-analysis/name.csv
This will find the input files in the s3://your-project/your-analysis bucket, associate fastq and BAM files with the right samples, and add any found BED files as variant_regions in the configuration. It will then upload the final configuration back to S3 as s3://your-project/your-analysis/name.yaml, which you can run directly from a bcbio cluster on AWS. By default, bcbio will use the us-east S3 region, but you can specify a different region in the s3 path to the metadata file: s3://your-project@eu-central-1/your-analysis/name.csv
We currently support human analysis with both the GRCh37 and hg19 genomes. We can also add additional genomes as needed by the community and generally welcome feedback and comments on reference data support.
We're not able to automatically install some useful tools in pre-built docker containers due to licensing restrictions. Variant calling with GATK requires a manual download from the GATK download site for academic users. Commercial users need a license for GATK and for somatic calling with muTect. To make these jars available, upload them to the S3 bucket in a jars directory. bcbio will automatically include the correct GATK and muTect directives during your run. Alternatively, you can also manually specify the path to the jars using a global resources section of your input sample YAML file:
resources:
  gatk:
    jar: s3://bcbio-syn3-eval/jars/GenomeAnalysisTK.jar
As with sample YAML scripts, specify a different region with an @ in the bucket name: s3://your-project@us-west-2/jars/GenomeAnalysisTK.jar
The first time running bcbio on AWS you'll need to setup permissions, VPCs and local configuration files. We provide commands to automate all these steps and once finished, they can be re-used for subsequent runs. To start you'll need to have an account at Amazon and your Access Key ID and Secret Key ID from the AWS security credentials page. These can be IAM credentials instead of root credentials as long as they have administrator privileges. Make them available to bcbio using the standard environmental variables:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
With this in place, two commands setup your elasticluster and AWS environment to run a bcbio cluster. The first creates public/private keys, a bcbio IAM user, and sets up an elasticluster config in ~/.bcbio/elasticluster/config:
bcbio_vm.py aws iam --region=us-east-1
The second configures a VPC to host bcbio:
bcbio_vm.py aws vpc --region=us-east-1
The aws vpc command is idempotent and can run multiple times if you change or remove parts of the infrastructure. You can also rerun the aws iam command, but if you'd like to generate a new elasticluster configuration file (~/.bcbio/elasticluster/config) add the recreate flag: bcbio_vm.py aws iam --recreate. This generates a new set of IAM credentials and public/private keys. These are only stored in the ~/.bcbio directory so you need to fully recreate them if you delete the old ones.
Following this setup, you're ready to run a bcbio cluster on AWS. We start from a standard Ubuntu AMI, installing all software for bcbio and the cluster as part of the boot process.
To configure your cluster run:
bcbio_vm.py aws config edit
This dialog allows you to define the cluster size and machine resources you'd like to use. The defaults only have small instances to prevent accidentally starting an expensive run. If you're planning a run with less than 32 cores, do not use a cluster and instead run directly on a single machine using one of the large r3 or c3 instances.
This script also sets the size of the encrypted NFS-mounted drive, which you can use to store processing data when running across a distributed cluster. At scale, you can replace this with a Lustre shared filesystem. See below for details on launching and attaching a Lustre filesystem to a cluster.
To ensure everything is correctly configured, run:
bcbio_vm.py aws info
When happy with your setup, start the cluster with:
bcbio_vm.py aws cluster start
The cluster will take five to ten minutes to start and be provisioned. If you encounter any intermittent failures, you can rerun the cluster configuration step with bcbio_vm.py aws cluster setup or the bcbio-specific installation with bcbio_vm.py aws cluster bootstrap.
Elasticluster mounts the /encrypted directory as a NFS share available across all of the worker machines. You can use this as a processing directory for smaller runs but for larger runs may need a scalable distributed file system. bcbio supports using Intel Cloud Edition for Lustre (ICEL) to set up a Lustre scratch filesystem on AWS.
bcbio_vm.py aws icel create
If you encounter intermittent failures while installing the collectl plugin, the Lustre server was still created successfully and you can rerun the Lustre configuration step with bcbio_vm.py aws icel create --setup. If you had any failure creating the Lustre server before the collectl plugin installation, you should stop it and try again.
bcbio_vm.py aws icel mount
To run the analysis, connect to the head node with:
bcbio_vm.py aws cluster ssh
Create your project directory and link the global bcbio configuration file in there with:
mkdir /encrypted/your-project
cd !$ && mkdir work && cd work
sudo mkdir /scratch/cancer-dream-syn3-exome
sudo chown ubuntu !$
cd !$ && mkdir work && cd work
If you started a single machine, run with:
bcbio_vm.py run -n 8 s3://your-project/your-analysis/name.yaml
Where the -n argument should be the number of cores on the machine.
To run on a full cluster:
bcbio_vm.py ipythonprep s3://your-project/your-analysis/name.yaml slurm cloud -n 60
sbatch bcbio_submit.sh
Where 60 is the total number of cores to use across all the worker nodes. Of your total machine cores, allocate 2 for the base bcbio_vm script and IPython controller instances. The SLURM workload manager distributes jobs across your cluster on a queue called cloud. A slurm-PID.out file in the work directory contains the current status of the job, and sacct_std provides the status of jobs on the cluster. If you are new to SLURM, here is a summary of useful SLURM commands.
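For example, standard SLURM commands let you inspect the queue and job history from the head node; the user and job id below are placeholders:
squeue -u ubuntu
sacct -j your_job_id --format=JobID,State,Elapsed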
On successful completion, bcbio uploads the results of the analysis back into your s3 bucket and folder as s3://your-project/your-analysis/final. You can now cleanup the cluster and Lustre filesystem.
AWS runs include automatic monitoring of resource usage with collectl. bcbio_vm uses collectl statistics to plot CPU, memory, disk and network usage during each step of a run. To prepare resource usage plots after finishing an analysis, first copy the bcbio-nextgen.log file to your local computer. Either use bcbio_vm.py elasticluster sftp bcbio to copy from the work directory on AWS (/encrypted/your-project/work/log/bcbio-nextgen.log) or transfer it from the output S3 bucket (your-project/your-analysis/final/DATE_your-project/bcbio-nextgen.log).
If your run worked cleanly you can use the log input file directly. If you had failures and restarts, or would only like to graph part of the run, you can edit the timing steps. Run grep Timing bcbio-nextgen.log > your-run.txt to get the timing steps only, then edit as desired.
Retrieve the collectl statistics from the AWS cluster and prepare the resource usage graphs with:
bcbio_vm.py graph bcbio-nextgen.log
By default the collectl stats will be in monitoring/collectl and plots in monitoring/graphs based on the above log timeframe. If you need to re-run plots later after shutting the cluster down, you can use the none cluster flag by running bcbio_vm.py graph bcbio-nextgen.log --cluster none.
If you'd like to run graphing from a local non-AWS run, such as a local HPC cluster, run bcbio_vm.py graph bcbio-nextgen.log --cluster local instead.
For convenience, there's a "serialize" flag (-s) that saves the dataframe used for plotting. To explore the data, extract specific datapoints or zoom in, you can deserialize the output like a standard Python pickle file:
import gzip
import cPickle as pickle
with gzip.open("./monitoring/collectl_info.pickle.gz", "rb") as decomp:
    collectl_info = pickle.load(decomp)
You can then plot, slice and zoom the data in a Jupyter notebook using matplotlib or highcharts (https://github.com/arnoutaertgeerts/python-highcharts).
In addition to plots, the summarize_timing.py utility script prepares a summary table of run times per step.
The bcbio Elasticluster and Lustre integration can spin up a lot of AWS resources. You'll be paying for these by the hour so you want to clean them up when you finish running your analysis. To stop the cluster:
bcbio_vm.py aws cluster stop
To remove the Lustre stack:
bcbio_vm.py aws icel stop
Double check that all instances have been properly stopped by looking in the AWS console.
Experienced elasticluster users can edit the configuration files themselves. bcbio provides a small wrapper that automatically reads and writes these configurations to avoid users needing to understand elasticluster internals, but all functionality is fully available. Edit your ~/.bcbio/elasticluster/config file to change parameters. You can also see the latest example configuration in the bcbio-vm GitHub repository for more details on the other available options.
bcbio-nextgen runs in a temporary work directory which contains a number of processing intermediates. Pipeline completion extracts the final useful output files into a separate directory, specified by the upload-configuration. This configuration allows upload to local directories, Galaxy, or Amazon S3. Once you have extracted and confirmed the output files, you can delete the temporary directory to save space.
The output directory contains sample specific output files labeled by sample name and a more general project directory. The sample directories contain all of the sample specific output files, while the project directory contains global files like project summaries or batched population level variant calls. See the teaching documentation for a full variant calling example with additional details about configuration setting and resulting output files.
This section collects useful scripts and tools to do downstream analysis of bcbio-nextgen outputs. If you have pointers to useful tools, please add them to the documentation.
This section provides useful concepts for getting started digging into the code and contributing new functionality. We welcome contributors and hope these notes help make it easier to get started.
During development we seek to maximize functionality and usefulness, while avoiding complexity. Since these goals are sometimes in conflict, it's useful to understand the design approaches:
The most useful modules inside bcbio, ordered by likely interest:
bcbio-nextgen uses GitHub for code development, and we welcome pull requests. GitHub makes it easy to establish custom forks of the code and contribute those back. The Biopython documentation has great information on using git and GitHub for a community developed project. In short, make a fork of the bcbio code by clicking the Fork button in the upper right corner of the GitHub page, commit your changes to this custom fork and keep it up to date with the main bcbio repository as you develop. The GitHub help pages have detailed information on keeping your fork updated with the main github repository (e.g. https://help.github.com/articles/syncing-a-fork/). After committing changes, click New Pull Request from your fork when you'd like to submit your changes for integration in bcbio.
For developing and testing changes locally, you can install directly into a bcbio-nextgen installation. The automated bcbio-nextgen installer creates an isolated Python environment using Anaconda. This will be a subdirectory of your installation root, like /usr/local/share/bcbio-nextgen/anaconda. The installer also includes a bcbio_python executable target which is the python in this isolated anaconda directory. You generally will want to make changes to your local copy of the bcbio-nextgen code and then install these into the code directory. To install from your bcbio-nextgen source tree for testing do:
bcbio_python setup.py install
One tricky part that we don't yet know how to work around is that pip and standard setup.py install have different ideas about how to write Python eggs. setup.py install will create an isolated python egg directory like bcbio_nextgen-0.7.5a-py2.7.egg, while pip creates an egg pointing to a top level bcbio directory. Where this gets tricky is that the top level bcbio directory takes precedence. The best way to work around this problem is to manually remove the current pip installed bcbio-nextgen code (rm -rf /path/to/anaconda/lib/python2.7/site-packages/bcbio*) before managing it manually with bcbio_python setup.py install. We'd welcome tips about ways to force consistent installation across methods.
If you want to test with bcbio_nextgen code in a separate environment from your work directory, we recommend using the installer to install only the bcbio code into a separate directory:
python bcbio_nextgen_install.py /path/to/testbcbio --nodata --isolate
Then add this directory to your PATH before your bcbio installation with the tools: export PATH=/path/to/testbcbio/anaconda/bin:$PATH, or directly calling the testing bcbio /path/to/testbcbio/anaconda/bin/bcbio_nextgen.py.
If you have added or modified this documentation, you can build it locally and see how it looks by running:
cd docs
make html
The documentation will be built under docs/_build/html, open index.html with your browser to load your local build.
If you want to use the same theme that Read The Docs uses, you can do so by installing sphinx_rtd_theme via pip. You will also need to add this in the docs/conf.py file to use the theme only locally:
import os

html_theme = 'default'
on_rtd = os.environ.get('READTHEDOCS', False)
if not on_rtd:  # only import and set the theme if we're building docs locally
    import sphinx_rtd_theme
    html_theme = 'sphinx_rtd_theme'
    html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
Write new aligners within their own submodule inside the ngsalign directory. bwa.py is a good example to follow along with. There are two functions to implement, based on which type of alignment you'd like to allow:
See the names section for more details on arguments.
Other required implementation details include:
Once implemented, plug the aligner into the pipeline by defining it as a _tool in bcbio/pipeline/alignment.py. You can then use it as normal by specifying the name of the aligner in the aligner section of your configuration input.
New variant calling approaches live within their own module inside bcbio/variation. The freebayes.py implementation is a good example to follow for providing your own variant caller. Implement a function to run variant calling on multiple BAMs in an input region that takes the following inputs:
Once implemented, add the variant caller into the pipeline by updating caller_fns in the variantcall_sample function in bcbio/variation/genotype.py. You can use it by specifying it in the variantcaller parameter of your sample configuration.
While bcbio-nextgen and supporting tools receive the most testing and development on human or human-like diploid organisms, the algorithms are generic and we strive to support the wide diversity of organisms used in your research. We welcome contributors interested in setting up and maintaining support for their particular research organism, and this section defines the steps in integrating a new genome. We also welcome suggestions and implementations that improve this process.
Setup CloudBioLinux to automatically download and prepare the genome:
Add the organism to the supported installs within bcbio:
Test installation of genomes by pointing to your local cloudbiolinux edits during a data installation:
mkdir -p tmpbcbio-install
ln -s ~/bio/cloudbiolinux tmpbcbio-install
bcbio_nextgen.py upgrade --data --genomes DBKEY
Add configuration information to bcbio-nextgen by creating a config/genomes/DBKEY-resources.yaml file. Copy an existing minimal template like canFam3 and edit with pointers to snpEff and other genome resources. The VEP database directory has Ensembl names. SnpEff has a command to list available databases:
snpEff databases
Finally, send pull requests for CloudBioLinux and bcbio-nextgen and we'll happily integrate the new genome.
This will provide basic integration with bcbio and allow running a minimal pipeline with alignment and quality control. We also have utility scripts in CloudBioLinux to help with preparing dbSNP (utils/prepare_dbsnp.py) and RNA-seq (utils/prepare_tx_gff.py) resources for some genomes. For instance, to prepare RNA-seq transcripts for mm9:
bcbio_python prepare_tx_gff.py --genome-dir /path/to/bcbio/genomes Mmusculus mm9
We are still working on ways to best include these as part of the standard build and install since they either require additional tools to run locally, or require preparing copies in S3 buckets.
This dictionary provides lane and other BAM run group naming information used to correctly build BAM files. We use the rg attribute as the ID within a BAM file:
{'lane': '7_100326_FC6107FAAXX',
'pl': 'illumina',
'pu': '7_100326_FC6107FAAXX',
'rg': '7',
'sample': 'Test1'}
The data dictionary is a large dictionary representing processing, configuration and files associated with a sample. The standard work flow is to pass this dictionary between functions, updating with associated files from the additional processing. Populating this dictionary only with standard types allows serialization to JSON for distributed processing.
The dictionary is dynamic throughout the workflow depending on the step, but some of the most useful key/values available throughout are:
It also contains information on the genome build, sample name and reference genome file throughout. Here's an example of these inputs:
{'config': {'algorithm': {'aligner': 'bwa',
'callable_regions': 'analysis_blocks.bed',
'coverage_depth': 'low',
'coverage_interval': 'regional',
'mark_duplicates': 'samtools',
'nomap_split_size': 50,
'nomap_split_targets': 20,
'num_cores': 1,
'platform': 'illumina',
'quality_format': 'Standard',
'realign': 'gkno',
'recalibrate': 'gatk',
'save_diskspace': True,
'upload_fastq': False,
'validate': '../reference_material/7_100326_FC6107FAAXX-grade.vcf',
'variant_regions': '../data/automated/variant_regions-bam.bed',
'variantcaller': 'freebayes'},
'resources': {'bcbio_variation': {'dir': '/usr/share/java/bcbio_variation'},
'bowtie': {'cores': None},
'bwa': {'cores': 4},
'cortex': {'dir': '~/install/CORTEX_release_v1.0.5.14'},
'cram': {'dir': '/usr/share/java/cram'},
'gatk': {'cores': 2,
'dir': '/usr/share/java/gatk',
'jvm_opts': ['-Xms750m', '-Xmx2000m'],
'version': '2.4-9-g532efad'},
'gemini': {'cores': 4},
'novoalign': {'cores': 4,
'memory': '4G',
'options': ['-o', 'FullNW']},
'picard': {'cores': 1,
'dir': '/usr/share/java/picard'},
'snpEff': {'dir': '/usr/share/java/snpeff',
'jvm_opts': ['-Xms750m', '-Xmx3g']},
'stampy': {'dir': '~/install/stampy-1.0.18'},
'tophat': {'cores': None},
'varscan': {'dir': '/usr/share/java/varscan'},
'vcftools': {'dir': '~/install/vcftools_0.1.9'}}},
 'genome_resources': {'aliases': {'ensembl': 'human',
'human': True,
'snpeff': 'hg19'},
'rnaseq': {'transcripts': '/path/to/rnaseq/ref-transcripts.gtf',
'transcripts_mask': '/path/to/rnaseq/ref-transcripts-mask.gtf'},
'variation': {'dbsnp': '/path/to/variation/dbsnp_132.vcf',
'train_1000g_omni': '/path/to/variation/1000G_omni2.5.vcf',
'train_hapmap': '/path/to/hg19/variation/hapmap_3.3.vcf',
'train_indels': '/path/to/variation/Mills_Devine_2hit.indels.vcf'},
'version': 1},
'dirs': {'fastq': 'input fastq directory',
'galaxy': 'directory with galaxy loc and other files',
'work': 'base work directory'},
'metadata': {'batch': 'TestBatch1'},
'genome_build': 'hg19',
'name': ('', 'Test1'),
'sam_ref': '/path/to/hg19.fa'}
Processing also injects other useful key/value pairs. Here's an example of additional information supplied during a variant calling workflow:
{'prep_recal': 'Test1/7_100326_FC6107FAAXX-sort.grp',
'summary': {'metrics': [('Reference organism', 'hg19', ''),
('Total', '39,172', '76bp paired'),
('Aligned', '39,161', '(100.0\\%)'),
('Pairs aligned', '39,150', '(99.9\\%)'),
('Pair duplicates', '0', '(0.0\\%)'),
('Insert size', '152.2', '+/- 31.4')],
'pdf': '7_100326_FC6107FAAXX-sort-prep-summary.pdf',
'project': 'project-summary.yaml'},
'validate': {'concordant': 'Test1-ref-eval-concordance.vcf',
'discordant': 'Test1-eval-ref-discordance-annotate.vcf',
'grading': 'validate-grading.yaml',
'summary': 'validate-summary.csv'},
'variants': [{'population': {'db': 'gemini/TestBatch1-freebayes.db',
'vcf': None},
'validate': None,
'variantcaller': 'freebayes',
'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf'}],
'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf',
'work_bam': '7_100326_FC6107FAAXX-sort-prep.bam'}
bcbio-nextgen supports parallel runs on local machines using multiple cores, and distributed runs on a cluster using IPython, via a general framework.
The first parallelization step starts up a set of resources for processing. On a cluster this spawns an IPython parallel controller and a set of engines for processing. The prun (parallel run) start function is the entry point to spawning the cluster, and the main argument is a parallel dictionary which contains arguments to the engine processing command. Here is an example input from an IPython parallel run:
{'cores': 12,
'type': 'ipython',
'progs': ['aligner', 'gatk'],
'ensure_mem': {'star': 30, 'tophat': 8, 'tophat2': 8},
'module': 'bcbio.distributed',
'queue': 'batch',
'scheduler': 'torque',
'resources': [],
'retries': 0,
'tag': '',
'timeout': 15}
The cores and type arguments must be present, identifying the total cores to use and the type of processing, respectively. Following these are arguments that help identify the resources to use. progs specifies the programs used: here the aligner, which bcbio looks up from the input sample file, and gatk. ensure_mem is an optional argument that specifies minimum memory requirements for programs used in the workflow. The remaining arguments are all specific to IPython and help it spin up engines on the appropriate computing cluster.
A shared component of all processing runs is the identification of used programs from the progs argument. The run creation process looks up required memory and CPU resources for each program from the config-resources section of your bcbio_system.yaml file. It combines these resources into required memory and cores using the logic described in the memory-management section of the parallel documentation. Passing these requirements to the cluster creation process ensures the available machines match program requirements.
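As an illustration only, the sketch below shows one simple way such per-program requirements could be combined into a per-job resource request. The function name combine_resources and the "take the maximum" rule are assumptions for this example, not bcbio's actual implementation:
def combine_resources(progs, resources, total_cores):
    # Pick the largest core and per-core memory request among the programs in
    # use, then derive how many concurrent jobs fit into the available cores.
    cores_per_job = max(resources[p].get("cores", 1) for p in progs)
    mem_per_core = max(resources[p].get("memory_gb", 1) for p in progs)
    num_jobs = max(1, total_cores // cores_per_job)
    return {"cores_per_job": cores_per_job,
            "num_jobs": num_jobs,
            "mem_per_core_gb": mem_per_core}

# Example: 12 total cores with bwa (4 cores, 1Gb/core) and gatk (2 cores,
# 2Gb/core) yields 3 concurrent jobs of 4 cores each with 2Gb per core.
print(combine_resources(["bwa", "gatk"],
                        {"bwa": {"cores": 4, "memory_gb": 1},
                         "gatk": {"cores": 2, "memory_gb": 2}}, 12))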
bcbio-nextgen's pipeline.main code contains examples of starting and using a set of available processing engines. This example starts up machines that use samtools, gatk and cufflinks, then runs an RNA-seq expression analysis:
with prun.start(_wprogs(parallel, ["samtools", "gatk", "cufflinks"]),
samples, config, dirs, "rnaseqcount") as run_parallel:
samples = rnaseq.estimate_expression(samples, run_parallel)
The pipelines often reuse a single set of machines for multiple distributed functions to avoid the overhead of starting up and tearing down machines and clusters.
The run_parallel function returned by prun.start enables running jobs in parallel on the created machines. The ipython wrapper code contains examples of implementing this. It is a simple function that takes two arguments: the name of the function to run and a set of multiple arguments to pass to that function:
def run(fn_name, items):
The items arguments need to be strings, lists and dictionaries to allow serialization to JSON format. The internals of the run function take care of running all of the code in parallel and returning the results back to the caller function.
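As a hedged illustration of the expected shape of those arguments, a call might look like the following; the function name process_sample and the sample dictionaries are hypothetical placeholders, not part of the pipeline:
# Each item is itself a list of JSON-serializable arguments for one parallel
# task. "process_sample" is a placeholder function name for this sketch.
items = [[{"description": "Test1", "files": ["Test1_1.fq", "Test1_2.fq"]}],
         [{"description": "Test2", "files": ["Test2_1.fq", "Test2_2.fq"]}]]
results = run_parallel("process_sample", items)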
In this setup, the main processing code is fully independent of the parallel method used, so running on a single multicore machine or in parallel on a cluster returns identical results and requires no changes to the logic defining the pipeline.
During re-runs, we avoid the expense of spinning up processing clusters for completed tasks by using simple checkpoint files in the checkpoints_parallel directory. The prun.start wrapper writes these on completion of processing for a group of tasks with the same parallel architecture; on subsequent runs it steps through those tasks on the local machine instead of parallelizing. The processing code supports these quick re-runs by checking for existing output files and skipping tasks that have already produced them.
Plugging new parallelization approaches into this framework involves writing interface code that handles two steps. First, create a cluster of ready-to-run machines, given the parallel dictionary with expected core and memory utilization.
Second, implement a run_parallel function that handles using these resources to distribute jobs and return results. The multicore wrapper and ipython wrapper are useful starting points for understanding the current implementations.
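For example, here is a minimal sketch of what a new backend could look like, using Python's multiprocessing as the underlying resource instead of a real cluster; the names and the simple dispatch-by-function-name registry are assumptions for illustration, not bcbio's actual multicore implementation:
import contextlib
import multiprocessing

def _echo(args):
    # Toy task used for the example below; a real backend would dispatch to
    # the pipeline's distributed task functions instead.
    return args

# Registry mapping function names to callables (an assumption for this sketch).
_TASKS = {"echo": _echo}

@contextlib.contextmanager
def start(parallel):
    # Step one: create a pool of ready-to-run workers sized from the
    # parallel dictionary.
    pool = multiprocessing.Pool(parallel["num_jobs"])
    try:
        yield pool
    finally:
        pool.close()
        pool.join()

def get_run_parallel(pool):
    # Step two: return a run_parallel(fn_name, items) function that
    # distributes jobs to the workers and gathers the results.
    def run_parallel(fn_name, items):
        return pool.map(_TASKS[fn_name], items)
    return run_parallel

if __name__ == "__main__":
    with start({"num_jobs": 2}) as pool:
        run_parallel = get_run_parallel(pool)
        print(run_parallel("echo", [["sample1"], ["sample2"]]))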
bcbio calculates callable regions following alignment using goleft depth. These regions determine breakpoints for analysis, allowing parallelization by genomic regions during variant calling. Defining non-callable regions lets bcbio select breakpoints for parallelization within chromosomes where splitting won't accidentally impact small variant detection. The callable regions also supplement the variant calls to define positions where uncalled bases are homozygous reference, as opposed to true no-calls with no evidence. Callable regions plus variant calls are an alternative to gVCF output, which explicitly enumerates reference calls in the output variant file.
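As a small illustration of this idea (not bcbio code), the sketch below checks whether a position missing from the variant calls falls inside a callable-regions BED file, distinguishing an implied homozygous reference call from a true no-call; the function name, file name and position are hypothetical:
def is_callable(bed_file, chrom, pos):
    # Return True if the 0-based position falls inside any callable region
    # listed in a BED file (chrom, start, end in the first three columns).
    with open(bed_file) as handle:
        for line in handle:
            if not line.strip():
                continue
            c, start, end = line.rstrip().split("\t")[:3]
            if c == chrom and int(start) <= pos < int(end):
                return True
    return False

# A position absent from the VCF but inside the callable regions can be
# treated as homozygous reference; outside them, there is no evidence either way.
print(is_callable("analysis_blocks.bed", "chr6", 31356000))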
Feel free to reuse any images or text from these talks. The slides are on GitHub.
Community Development of Validated Variant Calling Pipelines
Brad Chapman, Rory Kirchner, Oliver Hofmann and Winston Hide Harvard School of Public Health, Bioinformatics Core, Boston, MA, 02115
Translational research relies on accurate identification of genomic variants. However, rapidly changing best practice approaches in alignment and variant calling, coupled with large data sizes, make it a challenge to create reliable and reproducible variant calls. Coordinated community development can help overcome these challenges by sharing testing and updates across multiple groups. We describe bcbio-nextgen, a distributed multi-architecture pipeline that automates variant calling, validation and organization of results for query and visualization. It creates an easily installable, reliable infrastructure from best-practice open source tools with the following goals:
This is a teaching-oriented example of using bcbio from the Cold Spring Harbor Laboratory's Advanced Sequencing Technology and Applications course. It uses cancer tumor/normal data from the ICGC-TCGA DREAM synthetic 3 challenge, subset to exomes on chromosome 6 to reduce runtimes. It demonstrates:
To save downloading the genome data and running the analysis, we provide a pre-prepared AMI with the data and the analysis already run. Use the AWS Console to launch the pre-built AMI: search Community AMIs for ami-5e84fe34. Any small instance type is fine for exploring the configuration, run directory and output files. Make sure you associate a public IP and a security group that allows remote ssh.
Once launched, ssh into the remote machine with ssh -i your-keypair ubuntu@public.ip.address to explore the inputs and outputs. The default PATH contains bcbio and third party programs in /usr/local/bin, with the biological data installed in /usr/local/share/bcbio. The run is in ~/run/cancer-syn3-chr6.
To run bcbio, you prepare a small configuration file describing your analysis. You can prepare it manually or use an automated configuration method. The example has a pre-written configuration file with tumor/normal data located in the config directory; this section walks through the settings.
You define the type of analysis (variant calling) along with the input files and genome build:
analysis: variant2
files: [../input/cancer-syn3-chr6-tumor-1.fq.gz, ../input/cancer-syn3-chr6-tumor-2.fq.gz]
genome_build: hg38
Sample description and assignment as a tumor sample, called together with a matched normal:
description: syn3-tumor
metadata:
batch: syn3
phenotype: tumor
sex: female
Next, the configuration defines parameters for running the analysis. First we pick our aligner (bwa mem):
algorithm:
aligner: bwa
Post-alignment, we mark duplicates but do not perform recalibration and realignment:
mark_duplicates: true
recalibrate: false
realign: false
We call variants in exome regions on chromosome 6 using a BED file input, call variants down to a 2% allele frequency in the tumor sample, and use 3 variant callers along with an ensemble method that combines results for any variant found by 2 out of 3 callers:
variant_regions: ../input/NGv3-chr6-hg38.bed
min_allele_fraction: 2
variantcaller: [vardict, freebayes, varscan]
ensemble:
numpass: 2
For structural variant calling, we use two callers and prioritize variants to those found in the CIViC database:
svcaller: [lumpy, manta]
svprioritize: cancer/civic
Call HLA types with OptiType:
hlacaller: optitype
Finally, we validate both the small variants and structural variants. These use pre-installed validation sets that come with bcbio. We limit validation regions to avoid low complexity regions, which cause bias in validating indels (http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/):
exclude_regions: [lcr]
validate: dream-syn3-crossmap/truth_small_variants.vcf.gz
validate_regions: dream-syn3-crossmap/truth_regions.bed
svvalidate:
DEL: dream-syn3-crossmap/truth_DEL.bed
DUP: dream-syn3-crossmap/truth_DUP.bed
INV: dream-syn3-crossmap/truth_INV.bed
Output files are in ~/run/cancer-syn3-chr6/final, extracted from the full work directory in ~/run/cancer-syn3-chr6/work.
The directories with sample information are in syn3-tumor/. Aligned BAMs include a -ready.bam file with all of the original reads (including split and discordant reads) and separate files with only the split (-sr.bam) and discordant (-disc.bam) reads:
syn3-tumor-ready.bam
syn3-tumor-ready.bam.bai
syn3-tumor-sr.bam
syn3-tumor-sr.bam.bai
syn3-tumor-disc.bam
syn3-tumor-disc.bam.bai
SNP and indel calls for 3 callers, plus combined ensemble calls:
syn3-tumor-ensemble.vcf.gz
syn3-tumor-ensemble.vcf.gz.tbi
syn3-tumor-freebayes.vcf.gz
syn3-tumor-freebayes.vcf.gz.tbi
syn3-tumor-varscan.vcf.gz
syn3-tumor-varscan.vcf.gz.tbi
syn3-tumor-vardict.vcf.gz
syn3-tumor-vardict.vcf.gz.tbi
Structural variant calls for 2 callers, plus a simplified list of structural variants in cancer genes of interest:
syn3-tumor-sv-prioritize.tsv
syn3-tumor-lumpy.vcf.gz
syn3-tumor-lumpy.vcf.gz.tbi
syn3-tumor-manta.vcf.gz
syn3-tumor-manta.vcf.gz.tbi
HLA typing results:
syn3-tumor-hla-optitype.csv
Validation results from comparisons against truth set, including plots:
syn3-tumor-sv-validate.csv
syn3-tumor-sv-validate-DEL.png
syn3-tumor-sv-validate-df.csv
syn3-tumor-sv-validate-DUP.png
syn3-tumor-sv-validate-INV.png
syn3-tumor-validate.png
The top level directory for the project, 2015-11-18_syn3-cshl/, has files relevant to the entire run. There is a consolidated quality control report:
multiqc/multiqc_report.html
Provenance information, with log files of all commands run and program versions used:
bcbio-nextgen.log
bcbio-nextgen-commands.log
programs.txt
data_versions.csv
A top level summary of metrics for alignment, variant calling and coverage that is useful downstream:
project-summary.yaml
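For instance, a downstream script could load this file with PyYAML; the key layout below (samples, summary, metrics) is an assumption based on the metrics shown earlier, so inspect your own project-summary.yaml to confirm the structure:
import yaml

with open("project-summary.yaml") as handle:
    summary = yaml.safe_load(handle)

# Print per-sample metrics; the "samples" / "summary" / "metrics" keys are
# assumed for this sketch and may differ between bcbio versions.
for sample in summary.get("samples", []):
    metrics = sample.get("summary", {}).get("metrics", {})
    print(sample.get("description"), metrics)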
These are the steps to prepare an AMI from a bare machine and run the analysis. They are pre-done on the teaching AMI to save time:
ssh -i ~/.ec2/your-key.pem ubuntu@public-ip
sudo apt-get update
sudo apt-get install -y build-essential zlib1g-dev wget curl python-setuptools git \
    openjdk-7-jdk openjdk-7-jre ruby libncurses5-dev libcurl4-openssl-dev \
    libbz2-dev unzip pigz bsdmainutils
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir /usr/local \
    --genomes hg38 --aligners bwa --sudo --isolate -u development
mkdir -p run
cd run
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/teaching/cancer-syn3-chr6-prep.sh
bash cancer-syn3-chr6-prep.sh
cd cancer-syn3-chr6/work
bcbio_nextgen.py ../config/cancer-syn3-chr6.yaml -n 16
https://github.com/bcbio/bcbio-nextgen
Data was analyzed with bcbio-nextgen (https://github.com/bcbio/bcbio-nextgen) using piDNA to detect the adapter, cutadapt to remove it, STAR/bowtie to align against the genome and seqcluster to detect small RNA transcripts. miRNAs were detected using the miraligner tool with miRBase as the reference miRNA database. tRNA profiles were detected using the tdrmapper tool. mirdeep2 was used for discovery of novel miRNAs. FastQC was used for QC metrics and MultiQC for reporting.
Download BIB format: https://github.com/bcbio/bcbio-nextgen/tree/master/docs/contents/misc/bcbio-smallrna.bib
For the alignment, add what you have used:
If you used TopHat2 for alignment:
If your output includes novel miRNA discovery, add:
If you have tRNA mapping output, add:
If you have miRge activated:
If you have MINTmap activated:
Brad Chapman
2019, bcbio-nextgen contributors
January 23, 2019 | 1.1.2 |