SEQCLUSTER(1) | seqcluster | SEQCLUSTER(1) |
seqcluster - seqcluster Documentation [image: seqcluster banner] [image]
Analysis of small RNA sequencing data. It detect unit of transcription over the genome, annotate them and create an HTML interactive report that helps to explore the data quickly.
Contents:
With bcbio installed
If you already have
`bcbio`_, seqcluster comes with it. If you want the last development version:
/bcbio_anaconda_bin_path/seqcluster_install.py --upgrade
Docker:
docker pull lpantano/smallsrna
Bioconda binary
install conda if you want an isolate env:
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh bash Miniconda-latest-Linux-x86_64.sh -b -p ~/install/seqcluster/anaconda
You can install directly from binstar (only for linux):
~/install/seqcluster/anaconda/conda install seqcluster seqbuster bedtools samtools pip nose numpy scipy pandas pyvcf -c bioconda
With that you will have everything you need for the python package. The last step is to add seqcluster to your PATH if conda is not already there.
Go to Tools dependecies below to continue with the installation.
Note: After installation is highly recommended to get the last updated version doing:
seqcluster_install.py --upgrade
automated installation
Strongly recommended to use bcbio installation if you work with sequencing data. But if you want a minimal installation:
pip install fabric seqcluster_install --upgrade mkdir -p $PATH_TO_TOOLS/bin seqcluster_install --tools $PATH_TO_TOOLS
After that you will need to add to your path: export PATH=$PATH_TO_TOOLS/bin:$PATH
For seqcluster command:
For some steps of a typical small RNA-seq pipeline (recommended to use directly
`bcbio`_
You will need to link the cutadapt binary to your PATH)
Easy way to install your small RNA seq data with cloudbiolinux. Seqcluster has snipped code to do that for you. Recommended to use
`bcbio`_
for the pipeline since will install everything you need in a single step bcbio_nextgen.py upgrade -u development --tools --genomes hg19 --aligners bowtie.
But If you want to run seqcluster step by step an example of hg19 human version it will be (another well annotated supported genome is mm10):
Download genome data:
seqcluster_install --data $PATH_TO_DATA --genomes hg19 --aligners bowtie2 --datatarget smallrna
If you want to install STAR indexes since gets kind of better results than bowtie2 (warning, 40GB memory RAM needed):
seqcluster_install --data $PATH_TO_DATA --genomes hg19 --aligners star
Install isomiRs package for R using devtools:
devtools::install_github('lpantano/isomiRs')
To install all packages used by the Rmd report:
Rscript -e 'source(https://raw.githubusercontent.com/lpantano/seqcluster/master/scripts/install_libraries.R)'
Please if you use seqcluster make sure to cite the other tools are integrated here:
A non-biased framework for the annotation and classification of the non-miRNA small RNA transcriptome. Pantano L1, Estivill X, Martí E. Bioinformatics. 2011 Nov 15;27(22):3202-3. doi: 10.1093/bioinformatics/btr527. Epub 2011 Oct 5. PMID: 21976421
SeqBuster is a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells. Pantano L, Estivill X, Martí E. Nucleic Acids Res. 2010 Mar;38(5):e34. Epub 2009 Dec 11.
Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.
Dale RK, Pedersen BS, and Quinlan AR. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics (2011). doi:10.1093/bioinformatics/btr539
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
Li H A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011 Nov 1;27(21):2987-93. Epub 2011 Sep 8. [PMID: 21903627]
Best practices are implemented in a python framework.
seqcluster generates a list of clusters of small RNA sequences, their genome location, their annotation and the abundance in all the sample of the project [image]
REMOVE ADAPTER
I am currently using cutadapt:
cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq
COLLAPSE READS
To reduce computational time, I recommend to collapse sequences, also it would help to apply filters based on abundances. Like removing sequences that appear only once.
seqcluster collapse -f sample1_clean.fastq -o collapse
Here I am only using sequences that had the adapter, meaning that for sure are small fragments.
This is compatible with UMI barcodes. If you have in the read name UMI_ATCGAT ``, then the tool will remove PCR dupiclates as well. To confirm this happened, the tool should output this sentence during the processing of the file: ``Find UMI tags in read names, collapsing by UMI.
PREPARE SAMPLES
seqcluster prepare -c file_w_samples -o res --minl 17 --minc 2 --maxl 45
the file_w_samples should have the following format:
lane1_sequence.txt_1_1_phred.fastq cc1 lane1_sequence.txt_2_1_phred.fastq cc2 lane2_sequence.txt_1_1_phred.fastq cc3 lane2_sequence.txt_2_1_phred.fastq cc4
two columns file, where the first column is the name of the file with the small RNA sequences for each sample, and the second column in the name of the sample.
The fastq files should be like this:
@seq_1_x11 CCCCGTTCCCCCCTCCTCC + QUALITY_LINE @seq_2_x20 TGCGCAGTGGCAGTATCGTAGCCAATG + QUALITY_LINE </pre>
Where _x[09] indicate the abundance of that sequence, and the middle number is the index of the sequence.
This script will generate: seqs.fastq and seqs.ma. * seqs.fastq: have unique sequences and unique ids * seqs.ma: is the abundance matrix of all unique sequences in all samples
ALIGNMENT
You should use an aligner to map seqs.fa to your genome. A possibility is bowtie or STAR. From here, we need a file in BAM format for the next step. VERY IMPORTANT: the BAM file should be sorted
bowtie -a --best --strata -m 5000 INDEX seqs.fastq -S | samtools view -Sbh /dev/stdin | samtools sort -o /dev/stdout temp > seqs.sort.bam
or
STAR --genomeDir $star_index_folder --readFilesIn res/seqs.fastq --alignIntronMax 1 --outFilterMultimapNmax 1000 --outSAMattributes NH HI NM --outSAMtype BAM SortedByCoordinate
CLUSTERING
seqcluster cluster -a res/Aligned.sortedByCoord.out.bam -m res/seqs.ma -g $GTF_FILE -o res/cluster -ref PATH_TO_GENOME_FASTA --db example
Example of a bed file for annotation (the fourth column should be the name of the feature):
chr1 157783 157886 snRNA 0 -
Strongly recommend gtf format. Bed annotation is deprecated. Go here to know how to download data from hg19 and mm10.
Example of a gtf file for annotation (the third column should be the name of the feature and the value after gene name attribute is the specific annotation):
chr1 source miRNA 1 11503 . + . gene name 'mir-102' ;
hint: scripts to generate human and mouse annotation are inside seqcluster/scripts folder.
OUTPUTS
This will create html report using the following command assuming the output of seqcluster cluster is at res:
seqcluster report -j res/seqcluster.json -o report -r $GENONE_FASTA_PATH
where $GENOME_FASTA_PATH is the path to the genome fasta file used in the alignment.
Note: you can try our new visualization tool!
An example of the output is below: [image]
Note:If you already are using bcbio, visit bcbio to run the pipeline there.
To install the small RNA data:
bcbio_nextgen.py upgrade -u development --tools --datatarget smallrna
Options to run in a cluster
It uses ipython-cluster-helper to send jobs to nodes in the cluster
Read complete usability here: https://github.com/roryk/ipython-cluster-helper An examples in slurm system is:
--parallel ipython --scheduler slurm --num-jobs 4 --queue general
Output
Beside the static HTML report that you can get using report subcommand, you can download this HTML. (watch the repository to get notifications of new releases.)
Meaning of different sections:
An example of the HTML code: _ ..examples
About
mirRQC project
samples overview:
>> Universal Human miRNA reference RNA (Agilent Technologies, #750700), human brain total RNA (Life Technologies, #AM6050), human liver total RNA (Life Technologies, #AM7960) and MS2-phage RNA (Roche, #10165948001) were diluted to a platform-specific concentration. RNA integrity and purity were evaluated using the Experion automated gel electrophoresis system (Bio-Rad) and Nanodrop spectrophotometer. All RNA samples were of high quality (miRQC A: RNA quality index (RQI, scale from 0 to 10) = 9.0; miRQC B: RQI = 8.7; human liver RNA: RQI = 9.2) and high purity (data not shown). RNA was isolated from serum prepared from three healthy donors using the miRNeasy mini kit (Qiagen) according to the manufacturer's instructions, and RNA samples were pooled. Informed consent was obtained from all donors (Ghent University Ethical Committee). Different kits for isolation of serum RNA are available; addressing their impact was outside the scope of this work. Synthetic miRNA templates for let-7a-5p, let-7b-5p, let-7c, let-7d-5p, miR-302a-3p, miR-302b-3p, miR-302c-3p, miR-302d-3p, miR-133a and miR-10a-5p were synthesized by Integrated DNA Technologies and 5′ phosphorylated. Synthetic let-7 and miR-302 miRNAs were spiked into MS2-phage RNA and total human liver RNA, respectively, at 5 × 106 copies/μg RNA. These samples do not contain endogenous miR-302 or let-7 miRNAs, which allowed unbiased analysis of cross-reactivity between the individual miR-302 and let-7 miRNAs measured by the platform and the different miR-302 and let-7 synthetic templates in a complex RNA background. Synthetic miRNA templates for miR-10a-5p, let-7a-5p, miR-302a-3p and miR-133a were spiked in human serum RNA at 6 × 103 copies per microliter of serum RNA or at 5-times higher, 2-times higher, 2-times lower and 5-times lower concentrations, respectively. All vendors received 10 μl of each serum RNA sample.
Commands
Data was download from GEO web with this script. The following 2 configs were used for the two sets: mirqc samples and non mirqc samples. Samples were analyzed with bcbio with the following commands
report
Report showing part of the output report of bcbio pipelines together with some validations are here.
miRNA annotation is running inside bcbio small RNAseq pipeline together with other tools to do a complete small RNA analysis.
For some comparison with other tools go here.
You can run samples after processing the reads as shown below. Currently there are two version: JAVA
Naming
See always up to date information here in mirtop open project.
It is a working process, but since 10-21-2015 isomiR naming has changed to:
REMOVE ADAPTER
I am currently using cutadapt.
cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq
COLLAPSE READS
To reduce computational time, I recommend to collapse sequences, also it would help to apply filters based on abundances. Like removing sequences that appear only once.
seqcluster collapse -f sample1_clean.fastq -o collapse
Here I am only using sequences that had the adapter, meaning that for sure are small fragments. The output will be named as sample1_clean_trimmed.fastq
For human or mouse, follows this instruction to download easily miRBase files. In general you only need hairpin.fa and miRNA.str from miRBase site. mirGeneDB is also supported, download the needed files here.
Highly recommended to filter hairpin.fa to contain only the desired species.
MIRALIGNER
Download the tool from miraligner repository.
Download the mirbase files (hairpin and miRNA) from the ftp and save it to DB folder.
You can map the miRNAs with.
java -jar miraligner.jar -sub 1 -trim 3 -add 3 -s hsa -i sample1_clean_trimmed.fastq -db DB -o output_prefix
Cite
SeqBuster is a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells. Pantano L, Estivill X, Martí E. Nucleic Acids Res. 2010 Mar;38(5):e34. Epub 2009 Dec 11.
NOTE: Check comparison of multiple tools for miRNA annotation.
Use mirtop to convert to GFF3-srna format. This is the desired format to share the isomiR information and can be used to join multiple projects together easily.
See to know how to convert all the output into a single file and share easily with collaborators:
mirtop gff --format seqbuster --sps hsa --hairpin database/hairpin.fa --gtf database/hsa.gff3 -o test_out out_folder/*/*.mirna
Use the outputs to do differential expression, clustering and descriptive analysis with this package: isomiRs
To load the data you can use IsomirDataSeqFromFiles function and get the count data with isoCounts to move to DESeq2 or similar packages.
options
Add -freq if you have your fasta/fastq file with this format and you want a third column with the frequency (normally value after x character):
>seq_1_x4 CACCGCTGTCGGGGAACCGCGCCAATTT
Add -pre if you want also sequences that map to the precursor but outside the mature miRNA
input
A fasta/fastq file reads:
>seq CACCGCTGTCGGGGAACCGCGCCAATTT
or tabular file with counts information:
CACCGCTGTCGGGGAACCGCGCCAATTT 45
output
Track file
*.mirna.opt: information about the process
Non mapped sequences will be on
*.nomap
Header of the
*.mirna.out file:
precursor => cctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT sequence add: TT => TGTGGTTAGCTGGTTGCATATTT
precursor => cctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT sequence tr5: CC => CCTGTGGTTAGCTGGTTGCATAT sequence tr5: tg => TGGTTAGCTGGTTGCATAT
precursor => cctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT sequence tr3: cc => TGTGGTTAGCTGGTTGCATATCC sequence tr3: AT => TGTGGTTAGCTGGTTGCAT
precursor => agcctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT s5 => AGCCTGTG
precursor => cctgtggttagctggttgcatatccgc annotated miRNA => TGTGGTTAGCTGGTTGCATAT s3 => ATATCCGC
Example:
seq miRNA start end mism tr5 tr3 add s5 s3 DB amb TGGCTCAGTTCAGCAGGACC hsa-mir-24-2 50 67 0 qCC 0 0 0 0 precursor 1 ACTGCCCTAAGTGCTCCTTCTG hsa-miR-18a* 47 68 0 0 0 tG ATCTACTG CTGGCA miRNA 1
Definition
Normally quality values are lost in small RNA-seq pipelines due to collapsing after adapter recognition. This option allow to collapse reads after adapter removal with cutadapt or any other tool. This way the mapping can use quality values, allowing to map using bwa for instance, or any other alignment tool that doesn't support FASTA files.
Methods
The new quality values are the average of each of the sequence collapse.
Example
seqcluster collapse -f sample_trimmed.fastq -o collapse
@seq_[0-9]_x[0-9]
The number right after _x means the abundance of this sequence in the sample
Definition
multi-mapped reads are the sequences that map more than one time on the genome, for instance, because there are multiple copies of a gene, like happens with tRNA precursors
Consequence
Many pipelines ignores these sequences as defaults, what means that you are losing at leas 20-30% of the data. In this case is difficult to decide where these sequences come from and currently there are three strategies:
Our implementation [image]
We try to decide the origin of these sequences. The most common scenario is that a group of sequences map two three different regions, probably due to multi-copies on the genome of the precursor.
We introduce two options:
The main advantage of this, it is that it can be the input of any downstream analysis that is applied to RNA-seq, like DESeq, edgeR ... As well, there is less noise, because there is only one output coming from here, not three.
TFmiR: disease-specific miRNA/transcription factor co-regulatory networks v1.2. It uses results from UP/DOWN regulated miRNA/Genes and allows to focus in only one disease to create different type of relationships between miRNA/TF/Gene. Easy to use. Probably need to filter the output sometime due to the big networks that can result from an analysis.
Diana-TarBase v7.0: Database for validated miRNA targets. Many filter options. Good for small candidate miRNAs set studies.
StarScan: Database to browse the targets of miRNAs from degradome data. It has a fancy interface, and many species and data from GEO.
miRtex gives targets from literature. Good for finding validated targets to help discussion in papers or further functional experiment based on new hypothesis.
piRBase: Database for piRNA annotation and function. Published last year, for now the best I can find out there.
chimira: Web tool to analyze isomiR. It gives you a quick idea of you samples.
MicroCosm: MiRNA target database. Updated and download option.
IsomiR Bank: isomiR database from many species and tissues. For single queries is useful.
miRVaS : tools to predict the functional changed due to nt changes in the miRNA sequence.
Naturally existing isoforms of miR-222 have distinct functions: this work demonstrates the capacity for 3' isomiRs to mediate differential functions, we contend more attention needs to be given to 3' variance given the prevalence of this class of isomiR.
miR-142-3p isomiR: "We furthermore demonstrate that miRNA 5′-end variation leads to differential targeting and can thus broaden the target range of miRNAs."
A highly expressed miR-101 isomiR is a functional silencing small RNA.
A challenge for miRNA: multiple isomiRs in miRNAomics.
miR-183-5p isomiR changes in breast cancer. Validated target regulation of new genes different from the reference miRNA.
A comprehensive survey of 3' animal miRNA modification events and a possible role for 3' adenylation in modulating miRNA targeting effectiveness.
PAPD5-mediated 3′ adenylation and subsequent degradation of miR-21 is disrupted in proliferative disease.
High-resolution analysis of the human retina miRNome reveals isomiR variations and novel microRNAs.
Sequence features of Drosha and Dicer cleavage sites affect the complexity of isomiRs.
Knowledge about the presence or absence of miRNA isoforms (isomiRs) can successfully discriminate amongst 32 TCGA cancer types
A novel piRNA mechanism in regulating gene expression in highly differentiated somatic cells.
Differential and coherent processing patterns from small RNAs to detect changes in profiles of processing small RNAs.
Survey of 800+ datasets from human tissue and body fluid reveals XenomiRs are likely artifacts
Identification of factors involved in target RNA-directed microRNA degradation.
miRQC: work studying the accuracy and specificity of different technologies to detect miRNAs.
Important features affecting the detection of small RNA biomarkers: How the sample can affect the detection of biomarkers (like RIN value, concentration, ...)
Comparison of alignment and normalization . I will take the message that TMM and DESeq/2 normalization are the best to avoid strong bias if we consider to have a small proportion of DE miRNAs. For the alignments, here you have another comparison for miRNAs annotation: https://rawgit.com/lpantano/tools-mixer/master/mirna/mirannotation/stats.html
review of tools for detect miRNA-disease network.
review of tools for miRNA de-novo and interaction analysis
Evaluation of microRNA alignment techniques BIG meeting on Dec,3 2015: bcbio-srnaseq-BIG-20151203.pdf
Visit GitHub code
I am in the process to document all classes and methods
Lorena Pantano, Francisco Pantano, Eulalia Marti
2023, Lorena Pantano
January 21, 2023 | 1.2 |