NAME

QTLtools trans - trans QTL analysis

SYNOPSIS

DESCRIPTION

This mode maps trans (distal) quantitative trait loci (QTLs) that affect the phenotypes, using linear regression. The method is detailed in <https://www.nature.com/articles/ncomms15452>. We first regress out the provided covariates from the phenotype data, followed by running the linear regression between the phenotype residuals and the genotype. If --normal and --cov are provided at the same time, then the residuals after the covariate correction are rank normal transformed. It incorporates an efficient permutation scheme. You can run a nominal pass (--nominal) listing all genotype-phenotype associations below a certain threshold, a permutation pass (--permute or --sample no_genes_to_sample) to empirically characterize the null distribution of associations, or adjust the nominal p-values based on permutations (--adjust).

In the full permutation scheme (--permute) we permute all phenotypes using the same random number sequence to preserve the correlation structure. By doing so, the only association we actually break in the data is between the genotype and the phenotype data. Then, we proceed with a standard association scan identical to the one used in the nominal pass. In practice, we repeat this for 100 permutations of the phenotype data. Subsequently, we can proceed with FDR correction by ranking all the nominal p-values in ascending order and by counting how many p-values in the permuted data sets are smaller. This provides an FDR estimate: if we have 500 p-values in the permuted data sets that are smaller than the 100th smallest nominal p-value, we can then assume that the FDR for the 100 first associations is around 5% (=500/(100 × 100)).
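The FDR arithmetic above can be sketched in a few lines of Python. This is an illustration of the counting rule only, not QTLtools code: the function name `permutation_fdr` and the toy p-values are assumptions made for the example, and the p-values would in practice be collected from the nominal and permuted output files.

```python
from bisect import bisect_left

def permutation_fdr(nominal_pvalues, permuted_pvalues, n_permutations):
    """Estimate the FDR at each rank of the sorted nominal p-values.

    For the k-th smallest nominal p-value p_k, the estimate is
    (number of permuted p-values smaller than p_k) / (n_permutations * k).
    """
    nominal = sorted(nominal_pvalues)
    permuted = sorted(permuted_pvalues)
    estimates = []
    for k, p in enumerate(nominal, start=1):
        # bisect_left counts the permuted p-values strictly smaller than p
        n_smaller = bisect_left(permuted, p)
        estimates.append((p, min(1.0, n_smaller / (n_permutations * k))))
    return estimates

# Toy numbers mirroring the text (made up for illustration): 100 nominal hits,
# 100 permutations, 500 permuted p-values below the 100th smallest nominal one.
nominal = [i * 1e-8 for i in range(1, 101)]
permuted = [5.55e-7] * 500 + [1e-3] * 2000
p_100, fdr_100 = permutation_fdr(nominal, permuted, 100)[-1]
print(p_100, fdr_100)
```

The sketch does not force the estimates to be monotone across ranks; a real analysis may want to do so before choosing a cutoff.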

To enable fast screening in trans, we also designed an approximation of the method described above, based on what we already do in cis. To make this possible, we assume that the phenotypes are independent and normally distributed (which can be enforced with --normal). The idea is that, since all phenotypes are normally distributed, they are effectively interchangeable, and the cis region removed from each phenotype is so small compared to the rest of the genome that its phenotype-specific impact is negligible. Hence the number of variants tested, and the correlation amongst them, is approximately the same for every phenotype. We can therefore run permutations with a small number of phenotypes rather than all of them, which drastically decreases the computational burden, and apply the resulting null distribution to all phenotypes.

The implementation draws from the null by permuting some randomly chosen phenotypes, testing for associations with all variants in trans and storing the smallest p-value. Repeating this many times (typically 1000) effectively builds a null distribution of the strongest associations for a single phenotype. We then make it continuous by fitting a beta distribution, as we do in cis, and use it to adjust every nominal p-value coming from the initial pass for the number of variants being tested. To correct for the number of phenotypes being tested, we estimate FDR as we do in cis, that is, from the best adjusted p-value of each phenotype. This also gives an adjusted p-value threshold that we use to identify all phenotype-variant pairs that are whole-genome significant. In our experiments, this approach gives similar results to the full permutation scheme, both in terms of FDR estimates and number of discoveries, while running faster.
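To illustrate how a null distribution of best p-values adjusts a nominal p-value for the number of variants tested, here is a minimal Python sketch. Note that it uses the plain empirical estimate (r + 1) / (n + 1), not the beta fit described above; the beta fit can be seen as a smoothed version of this that is not limited to a resolution of 1 / (n + 1). The function name `empirical_adjust` and the null values are hypothetical.

```python
from bisect import bisect_right

def empirical_adjust(nominal_p, null_best_pvalues_sorted):
    """Adjust one nominal p-value against a sorted null of best p-values.

    Returns (r + 1) / (n + 1), where r is the number of null draws whose best
    p-value is at most nominal_p and n is the total number of null draws.
    """
    n = len(null_best_pvalues_sorted)
    r = bisect_right(null_best_pvalues_sorted, nominal_p)
    return (r + 1) / (n + 1)

# Hypothetical null: the best p-value from each of 1000 permuted phenotypes.
null_best = sorted(1e-7 * (i + 1) for i in range(1000))

strong = empirical_adjust(5e-9, null_best)       # below every null draw
typical = empirical_adjust(5.055e-5, null_best)  # near the middle of the null
print(strong, typical)
```

A nominal p-value smaller than every null draw gets the smallest possible adjusted value, 1 / (n + 1), which is why a continuous fit is useful when many phenotypes are tested.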

Since linear regression assumes normally distributed data, we highly recommend using the --normal option to rank normal transform the phenotype quantifications in order to avoid false positive associations due to outliers. If you are using the approximate permutation scheme (--sample) you MUST use the --normal option or make sure that your phenotypes are normally distributed.
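The effect of a rank normal transformation on an outlier can be sketched as follows. This is a generic rank-based inverse normal transform written for illustration; the rank offset of 0.5 and the average-rank handling of ties are assumptions and may differ from what --normal does internally.

```python
from statistics import NormalDist

def rank_normal_transform(values, offset=0.5):
    """Map values to standard normal quantiles of their ranks (ties share the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at sorted position i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / n) for r in ranks]

# Made-up quantifications; the last value is an extreme outlier.
phenotype = [2.1, 1.9, 2.3, 2.0, 250.0]
transformed = rank_normal_transform(phenotype)
print([round(x, 3) for x in transformed])
```

After the transform the outlier keeps its rank (it is still the largest value) but no longer dominates the regression, because only the ordering of the samples is retained.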

OPTIONS

--vcf [in.vcf|in.bcf|in.vcf.gz|in.bed.gz]: Genotypes in VCF/BCF format, or another molecular phenotype in BED format. If there is a DS field in the genotype FORMAT of a variant (dosage of the genotype calculated from genotype probabilities, e.g. after imputation), then this is used as the genotype. If there is only the GT field in the genotype FORMAT then this is used and it is converted to a dosage. REQUIRED.
--bed quantifications.bed.gz: Molecular phenotype quantifications in BED format. REQUIRED.
--out output.txt: Output file. REQUIRED.
--cov covariates.txt: Covariates to correct the phenotype data with.
--normal: Rank normal transform the phenotype data so that each phenotype is normally distributed. RECOMMENDED.
--window integer: Size of the cis window to remove flanking each phenotype's start position. DEFAULT=5000000.
--threshold float: P-value threshold below which hits are reported. Give 1.0 to print everything, which may generate a huge file. When --adjust is provided, this threshold applies to the adjusted p-values. DEFAULT=1e-5.
--bins integer: Number of bins to use to categorize all p-values above --threshold. DEFAULT=1000.
--nominal: Calculate the nominal p-value for the genotype-phenotype associations and print out the ones that pass the provided threshold. Mutually exclusive with --permute, --sample and --adjust.
--permute: Permute all phenotypes together, once. For multiple permutations you need to change the random seed using --seed for each permutation. Mutually exclusive with --nominal, --sample and --adjust.
--sample integer: Permute randomly chosen phenotypes integer times. Mutually exclusive with --nominal, --permute, --adjust, and --chunk.
--adjust filename: Test and adjust p-values using the null distribution in filename. Mutually exclusive with --nominal, --permute, and --sample.
--chunk integer1 integer2: For parallelization. Divide the data into integer2 chunks and process chunk number integer1. The number of chunks must be at least the number of chromosomes in the --bed file.
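As an illustration of how --chunk might be used to split a nominal pass into independent jobs, the following Python sketch builds one command line per chunk. It only assembles strings from the options listed above; the helper name `chunk_commands`, the per-chunk output naming, and the assumption that chunk numbers start at 1 are not taken from this manual.

```python
import shlex

def chunk_commands(vcf, bed, out_prefix, n_chunks, extra_args=("--nominal", "--normal")):
    """Build one QTLtools trans command line per chunk.

    Assumes chunk numbers run from 1 to n_chunks; check this against your
    QTLtools version before submitting jobs.
    """
    commands = []
    for chunk in range(1, n_chunks + 1):
        args = ["QTLtools", "trans", "--vcf", vcf, "--bed", bed, *extra_args,
                "--chunk", str(chunk), str(n_chunks),
                "--out", f"{out_prefix}.chunk{chunk}"]  # hypothetical naming scheme
        # shlex.quote protects file names that contain spaces or shell metacharacters
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands

commands = chunk_commands("genotypes.chr22.vcf.gz", "genes.simulated.chr22.bed.gz",
                          "trans.nominal", 3)
for command in commands:
    print(command)
```

Each printed line can then be piped to the job submission command of your cluster. Remember that --chunk cannot be combined with --sample.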

OUTPUT FILES

.hits.txt.gz

Space separated results output file detailing the variant-phenotype pairs that pass the threshold with the following columns:

1	The phenotype ID
2	The phenotype chromosome
3	Start position of the phenotype
4	The variant ID
5	The variant chromosome
6	The start position of the variant
7	The nominal p-value of the association between the variant and the phenotype.
8	The adjusted p-value of the association between the variant and the phenotype. Requires --adjust
9	Correlation coefficient
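The nine columns listed above can be read into typed records with a short Python sketch. It assumes that the file has no header line and that the adjusted p-value column holds a non-numeric placeholder such as NA when --adjust was not used; both are assumptions to verify on your own output. The IDs and values in the example line are made up.

```python
import gzip
import math
from typing import NamedTuple, Optional

class Hit(NamedTuple):
    phenotype_id: str
    phenotype_chr: str
    phenotype_start: int
    variant_id: str
    variant_chr: str
    variant_start: int
    nominal_p: float
    adjusted_p: Optional[float]  # None when no adjusted p-value is available
    correlation: float

def parse_hit(line):
    """Parse one whitespace-separated line of a .hits.txt.gz file."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    try:
        adjusted = float(fields[7])
    except ValueError:
        adjusted = None  # placeholder such as NA (assumed)
    if adjusted is not None and math.isnan(adjusted):
        adjusted = None
    return Hit(fields[0], fields[1], int(fields[2]), fields[3], fields[4],
               int(fields[5]), float(fields[6]), adjusted, float(fields[8]))

def read_hits(path):
    """Read every non-empty line of a gzip-compressed hits file."""
    with gzip.open(path, "rt") as handle:
        return [parse_hit(line) for line in handle if line.strip()]

example = parse_hit("geneA chr22 17565844 snp_1_1000 chr1 1000 3.2e-09 NA 0.41")
print(example)
```

For a run started with --out trans.nominal, the file to pass to `read_hits` would presumably be trans.nominal.hits.txt.gz.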

.best.txt.gz

Space separated output file listing the most significant variant per phenotype.

1	The phenotype ID
2	The adjusted p-value of the association between the variant and the phenotype. Requires --adjust
3	The nominal p-value of the association between the variant and the phenotype.
4	The variant ID

.bins.txt.gz

Space separated output file containing the binning of all associations with a p-value above the specified --threshold (see --bins).

1	The index of the bin
2	The lower bound of the correlation coefficient for this bin
3	The upper bound of the correlation coefficient for this bin
4	The upper bound of the p-value for this bin
5	The lower bound of the p-value for this bin

FULL PERMUTATION ANALYSIS EXAMPLE

1

Run a nominal analysis, rank normal transforming the phenotypes and outputting all associations with a p-value below 1e-5:

QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --nominal --normal --out trans.nominal

2

Run a full permutation analysis as 100 jobs on a compute cluster, making sure that you change the seed for each permutation (replace qsub with the command of your job submission system [bsub, psub, etc.]):

for j in $(seq 1 100); do
  echo "QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --permute --normal --out trans.perm$j.txt --seed $j" | qsub
done

APPROXIMATE PERMUTATION ANALYSIS EXAMPLE

1: Build the null distribution by randomly selecting 1000 phenotypes, rank normal transforming the phenotypes:
QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --sample 1000 --normal --out trans.sample
2: Run the nominal pass, adjusting the p-values with the given null distribution, rank normal transforming the phenotypes, and printing out associations with an adjusted p-value less than 0.1:
QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --adjust trans.sample.best.txt.gz --threshold 0.1 --normal --out trans.adjust

BUGS

o: Versions up to and including 1.2 suffer from a bug in reading missing genotypes in VCF/BCF files. This bug affects variants that have a DS field in their genotype's FORMAT and a missing genotype (DS field is .) in one of the samples, in which case the genotypes for all samples are set to missing, effectively removing the variant from the analyses.

Please submit bugs to <https://github.com/qtltools/qtltools>

CITATION

Delaneau, O., Ongen, H., Brown, A. et al. A complete tool set for molecular QTL discovery and analysis. Nat Commun 8, 15452 (2017). <https://doi.org/10.1038/ncomms15452>

AUTHORS

Halit Ongen (halitongen@gmail.com), Olivier Delaneau (olivier.delaneau@gmail.com)