RSEM-GENERATE-NGVECTOR(1) | User Contributed Perl Documentation | RSEM-GENERATE-NGVECTOR(1) |
rsem-generate-ngvector - Create Ng vector for EBSeq based only on transcript sequences.
rsem-generate-ngvector [options] input_fasta_file output_name
This program generates the Ng vector required by EBSeq for isoform-level differential expression analysis, based on reference sequences only. EBSeq can take variance due to read mapping ambiguity into consideration by grouping isoforms according to the number of isoforms of their parent gene. However, for a de novo assembled transcriptome, it is hard to obtain an accurate gene-isoform relationship. Instead, this program groups isoforms by measuring read mapping ambiguity directly. First, it calculates the 'unmappability' of each transcript. The 'unmappability' of a transcript is the ratio between the number of its k-mers that have at least one perfect match to other transcripts and the total number of k-mers in the transcript, where k is a parameter. Then, the Ng vector is generated by applying the k-means algorithm to the 'unmappability' values, with the number of clusters set to 3. 'rsem-generate-ngvector' makes sure that the mean 'unmappability' scores of the clusters are in ascending order. All transcripts whose lengths are less than k are assigned to cluster 3.
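The procedure described above can be illustrated with a small Python sketch. It is not the code used by 'rsem-generate-ngvector'. The function names, the forward-strand-only k-mer matching, and the evenly spaced k-means initialisation are all assumptions made for illustration.

```python
from collections import defaultdict


def unmappability(transcripts, k):
    """Map each transcript name to its unmappability ratio.

    Transcripts shorter than k have no k-mers and map to None.
    Assumption: only exact forward-strand matches are counted.
    """
    # Record which transcripts contain each k-mer.
    owners = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    scores = {}
    for name, seq in transcripts.items():
        total = len(seq) - k + 1
        if total <= 0:
            scores[name] = None
            continue
        # A k-mer is ambiguous if some other transcript also contains it.
        shared = sum(1 for i in range(total) if len(owners[seq[i:i + k]]) > 1)
        scores[name] = shared / total
    return scores


def kmeans_1d(values, n_clusters=3, n_iter=100):
    """Plain Lloyd's k-means on scalars; returns 1-based labels whose
    cluster centres are in ascending order.

    Assumption: centres start evenly spaced between min and max.
    """
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * j / (n_clusters - 1) for j in range(n_clusters)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda j: abs(v - centres[j]))
                  for v in values]
        updated = []
        for j in range(n_clusters):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous centre.
            updated.append(sum(members) / len(members) if members else centres[j])
        if updated == centres:
            break
        centres = updated
    # Relabel so that cluster 1 has the smallest centre.
    order = sorted(range(n_clusters), key=lambda j: centres[j])
    rank = {old: new + 1 for new, old in enumerate(order)}
    return [rank[lab] for lab in labels]


def ng_vector(transcripts, k, n_clusters=3):
    """One cluster label per transcript, in input order."""
    scores = unmappability(transcripts, k)
    names = [n for n in transcripts if scores[n] is not None]
    labels = dict(zip(names, kmeans_1d([scores[n] for n in names], n_clusters))) if names else {}
    # Transcripts shorter than k go to the last cluster (3 by default).
    return [labels.get(n, n_clusters) for n in transcripts]


# Toy input (hypothetical sequences), k = 4.
toy = {"t1": "ACGTACGT", "t2": "ACGTTTTT", "t3": "GGGGGGGG", "t4": "AC"}
print(unmappability(toy, 4))
print(ng_vector(toy, 4))
```

In the toy input, 't1' and 't2' share the k-mer 'ACGT', so both get a non-zero score, while 't4' is shorter than k and falls into cluster 3 without being scored.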
If your reference is a de novo assembled transcript set, you should run 'rsem-generate-ngvector' first. Then load the resulting 'output_name.ngvec' into R, for example with:
NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
After that, replace 'IsoNgTrun' with 'NgVec' in the second line of section 3.2.5 (Page 10) of EBSeq's vignette:
IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, ...)
This program needs to be run only once per RSEM reference.
Suppose the reference sequence file is '/ref/mouse_125/mouse_125.transcripts.fa' and we set output_name to 'mouse_125':
rsem-generate-ngvector /ref/mouse_125/mouse_125.transcripts.fa mouse_125
2022-09-19 | perl v5.34.0 |