20 months ago
sohail.mbio
I am analyzing RNA-seq data to quantify transcript variants for a specific gene. The Ensembl database shows that it has 11 transcripts (splice variants). In the first method, I pseudoaligned the RNA-seq data with Kallisto to quantify transcripts. In the second method, I aligned the RNA-seq data against the human genome and then counted features (transcript IDs) from the sorted BAM files using Subread. There is a huge difference between the two methods in the raw counts for the transcripts of that specific gene. Which method is more reliable?
How about pulling out all the reads that map to that gene and inspecting them yourself?
I just looked at the gene structure on the genome browser -- transcript IDs ENST00000418488 and ENST00000446632 look basically the same, apart from a couple of <150 bp exons that differ between the two.
Most reads will map to exons that are shared by those transcripts, resulting in a huge identifiability problem.
I don't think either method is going to give you a correct answer here... Sure, you can get some sort of maximum likelihood estimate and possibly quantify the uncertainty (e.g. with kallisto's bootstrap method) but that's about the best you can do.
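To make the "quantify the uncertainty" part concrete, here is a minimal sketch of how a set of bootstrap estimates could be summarized per transcript. It assumes you have already run kallisto with bootstrapping enabled and loaded each bootstrap replicate into a plain dictionary of estimated counts; the function name `summarize_bootstraps`, the transcript labels, and the numbers are all made up for illustration and are not kallisto output.

```python
import statistics


def summarize_bootstraps(bootstraps):
    """Summarize per-transcript bootstrap estimates.

    bootstraps: list of dicts, one per bootstrap replicate, each mapping
    transcript ID -> estimated count (hypothetical input layout).
    Returns transcript ID -> (mean, standard deviation, coefficient of variation).
    Needs at least two replicates, because a sample standard deviation is used.
    """
    summary = {}
    for tid in bootstraps[0]:
        values = [replicate[tid] for replicate in bootstraps]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        cv = sd / mean if mean > 0 else float("nan")
        summary[tid] = (mean, sd, cv)
    return summary


# Invented numbers: tx_A is stable across replicates, tx_B is not.
example = [
    {"tx_A": 90, "tx_B": 0},
    {"tx_A": 110, "tx_B": 30},
    {"tx_A": 100, "tx_B": 60},
]
for tid, (mean, sd, cv) in summarize_bootstraps(example).items():
    print(tid, mean, round(sd, 2), round(cv, 2))
```

A transcript whose estimate swings widely between replicates (a large coefficient of variation) is one where the reads do not pin down the abundance, which is what you would expect for isoforms that share most of their exons.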
Edit: Time for you to invest in long-read sequencing.
Thank you. This is very helpful. I agree that with short-read sequencing data I cannot answer this question. Since I am interested in only one specific gene, qPCR primers designed for the specific variants may do the job.
I am trying to quantify the gene IL5RA (ENSG00000091181). When I quantify using Kallisto, the counts table shows that transcript ENST00000418488 is dominant (> 90%), whereas the STAR alignment shows that all transcripts are present in fairly even proportions.
I guess the title of the post is a little misleading. I know how both methods work, but I am surprised by the IL5RA transcript quantities from the two methods. Kallisto shows that almost all IL5RA transcripts are ENST00000418488, whereas the STAR alignment shows that all 11 transcripts are detected. PS: I have not looked at any other gene.
If you're familiar with how both methods work, then you should not be surprised by this result. Just think about how the EM procedure of Kallisto quantification (see the RSEM paper for deep detail) would impact reads in exons shared by all transcripts.
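A toy sketch may help here. It is not kallisto's or RSEM's actual code: it ignores effective transcript lengths, fragment length distributions, and sequence bias, and the read numbers are invented. It only illustrates the principle that EM splits shared reads in proportion to the current abundance estimates, so a modest number of reads unique to one isoform can pull nearly all of the shared reads toward it, while a counter that credits a shared read to every transcript it overlaps reports all of them as present.

```python
def em_abundances(classes, n_transcripts, n_iter=200):
    """Toy EM over equivalence classes.

    classes: list of (set of compatible transcript indices, read count).
    Returns the estimated fraction of reads from each transcript.
    Simplification (assumption): all transcripts have equal effective length.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    total = sum(count for _, count in classes)
    for _ in range(n_iter):
        assigned = [0.0] * n_transcripts
        for transcripts, count in classes:
            denom = sum(theta[t] for t in transcripts)
            if denom == 0:
                continue
            # E-step: split the reads by current relative abundance.
            for t in transcripts:
                assigned[t] += count * theta[t] / denom
        # M-step: re-estimate abundances from the fractional assignments.
        theta = [a / total for a in assigned]
    return theta


def overlap_counts(classes, n_transcripts):
    """Credit each read to every transcript it is compatible with."""
    counts = [0] * n_transcripts
    for transcripts, count in classes:
        for t in transcripts:
            counts[t] += count
    return counts


# Invented example: 1000 reads in exons shared by isoforms 0 and 1,
# 100 reads in an exon unique to isoform 0, none unique to isoform 1.
classes = [({0, 1}, 1000), ({0}, 100)]
print(em_abundances(classes, 2))
print(overlap_counts(classes, 2))

# With only shared reads, EM never leaves its starting point:
# the data cannot distinguish the two isoforms at all.
print(em_abundances([({0, 1}, 1000)], 2))
```

In the first example EM drives isoform 1 toward zero, while overlap counting gives the two isoforms similar totals. Neither is "the truth"; they are different answers to an underdetermined question.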
Thanks. This is very helpful. Still, the differences I see are huge. It could be due to the complexity of the IL5RA gene structure, with its highly similar transcript isoforms.
Are there any methods to pull the sequences for specific exons from FASTQ or BAM files? Any suggestions? I have very little knowledge/experience with gene transcript variant analysis.
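One possible approach, sketched here under assumptions: convert the region of interest from the BAM to SAM text (for example with `samtools view` on an indexed BAM) and then keep only reads whose aligned blocks actually touch the exon. The block-level check matters for spliced reads, because a read that jumps over an exon through an intron gap (`N` in the CIGAR string) spans the exon's coordinates without covering it. The function names, the `chr3` contig name, and the coordinates below are hypothetical placeholders, not real IL5RA exon positions.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Return 1-based inclusive reference blocks covered by an alignment.

    M, =, X and D advance along the reference and are treated as covered;
    N (intron skip) advances without covering; I, S, H, P do not advance.
    """
    blocks = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=XD":
            blocks.append((ref, ref + length - 1))
            ref += length
        elif op == "N":
            ref += length
    return blocks


def reads_overlapping(sam_lines, chrom, start, end):
    """Names of mapped reads with an aligned block overlapping chrom:start-end."""
    hits = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 4 or rname != chrom or cigar == "*":  # unmapped or elsewhere
            continue
        blocks = aligned_blocks(pos, cigar)
        if any(b_start <= end and b_end >= start for b_start, b_end in blocks):
            hits.append(qname)
    return hits


# Invented SAM records for illustration only.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr3\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr3\t100\t60\t20M500N30M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(reads_overlapping(sam, "chr3", 130, 140))
print(reads_overlapping(sam, "chr3", 620, 630))
```

Counting reads per isoform-specific exon (or per exon-exon junction) this way gives you something you can inspect by eye in a genome browser, which is a useful sanity check on whatever the quantifiers report.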