Hello,
I have a conceptual question about miRNAseq data from TCGA and the relationship between isoform quantitation and gene expression. I want to get the total number of reads for each mature miRNA sequence. For example, hsa-let-7a-5p (MIMAT0000062) has 3 isoforms (let-7a-1, let-7a-2, and let-7a-3), and I want to sum the read counts across them to get an aggregate number.
I downloaded the PRAD data from GDAC Firehose. It is in TCGA format, but all samples are in the same file. When I sum the read counts for each miR isoform, and sum the reads-per-million (RPM) in the same manner, the summed RPM matches the GDAC Firehose mature pre-process file, which has RPM data for each mature miRNA. This suggests I am aggregating the data correctly to get count data, assuming the Broad people know their stuff. I've also read a paper that does the same.
What I don't understand is how an identical sequence can be mapped back to a unique location. I don't want to double- or triple-count (in this case) reads when assigning counts to a mature miRNA.
For example, these regions for let-7a-5p are identical mature miRNA sequences:
hsa-let-7a-1 isoform: hg19:9:96938244-96938265:+ (21362 reads in the miR isoform file)
hsa-let-7a-3 isoform: hg19:22:46508632-46508653:+ (21189 reads in the miR isoform file)
In fact, the read counts never match between the two isoforms, and these reads are not flagged as cross-mapped. How are the reads assigned to the correct isoform?
Why are the read counts not identical, since the sequences are identical and of the same length? There are reads for hg19:9:96938244-96938266:+ and other adjacent sequences, so it is not like additional nucleotides are being used to help map the sequence.
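For reference, the aggregation I describe amounts to roughly the following. This is only a minimal sketch, not my actual script. It assumes the isoform file is tab-separated with the columns `miRNA_ID`, `isoform_coords`, `read_count`, `reads_per_million_miRNA_mapped`, `cross-mapped`, and `miRNA_region`, and that mature rows are labelled `mature,<MIMAT accession>`. The function name `sum_by_mature` is my own. The two let-7a read counts are the ones quoted above; the RPM values and the other two rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the column layout I am assuming for the isoform file.
# Only the two let-7a read counts come from my data; everything else
# (RPM values, the "example" miRNA, the precursor row) is invented.
TOY_ISOFORM_TSV = (
    "miRNA_ID\tisoform_coords\tread_count\t"
    "reads_per_million_miRNA_mapped\tcross-mapped\tmiRNA_region\n"
    "hsa-let-7a-1\thg19:9:96938244-96938265:+\t21362\t1000.0\tN\tmature,MIMAT0000062\n"
    "hsa-let-7a-3\thg19:22:46508632-46508653:+\t21189\t992.0\tN\tmature,MIMAT0000062\n"
    "hsa-mir-example\thg19:1:100-121:+\t500\t23.5\tN\tmature,MIMAT9999999\n"
    "hsa-let-7a-1\thg19:9:96938200-96938220:+\t7\t0.3\tN\tprecursor\n"
)


def sum_by_mature(handle):
    """Sum read counts and RPM over all rows sharing a mature accession."""
    totals = defaultdict(lambda: {"read_count": 0, "rpm": 0.0})
    for row in csv.DictReader(handle, delimiter="\t"):
        region = row["miRNA_region"]
        # Skip rows that are not annotated as a mature miRNA
        # (e.g. "precursor"), since they carry no accession.
        if not region.startswith("mature,"):
            continue
        accession = region.split(",", 1)[1]
        totals[accession]["read_count"] += int(row["read_count"])
        totals[accession]["rpm"] += float(row["reads_per_million_miRNA_mapped"])
    return dict(totals)


totals = sum_by_mature(io.StringIO(TOY_ISOFORM_TSV))
print(totals["MIMAT0000062"]["read_count"])  # 21362 + 21189 = 42551
```

Because the Firehose file holds all samples together, the real version would also have to group by sample barcode (whatever that column is called in the file) before summing. The sketch sums every mature row regardless of the cross-mapped flag, which is exactly where my double-counting worry comes from.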
I have searched Biostars, the TCGA website, the GDAC website, and Google to no avail. I even read the data processing description on the Synapse website, but that didn't help, since the reads are not flagged as cross-mapped.
If nobody knows the answer, any ideas on where to ask next?
Thanks in advance!