14 months ago
serodyc
I am attempting to count the number of reads from each isoform of the Rt-GEF gene in Drosophila across multiple SAM files. My SAM files currently list reads by their coordinates on chromosomes, such as:
VH00562:14:AAANWG5HV:1:1101:26828:1568 1:N:0:ACTAAGAT+GCGGTTGT 99 chrX 5075295 42 50= = 5075328 83 CTTTTAAAAAAAAATCAATACCTTACTTAAACTAACTATGCAAAAAATCG CCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCCCCCCCCCCCCC NM:i:0 AM:i:42
However, I am trying to get my data in a format that converts these to the isoform expressed by the read, ideally using a FlyBase ID so that I can count them. Is there any way to change the files to a more readable format? Thanks!
Isoforms of a gene typically share most of their exonic content. Hence, just because a read covers an exon of an isoform, it does not mean that this isoform is "expressed". Most reads per gene are actually ambiguous in terms of which isoform they map to. Imagine an exon is perfectly shared between two isoforms and the read maps to that exon: you cannot tell which isoform it comes from. A better way is to use tools like salmon or kallisto, which quantify at the isoform/transcript level and use an EM (expectation-maximization) algorithm under the hood to assign ambiguous reads probabilistically and estimate how abundant each isoform is.
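To illustrate why an EM step helps with ambiguous reads, here is a minimal toy sketch in Python. It is not how salmon or kallisto are implemented; it ignores transcript lengths, fragment-length models, and bias correction, and the isoform labels and read counts are made up for illustration. Each read is represented only by the set of isoforms it is compatible with.

```python
def em_isoform_abundance(read_compat, isoforms, n_iter=200):
    """Toy EM: estimate the fraction of reads originating from each isoform.

    read_compat: list of sets, one per read, naming the isoforms the read
                 could have come from (hypothetical input format).
    isoforms:    list of all isoform labels.
    Simplification: transcript (effective) lengths are ignored.
    """
    # Start from a uniform guess.
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        # E-step: split each read across its compatible isoforms
        # in proportion to the current abundance estimates.
        counts = {iso: 0.0 for iso in isoforms}
        for compat in read_compat:
            total = sum(theta[iso] for iso in compat)
            if total == 0:
                continue  # read cannot be explained by current estimates
            for iso in compat:
                counts[iso] += theta[iso] / total
        # M-step: re-estimate abundances from the fractional counts.
        n_assigned = sum(counts.values())
        theta = {iso: c / n_assigned for iso, c in counts.items()}
    return theta


# Made-up example: 6 reads fall in an exon shared by both isoforms,
# 3 reads hit a region unique to isoform_A, 1 read is unique to isoform_B.
reads = (
    [{"isoform_A", "isoform_B"}] * 6
    + [{"isoform_A"}] * 3
    + [{"isoform_B"}] * 1
)
abundance = em_isoform_abundance(reads, ["isoform_A", "isoform_B"])
print(abundance)  # converges toward isoform_A = 0.75, isoform_B = 0.25
```

The uniquely mapping reads (3 versus 1) determine how the 6 shared reads get split, which is something a simple per-exon read count from a SAM file cannot do.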