Hello everyone,
I have a set of GTF files for some long-read RNA samples (PacBio Sequel II) from the mouse ENCODE project. I am trying to get transcript counts so that I can test for differential expression and differential transcript usage. However, I noticed that novel transcripts are assigned identifiers that are specific to each sample library. So I am trying to produce a unified GTF file in which identical novel transcripts receive the same identifier across all samples, so that I can generate my counts consistently.
I used gffcompare for the merge, and it worked for transcripts that are identical in every respect (start, end, exon number, exon coordinates, and intron chain). My question is about a different case: gffcompare also merges some transcripts that have identical intron chains but different lengths, because the end of the terminal exon (the 3' end) differs among them. Are these actually biologically distinct transcripts? Should I treat them as one transcript or keep them as distinct isoforms? If they should be kept distinct, how can I do that?
Here is an example to illustrate the problem:
q3:ENSMUSG00000028284.13|ENCLB795KSJT000172206|----> length:964
q4:ENSMUSG00000028284.13|ENCLB483YFTT000202727|-----> length:3804
q6:ENSMUSG00000028284.13|ENCLB728YOBT000208749|-----> length:2196
They all have an identical start, exon structure, exon coordinates, and splice junctions; they differ only in the end position of the terminal exon.
Thank you for helping out.
Well, thank you so much for your response. I did some research, and apparently the explanation is that terminal exons contain the UTR. When I looked at a number of these transcripts, they appeared to have different 3' UTRs but the same CDS sequence in the last exon. Since my research is based on junction analysis, I may combine them into one transcript, which shouldn't be a problem for this particular kind of analysis.
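In case it helps anyone doing the same junction-based grouping, here is a minimal Python sketch of the idea: transcripts get the same unified identifier when they share chromosome, strand, and intron chain, regardless of where the terminal exons start or end. This is only an illustration under my own assumptions, not how gffcompare works internally. The function names, the `NOVEL` prefix, and the demo coordinates are all made up for the example, and single-exon transcripts (which have no introns) are simply grouped by exact coordinates here.

```python
import re
from collections import defaultdict


def read_exons(gtf_lines):
    """Collect exon intervals and (chrom, strand) per transcript_id."""
    exons = defaultdict(list)
    location = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        tid = match.group(1)
        exons[tid].append((int(fields[3]), int(fields[4])))
        location[tid] = (fields[0], fields[6])
    return exons, location


def intron_chain(exon_list):
    """Return introns as (start, end) pairs between consecutive sorted exons."""
    ex = sorted(exon_list)
    return tuple((ex[i][1] + 1, ex[i + 1][0] - 1) for i in range(len(ex) - 1))


def unify_ids(gtf_lines, prefix="NOVEL"):
    """Map each transcript_id to an ID shared by all transcripts with the same intron chain."""
    exons, location = read_exons(gtf_lines)
    groups = defaultdict(list)
    for tid, ex in exons.items():
        chain = intron_chain(ex)
        # Assumption: mono-exon transcripts are grouped by exact coordinates only.
        mono = tuple(sorted(ex)) if not chain else ()
        chrom, strand = location[tid]
        groups[(chrom, strand, chain, mono)].append(tid)
    mapping = {}
    for n, key in enumerate(sorted(groups), start=1):
        for tid in groups[key]:
            mapping[tid] = f"{prefix}_{n:06d}"
    return mapping


def demo_exon(tid, start, end):
    """Build one made-up GTF exon line for the demo below."""
    attrs = f'gene_id "g1"; transcript_id "{tid}";'
    return "\t".join(["chr4", "demo", "exon", str(start), str(end), ".", "+", ".", attrs])


# Made-up coordinates: t1 and t2 differ only in the 3' end of the last exon,
# while t3 uses a different splice site in its second exon.
demo_lines = [
    demo_exon("t1", 100, 200), demo_exon("t1", 300, 400), demo_exon("t1", 500, 600),
    demo_exon("t2", 100, 200), demo_exon("t2", 300, 400), demo_exon("t2", 500, 900),
    demo_exon("t3", 100, 200), demo_exon("t3", 350, 400), demo_exon("t3", 500, 600),
]
demo_mapping = unify_ids(demo_lines)
```

With real data, `unify_ids(open(path))` would be called on the concatenated per-sample GTFs (assuming transcript IDs are unique across samples). If the 3' end differences turn out to matter later, the transcript end coordinate could be added to the grouping key to keep those isoforms separate.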
OK, this makes sense!