Question

GTF merging with gffcompare

4 weeks ago

Mai.Nabil • 0

Hello everyone,

I have a set of GTF files for RNA samples sequenced on the PacBio Sequel II, from the mouse ENCODE project. I am trying to get transcript counts so that I can test for differential expression and differential transcript usage. However, I noticed that novel transcripts carry identifiers that are specific to each sample library. So I am trying to produce a unified GTF file in which identical novel transcripts receive the same identifier across all samples, so that I can generate my counts consistently.

I used gffcompare for the merge, and it worked for transcripts that are identical in everything: start, end, exon number, exon coordinates, and intron chain.

My question: I noticed that gffcompare also merges some transcripts that have identical intron chains but different lengths, because the end of the terminal exon (the 3' end) differs among them. Are these actually biologically distinct transcripts? Should I work with them as one transcript or consider keeping them as distinct isoforms? If this is the case, how to do that? Here is an example to illustrate the problem:

q3:ENSMUSG00000028284.13|ENCLB795KSJT000172206|----> length:964
q4:ENSMUSG00000028284.13|ENCLB483YFTT000202727|-----> length:3804
q6:ENSMUSG00000028284.13|ENCLB728YOBT000208749|-----> length:2196

They all have identical start, exon structure, exon coordinates, and splice junctions; they differ only in the end position of the terminal exon.

Thank you for helping out.

gtf merging annotation RNA-Long reads transcript • 340 views

ADD COMMENT • link updated 4 weeks ago by Carlo Yague 8.9k • written 4 weeks ago by Mai.Nabil • 0

score 0 · Answer 1 · 2024-10-19

4 weeks ago

Carlo Yague 8.9k

Are these actually biologically distinct transcripts?

Yes! Transcripts with different 3' ends can be biologically distinct. Alternative 3' end processing (sometimes called APA, alternative polyadenylation) can have profound implications for the life of the RNA (stability, export, translation, ...) even if the coding sequence is not affected.

Should I work with them as one transcript or consider keeping them as distinct isoforms?

Good question and, generally, it's up to you. It depends on whether you want to do gene-level or transcript-level analysis. Gene-level analysis can be good enough to capture a general transcriptional signature, i.e., gene induction or repression in response to genetic or environmental changes. But merging transcripts at the gene level makes you miss transcript-level information (differential transcript usage, APA, ...).
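If you go the transcript-level route and want to keep 3'-end variants apart, one possible approach is to subdivide transcripts that share an intron chain by their 3' end coordinate. The sketch below is only an illustration, not a gffcompare feature: the function name `cluster_ends`, the transcript IDs, the coordinates and the 50 nt tolerance are all assumptions. Since long-read 3' ends are noisy, the tolerance would need tuning on real data.

```python
def cluster_ends(ends, tolerance=50):
    """Group transcripts whose 3' end coordinates lie close together.

    ends: {transcript_id: 3' end coordinate}. The caller is assumed to pass
    the correct coordinate for the strand (the exon end on '+', the exon
    start on '-').
    Transcripts are sorted by coordinate; a new cluster starts whenever the
    gap to the previous transcript exceeds `tolerance` (hypothetical default).
    """
    clusters = []
    for tx_id, end in sorted(ends.items(), key=lambda item: item[1]):
        if clusters and end - clusters[-1][-1][1] <= tolerance:
            clusters[-1].append((tx_id, end))
        else:
            clusters.append([(tx_id, end)])
    return [[tx_id for tx_id, _ in cluster] for cluster in clusters]

# Made-up 3' ends for three transcripts sharing one intron chain.
example_ends = {"txA": 1000, "txB": 1020, "txC": 3800}
end_clusters = cluster_ends(example_ends)
print(end_clusters)
```

Each resulting cluster could then be given its own transcript identifier, so that APA isoforms stay separate in the count table.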

how to do that?

To do gene-level analysis, you can aggregate the counts based on the gene identifier regardless of the isoform.
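A minimal sketch of that aggregation, assuming the transcript labels look like `GENE_ID|TRANSCRIPT_ID|` as in the example from the question. The counts and sample names (`s1`, `s2`) are made up for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict

def gene_id_from_label(label):
    """Take the gene ID from a label shaped like 'GENE|TRANSCRIPT|' (assumed layout)."""
    return label.split("|")[0]

def aggregate_to_genes(transcript_counts):
    """Sum per-sample transcript counts into per-sample gene counts.

    transcript_counts: {transcript_label: {sample: count}}
    returns:           {gene_id: {sample: count}}
    """
    gene_counts = defaultdict(lambda: defaultdict(int))
    for label, per_sample in transcript_counts.items():
        gene = gene_id_from_label(label)
        for sample, count in per_sample.items():
            gene_counts[gene][sample] += count
    return {gene: dict(samples) for gene, samples in gene_counts.items()}

# Made-up counts, for illustration only.
example_counts = {
    "ENSMUSG00000028284.13|ENCLB795KSJT000172206|": {"s1": 5, "s2": 0},
    "ENSMUSG00000028284.13|ENCLB483YFTT000202727|": {"s1": 2, "s2": 7},
}
gene_level = aggregate_to_genes(example_counts)
print(gene_level)
```

With a real count matrix, the same idea is a group-by on the gene column followed by a sum.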

ADD COMMENT • link 4 weeks ago by Carlo Yague 8.9k

Well, thank you so much for your response. I did some more reading, and apparently the issue is that terminal exons contain the UTR. When I looked at a number of these transcripts, they had different UTRs but the same CDS sequence in the last exon. Since my research is based on junction analysis, I may combine them into one transcript; this shouldn't be a problem for this particular kind of analysis.
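Combining transcripts for a junction-based analysis amounts to grouping them by intron chain. A rough sketch of that grouping, assuming exons are available as 1-based inclusive (start, end) pairs as in GTF; the transcript IDs and coordinates below are hypothetical. Single-exon transcripts have no introns, so this sketch keeps each of them in its own group instead of lumping them together.

```python
from collections import defaultdict

def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exons, as a tuple."""
    ordered = sorted(exons)
    return tuple((left[1] + 1, right[0] - 1)
                 for left, right in zip(ordered, ordered[1:]))

def group_by_intron_chain(transcripts):
    """Group transcript IDs that share chromosome, strand and intron chain.

    transcripts: {tx_id: (chrom, strand, [(start, end), ...])}
    """
    groups = defaultdict(list)
    for tx_id, (chrom, strand, exons) in transcripts.items():
        chain = intron_chain(exons)
        # No introns: key on the transcript itself so it is never merged.
        key = (chrom, strand, chain if chain else ("single_exon", tx_id))
        groups[key].append(tx_id)
    return dict(groups)

# Hypothetical transcripts: txA and txB differ only in the terminal exon end.
example_transcripts = {
    "txA": ("chr4", "+", [(100, 200), (300, 400), (500, 964)]),
    "txB": ("chr4", "+", [(100, 200), (300, 400), (500, 3804)]),
    "txC": ("chr4", "+", [(100, 200), (350, 400), (500, 900)]),
}
chain_groups = group_by_intron_chain(example_transcripts)
print(sorted(chain_groups.values()))
```

Counts for the members of each group can then be summed under one shared identifier.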