Hi, I want to analyse RNA-seq data. The pipeline used is HISAT2 > StringTie (.gtf files) > cuffmerge (merged.gtf file), but I have an issue with the merged.gtf file produced by cuffmerge. I have two lines with the same coordinates (chromosome, start, end), width, strand, etc.; only the transcript_id and the oId differ. In the "contained_in" column, there is <NA> for one exon and TCONS_00000003 for the other. Here is an example of this issue; there are many lines like this for other exons in the file:
seqnames start end width strand source type score phase gene_id transcript_id exon_number gene_name oId nearest_ref class_code tss_id contained_in p_id
chr1 3252757 3253236 480 + Cufflinks exon NA NA XLOC_000003 TCONS_00000004 1 Gm18956 ENSMUST00000192857.1 ENSMUST00000192857.1 = TSS3 TCONS_00000003 <NA>
chr1 3252757 3253236 480 + Cufflinks exon NA NA XLOC_000003 TCONS_00000003 1 Gm18956 CUFF.3.2 ENSMUST00000192857.1 = TSS3 <NA> <NA>
It seems that cuffmerge does not treat these two exons as the same exon, because one comes from a de novo transcript (oId: "CUFF.3.2") and the other comes from a known transcript (oId: "ENSMUST00000192857.1").
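To list every exon affected, not only the pair above, something like the following could be used. It is only a minimal sketch: it assumes merged.gtf is a standard tab-separated 9-column GTF with `key "value";` attributes, the function name `find_duplicate_exons` is hypothetical, and the third sample record is made up to show a non-duplicated exon.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: transcript_id "TCONS_00000003";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def find_duplicate_exons(gtf_lines):
    """Group exon records by (chrom, start, end, strand) and keep only the
    groups that are shared by more than one transcript_id."""
    groups = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are of interest here
        attrs = dict(ATTR_RE.findall(fields[8]))
        key = (fields[0], int(fields[3]), int(fields[4]), fields[6])
        groups[key].append((attrs.get("transcript_id"), attrs.get("oId")))
    return {
        key: members
        for key, members in groups.items()
        if len({transcript for transcript, _ in members}) > 1
    }


# Sample records: the first two reuse the values from the post,
# the third is a made-up exon that is not duplicated.
SAMPLE = [
    'chr1\tCufflinks\texon\t3252757\t3253236\t.\t+\t.\tgene_id "XLOC_000003"; '
    'transcript_id "TCONS_00000004"; exon_number "1"; oId "ENSMUST00000192857.1";',
    'chr1\tCufflinks\texon\t3252757\t3253236\t.\t+\t.\tgene_id "XLOC_000003"; '
    'transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.3.2";',
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "XLOC_000001"; '
    'transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1";',
]

for coords, members in find_duplicate_exons(SAMPLE).items():
    print(coords, members)
```

On the real file, the same function can be called on an open file handle instead of `SAMPLE`, since it only iterates over lines.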
The only message I get when I run cuffmerge is: [bam_header_read] EOF marker is absent. The input is probably truncated. [bam_header_read] invalid BAM binary header (this is not a BAM file).
Do you have a solution for this issue?
Thank you for your answers!
Thanks a lot for your answer, Satya! The issue is that these exons belong to mono-exonic transcripts, so I don't understand why they are duplicated and why they do not have the same transcript_id...
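For reference, the mono-exonic claim can be checked by counting exon records per transcript_id. This is a rough sketch under the same assumption of a standard tab-separated GTF; the helper names `exon_counts` and `mono_exonic` are hypothetical.

```python
import re
from collections import Counter

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]*)"')


def exon_counts(gtf_lines):
    """Count the number of exon records for each transcript_id."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # ignore non-exon or malformed records
        match = TRANSCRIPT_ID_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def mono_exonic(gtf_lines):
    """Return the sorted transcript_ids that have exactly one exon record."""
    return sorted(t for t, n in exon_counts(gtf_lines).items() if n == 1)
```

If both TCONS_00000003 and TCONS_00000004 appear in the result for merged.gtf, each one is a single-exon transcript covering the same interval.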