Hi all,
I built a new annotation file from long-read data with StringTie. When I opened the resulting GTF file, I noticed that every feature except exons and transcripts had disappeared, so all the information on 5' and 3' UTRs, CDS, and start and stop codons is lost.
I wonder whether merging the new GTF with the reference annotation would be a good way to recover this information (at least for the genes that StringTie has not modified). Do you know of any tool that could do that?
I tried to merge the StringTie GTF with the reference using cuffmerge and gffcompare. Neither gives the result I expected (a merged file with data on exons, CDS, UTR...).
Here is a sample of the reference file I used (where there was data on UTR, CDS...):
> cat ref_olig2.gtf
chr1 ncbiRefSeq transcript 106522741 106524545 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; gene_name "OLIG2";
chr1 ncbiRefSeq exon 106522741 106522781 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "1"; exon_id "NM_001031526.1.1"; gene_name "OLIG2";
chr1 ncbiRefSeq 5UTR 106522741 106522781 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "1"; exon_id "NM_001031526.1.1"; gene_name "OLIG2";
chr1 ncbiRefSeq exon 106523018 106524545 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1 ncbiRefSeq 5UTR 106523018 106523036 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1 ncbiRefSeq CDS 106523037 106523930 . + 0 gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1 ncbiRefSeq 3UTR 106523934 106524545 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1 ncbiRefSeq start_codon 106523037 106523039 . + 0 gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1 ncbiRefSeq stop_codon 106523931 106523933 . + 0 gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
Here is the same region in the new GTF (to make it simple, I chose a region that has not been modified by StringTie):
> cat stringtie_olig2.gtf
chr1 ncbiRefSeq transcript 106522741 106524545 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; gene_name "OLIG2"; ref_gene_id "OLIG2";
chr1 ncbiRefSeq exon 106522741 106522781 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "1"; gene_name "OLIG2";
chr1 ncbiRefSeq exon 106523018 106524545 . + . gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; gene_name "OLIG2";
Result of gffcompare (only transcripts and exons):
> gffcompare stringtie_olig2.gtf ref_olig2.gtf
> cat gffcmp.combined.gtf
chr1 ncbiRefSeq transcript 106522741 106524545 . + . transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; gene_name "OLIG2"; oId "NM_001031526.1"; tss_id "TSS1";
chr1 ncbiRefSeq exon 106522741 106522781 . + . transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "1";
chr1 ncbiRefSeq exon 106523018 106524545 . + . transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "2";
Result of cuffmerge (only exons):
> cuffmerge -g ref_olig2.gtf list_cuffmerge.txt
> cat list_cuffmerge.txt
stringtie_olig2.gtf
> cat merged_asm/merged.gtf
chr1 Cufflinks exon 106522741 106522781 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "OLIG2"; oId "NM_001031526.1"; nearest_ref "NM_001031526.1"; class_code "="; tss_id "TSS1";
chr1 Cufflinks exon 106523018 106524545 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "OLIG2"; oId "NM_001031526.1"; nearest_ref "NM_001031526.1"; class_code "="; tss_id "TSS1";
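To make the goal concrete, here is a minimal Python sketch of the kind of merge I have in mind. It is only an illustration under several assumptions, not how any existing tool works: both files are tab-separated GTF, a transcript counts as "not modified" when its transcript_id exists in the reference with exactly the same exon coordinates, and the function names (parse_gtf, exon_chains, restore_features) are hypothetical. For each unmodified transcript it copies every reference record that is not a transcript or exon (CDS, UTR, start/stop codon) behind the StringTie records.

```python
import re
from collections import defaultdict

# Feature types that the StringTie GTF already contains.
KEEP = {"transcript", "exon"}


def parse_gtf(lines):
    """Yield (fields, transcript_id) for each 9-column GTF record.

    Comments, blank lines and records without a transcript_id
    (for example gene lines) are skipped in this sketch.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            yield fields, match.group(1)


def exon_chains(records):
    """Map transcript_id -> sorted list of (chrom, strand, start, end) exons."""
    chains = defaultdict(list)
    for fields, tid in records:
        if fields[2] == "exon":
            chains[tid].append(
                (fields[0], fields[6], int(fields[3]), int(fields[4]))
            )
    return {tid: sorted(exons) for tid, exons in chains.items()}


def restore_features(ref_lines, new_lines):
    """Return the new GTF lines, with reference CDS/UTR/codon records
    appended to every transcript whose exon structure is unchanged."""
    ref = list(parse_gtf(ref_lines))
    new = list(parse_gtf(new_lines))
    ref_chains = exon_chains(ref)
    new_chains = exon_chains(new)

    # Assumption: same transcript_id + identical exons means "not modified".
    unchanged = {
        tid for tid, chain in new_chains.items()
        if ref_chains.get(tid) == chain
    }

    # Collect the reference records that the new GTF lacks.
    extra = defaultdict(list)
    for fields, tid in ref:
        if tid in unchanged and fields[2] not in KEEP:
            extra[tid].append(fields)

    # Group the new records per transcript, preserving file order.
    grouped = {}
    for fields, tid in new:
        grouped.setdefault(tid, []).append(fields)

    out = []
    for tid, records in grouped.items():
        out.extend("\t".join(f) for f in records)
        out.extend("\t".join(f) for f in extra.get(tid, []))
    return out
```

With the files above, this could be called as restore_features(open("ref_olig2.gtf"), open("stringtie_olig2.gtf")) and the returned lines written to a new file. Novel or modified transcripts are passed through untouched, so they would still lack CDS and UTR records.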
Thanks a lot, it works fine with the AGAT tool! Sorry for the delay in replying.