How to recover 5' UTR, CDS, and start codon features after GTF merging?
1
0
Entering edit mode
4.7 years ago
nlehmann ▴ 150

Hi all,

I built a new annotation file from long-read data with StringTie. When I opened the resulting GTF file, I noticed that all features except exons and transcripts had disappeared, so all the information on 5' and 3' UTRs, CDS, and start and stop codons is lost.
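
A quick way to confirm which feature types survive is to tally the third column of each GTF. This is only a minimal Python sketch; the function name `count_feature_types` is made up for illustration, and the two inline lines stand in for a real file.

```python
from collections import Counter

def count_feature_types(gtf_lines):
    """Count how often each feature type (3rd GTF column) occurs."""
    counts = Counter()
    for line in gtf_lines:
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        # Split the first 8 columns on any whitespace; the 9th column
        # (attributes) contains spaces and is kept whole.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts

# Illustrative input only; with a real file, pass the open file handle instead.
example = [
    'chr1\tncbiRefSeq\ttranscript\t106522741\t106524545\t.\t+\t.\tgene_id "OLIG2"; transcript_id "NM_001031526.1";',
    'chr1\tncbiRefSeq\texon\t106522741\t106522781\t.\t+\t.\tgene_id "OLIG2"; transcript_id "NM_001031526.1";',
]
print(count_feature_types(example))
```

Running it on the reference and on the StringTie output shows at a glance which feature types are missing from the latter.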

I wonder whether merging the StringTie GTF with the reference annotation would be a good way to recover these features (at least for the genes that StringTie has not modified). Do you know of any tool that could do that? I tried merging them with cuffmerge and with gffcompare. Neither gives the result I expected (a merged file with exon, CDS, UTR... features).

Here is a sample of the reference file I used (which does contain UTR, CDS, and codon features):

> cat ref_olig2.gtf
chr1    ncbiRefSeq  transcript  106522741   106524545   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1";  gene_name "OLIG2";
chr1    ncbiRefSeq  exon    106522741   106522781   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "1"; exon_id "NM_001031526.1.1"; gene_name "OLIG2";
chr1    ncbiRefSeq  5UTR    106522741   106522781   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "1"; exon_id "NM_001031526.1.1"; gene_name "OLIG2";
chr1    ncbiRefSeq  exon    106523018   106524545   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1    ncbiRefSeq  5UTR    106523018   106523036   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1    ncbiRefSeq  CDS 106523037   106523930   .   +   0   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1    ncbiRefSeq  3UTR    106523934   106524545   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1    ncbiRefSeq  start_codon 106523037   106523039   .   +   0   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";
chr1    ncbiRefSeq  stop_codon  106523931   106523933   .   +   0   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; exon_id "NM_001031526.1.2"; gene_name "OLIG2";

Here is the same region in the new GTF (to keep it simple, I chose a region that StringTie did not modify):

> cat stringtie_olig2.gtf
chr1    ncbiRefSeq  transcript  106522741   106524545   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; gene_name "OLIG2"; ref_gene_id "OLIG2";
chr1    ncbiRefSeq  exon    106522741   106522781   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "1"; gene_name "OLIG2";
chr1    ncbiRefSeq  exon    106523018   106524545   .   +   .   gene_id "OLIG2"; transcript_id "NM_001031526.1"; exon_number "2"; gene_name "OLIG2";

Result of gffcompare (only transcript and exons):

> gffcompare stringtie_olig2.gtf ref_olig2.gtf
> cat gffcmp.combined.gtf
chr1    ncbiRefSeq  transcript  106522741   106524545   .   +   .   transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; gene_name "OLIG2"; oId "NM_001031526.1"; tss_id "TSS1";
chr1    ncbiRefSeq  exon    106522741   106522781   .   +   .   transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "1";
chr1    ncbiRefSeq  exon    106523018   106524545   .   +   .   transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "2";

Result of cuffmerge (only exons):

> cuffmerge -g ref_olig2.gtf list_cuffmerge.txt
> cat list_cuffmerge.txt
stringtie_olig2.gtf 
> cat merged_asm/merged.gtf
chr1    Cufflinks   exon    106522741   106522781   .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "OLIG2"; oId "NM_001031526.1"; nearest_ref "NM_001031526.1"; class_code "="; tss_id "TSS1";
chr1    Cufflinks   exon    106523018   106524545   .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "OLIG2"; oId "NM_001031526.1"; nearest_ref "NM_001031526.1"; class_code "="; tss_id "TSS1";
4.7 years ago
Juke34 8.9k

You could try ‘agat_sp_merge_annotations.pl’ from AGAT.
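
For the transcripts that StringTie left untouched, the idea can also be sketched by hand. The following is a minimal Python sketch, not what AGAT does internally. It assumes the StringTie output keeps the reference `transcript_id` (as in the OLIG2 sample in the question) and copies UTR, CDS and codon lines from the reference only when the exon coordinates match exactly. The function names (`parse_gtf`, `exon_chains`, `restore_features`) and the toy coordinates are made up for illustration.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def parse_gtf(lines):
    """Yield (fields, transcript_id) for every data line with a transcript_id."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First 8 columns split on whitespace, attributes kept whole.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            yield fields, match.group(1)

def exon_chains(records):
    """Map transcript_id -> sorted list of (chrom, strand, start, end) exons."""
    chains = defaultdict(list)
    for fields, tid in records:
        if fields[2] == "exon":
            chains[tid].append((fields[0], fields[6], int(fields[3]), int(fields[4])))
    return {tid: sorted(exons) for tid, exons in chains.items()}

def restore_features(ref_lines, new_lines):
    """Return the new GTF lines, with reference sub-features (UTR, CDS, codons)
    appended to each transcript whose exon structure equals the reference."""
    ref = list(parse_gtf(ref_lines))
    new = list(parse_gtf(new_lines))
    ref_chains = exon_chains(ref)
    unchanged = {tid for tid, chain in exon_chains(new).items()
                 if ref_chains.get(tid) == chain}

    # Collect everything except transcript/exon lines from the reference.
    extra = defaultdict(list)
    for fields, tid in ref:
        if tid in unchanged and fields[2] not in ("transcript", "exon"):
            extra[tid].append("\t".join(fields))

    # Group new lines per transcript (first-seen order), then append extras.
    grouped = {}
    for fields, tid in new:
        grouped.setdefault(tid, []).append("\t".join(fields))
    out = []
    for tid, lines in grouped.items():
        out.extend(lines)
        out.extend(extra.get(tid, []))
    return out

# Toy example with made-up coordinates.
ref = [
    'chr1\tref\ttranscript\t100\t500\t.\t+\t.\tgene_id "G"; transcript_id "T1";',
    'chr1\tref\texon\t100\t500\t.\t+\t.\tgene_id "G"; transcript_id "T1";',
    'chr1\tref\tCDS\t150\t400\t.\t+\t0\tgene_id "G"; transcript_id "T1";',
]
new = ref[:2]  # same transcript and exon, but no CDS line
for line in restore_features(ref, new):
    print(line)
```

Transcripts whose exon structure differs from the reference are left with exons only, since the reference CDS and UTR coordinates may no longer apply to them.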


Thanks a lot, it works fine with the AGAT tool! Sorry for the delay in replying.
