GFF3 to GTF conversion - 9th column
0
0
Entering edit mode
8.9 years ago

Hello

I am trying to convert a .gff file (version 3) downloaded from NCBI to a .gtf file, but I am having some trouble. I expected the 9th column of the outputted .gtf files to be "gene_id", but I can't get it to work.

After trying many times, I was recommended to use the .gtf files provided in the iGenomes website.

However, even though it shows me that "gene_id" is in the 9th column (see below), when I try using it in featureCounts, it shows me a warning message:

This is how the .gtf file from iGenomes looks like:

$ head genes.gtf
1    Gnomon    exon    33312    35627    .    +    .    gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855";
1    Gnomon    CDS    34375    35558    .    +    0    gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855";
1    Gnomon    exon    122904    125014    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    124853    125014    .    -    0    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    135820    136001    .    -    2    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    exon    135820    136001    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    137158    137264    .    -    1    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    exon    137158    137264    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    137834    137959    .    -    1    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    exon    137834    137959    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";

Here is the warning outputted by featureCounts:

./featureCounts \
  -T 1 \
  -a ~/RNASeq/Bos_taurus/NCBI/UMD_3.1.1/Annotation/Genes/genes.gtf -s 2 -g gene_id \
  -t exon \
  -o charolais_${name}.featCounts.txt \
  ~/RNASeq/RNASeq_STAR/NCBI_UMD3.1.1/charolais_${name}_aligned_Aligned.sortedByCoord.out.bam

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
    v1.4.6-p5

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S /home/undergrad1/RNASeq/RNASeq_STAR/NCBI_U ... ||
||                                                                            ||
||             Output file : charolais_145A_TGACCA.featCounts.txt             ||
||             Annotations : /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_ ... ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : inversed                                         ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||       Read orientations : fr                                               ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_3.1.1 ... ||

Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
The attributes included in your GTF annotation are 'gene_name "CFAP20"; p_id "P19047"; transcript_id "NM_001003905.1"; tss_id "TSS12705";'

It seems like the columns are off by one. Does anyone know a proper way to convert a .gff file to a .gtf file or a way I can fix this gtf file from iGenomes?

I've tried the gffread tool from cuffsuite, and some scripts I found online (such as in here http://blog.nextgenetics.net/?e=27) but none of them seemed to work properly. Any help is appreciated!

gtf NCBI gff3 featureCounts • 9.0k views
ADD COMMENT
1
Entering edit mode

Your GTF file doesn't conform to the definition: here they write that gene_id and transcript_id must be at the start of the 9th column: "Any other attributes or comments must appear after these two and will be ignored." By the way, this is also written in the blog post you linked to.

ADD REPLY
0
Entering edit mode

Thank you!

I realized that, however I do not know how to change it.

I expected the file downloaded from iGenomes above to be right, but apparently it is not.

Any ideas on how to edit the gtf file without messing it up?

Thanks!

ADD REPLY
4
Entering edit mode

With a perl one-liner like this:

perl -ne 'chomp; @a=split/\t/; %h=split(/ /,$a[8]); $a[8]=join(" ",("gene_id",$h{"gene_id"},"transcript_id",$h{"transcript_id"})); print join("\t",@a),"\n";' iGenome.gtf > corrected.gtf

Note that this keeps only the gene_id and transcript_id attributes. Adapt to your needs.

ADD REPLY
0
Entering edit mode

Thank you so much, Jean! Your code was really helpful!

ADD REPLY

Login before adding your answer.

Traffic: 1944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6