ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
4.7 years ago

Hi,

I am trying to use featureCounts to analyse my RNA-seq data from Apis mellifera. My command and the error are as follows.

/softwares/subread-2.0.0-source/bin/featureCounts \
  -T 16 \
  -p \
  -s 1 \
  -a /home/axel/arumoyc/alignment/GCF_003254395.2_Amel_HAv3.1_genomic.gtf \
  -t exon \
  -g gene_id \
  -o /home/axel/arumoyc/counts_all/all/count24.txt \
  /home/axel/arumoyc/bamfiles_test/bamfiles/bamfile24/map24Aligned.sorted.out.bam \
  2> /home/axel/arumoyc/counts_all/all/count24.screen-output.log

ERROR:

failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'gene_id'. An example of attributes included in your GTF annotation is 'gene_id ""; transcript_id "unknown_transcript_1"; anticodon "(pos:31..33)"; gbkey "tRNA"; product "tRNA-Glu"; exon_number "1";'. The program has to terminate.

The .gtf file was downloaded from NCBI and was not manipulated. Please help me on this error. Thanks in advance.

Have you looked at the contents of your GTF file? With the -g option you're telling featureCounts which attribute to use as the gene identifier. Most likely your GTF file is either missing the identifier you provided or is using a different name for it.
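One quick way to look is to tally the attribute names that actually occur in column 9. The following is only a rough sketch: the function name list_attr_keys is made up for illustration, and it splits column 9 on every ";", so a quoted value that itself contains a semicolon would be miscounted.

```shell
# list_attr_keys FILE [FEATURE]
# Count how often each attribute key occurs in column 9, restricted to one
# feature type in column 3 (default: exon). Hypothetical helper, not part of Subread.
list_attr_keys() {
    awk -F '\t' -v feature="${2:-exon}" '
        /^#/ || $3 != feature { next }       # skip header comments and other feature types
        {
            n = split($9, attrs, ";")        # simplification: assumes no ";" inside quoted values
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                sub(/^[ \t]+/, "", a)        # trim leading whitespace
                if (a == "") continue
                split(a, kv, /[ =]/)         # key ends at the first space (GTF) or "=" (GFF3)
                count[kv[1]]++
            }
        }
        END { for (k in count) print count[k] "\t" k }
    ' "$1" | sort -k1,1nr -k2,2
}
```

Calling list_attr_keys on your annotation file should show whether gene_id is present on every exon record. It does not show whether the value is empty, so a key that is present everywhere can still hide a problem.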

I see that "gene_id" is present. I also converted the .gtf file into an Excel table, and the 9th column does indeed contain "gene_id". Each column of the .gtf file has 335791 entries.

#gtf-version 2.2
#!genome-build Amel_HAv3.1
#!genome-build-accession NCBI_Assembly:GCF_003254395.2
#!annotation-source NCBI Apis mellifera Annotation Release 104
NC_037638.1     Gnomon  gene    9273    12174   .       -       .       gene_id "LOC551580"; db_xref "BEEBASE:GB42195"; db_xref "GeneID:551580"; gbkey "Gene"; gene "LOC551580"; gene_biotype "protein_coding";
NC_037638.1     Gnomon  exon    11812   12174   .       -       .       gene_id "LOC551580"; transcript_id "XR_001705491.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 57 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X3"; exon_number "1";
NC_037638.1     Gnomon  exon    11054   11121   .       -       .       gene_id "LOC551580"; transcript_id "XR_001705491.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 57 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X3"; exon_number "2";
NC_037638.1     Gnomon  exon    10913   10994   .       -       .       gene_id "LOC551580"; transcript_id "XR_001705491.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 57 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X3"; exon_number "3";
NC_037638.1     Gnomon  exon    9779    9827    .       -       .       gene_id "LOC551580"; transcript_id "XR_001705491.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 57 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X3"; exon_number "4";
NC_037638.1     Gnomon  exon    9274    9546    .       -       .       gene_id "LOC551580"; transcript_id "XR_001705491.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 57 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X3"; exon_number "5";
NC_037638.1     Gnomon  exon    11579   12174   .       -       .       gene_id "LOC551580"; transcript_id "XR_001705490.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 65 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X2"; exon_number "1";
NC_037638.1     Gnomon  exon    11054   11121   .       -       .       gene_id "LOC551580"; transcript_id "XR_001705490.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 65 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X2"; exon_number "2";
NC_037638.1     Gnomon  exon    10913   10994   .       -       .       gene_id "LOC551580"; transcript_id "XR_001705490.2"; db_xref "GeneID:551580"; gbkey "misc_RNA"; gene "LOC551580"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 65 samples with support for all annotated introns"; product "ubiquitin-related modifier 1, transcript variant X2"; exon_number "3";

I seem to be having the same issue with a .gtf file downloaded from NCBI: GCF_003339765.1_Mmul_10_genomic.gtf

$ head GCF_003339765.1_Mmul_10_genomic.gtf 

#gtf-version 2.2
#!genome-build Mmul_10
#!genome-build-accession NCBI_Assembly:GCF_003339765.1
#!annotation-source NCBI Macaca mulatta Annotation Release 103
NC_041754.1     Gnomon  gene    8796    27366   .       -       .       gene_id "PGBD2"; db_xref "GeneID:114678393"; gbkey "Gene"; gene "PGBD2"; gene_biotype "protein_coding";
NC_041754.1     Gnomon  exon    26570   27366   .       -       .       gene_id "PGBD2"; transcript_id "XM_028848769.1"; db_xref "GeneID:114678393"; gbkey "mRNA"; gene "PGBD2"; model_evidence "Supporting evidence includes similarity to: 36 ESTs, 2 Proteins, 5 long SRA reads, and 89% coverage of the annotated genomic feature by RNAseq alignments, including 69 samples with support for all annotated introns"; product "piggyBac transposable element derived 2"; exon_number "1";
NC_041754.1     Gnomon  exon    13491   13554   .       -       .       gene_id "PGBD2"; transcript_id "XM_028848769.1"; db_xref "GeneID:114678393"; gbkey "mRNA"; gene "PGBD2"; model_evidence "Supporting evidence includes similarity to: 36 ESTs, 2 Proteins, 5 long SRA reads, and 89% coverage of the annotated genomic feature by RNAseq alignments, including 69 samples with support for all annotated introns"; product "piggyBac transposable element derived 2"; exon_number "2";
NC_041754.1     Gnomon  exon    8796    10763   .       -       .       gene_id "PGBD2"; transcript_id "XM_028848769.1"; db_xref "GeneID:114678393"; gbkey "mRNA"; gene "PGBD2"; model_evidence "Supporting evidence includes similarity to: 36 ESTs, 2 Proteins, 5 long SRA reads, and 89% coverage of the annotated genomic feature by RNAseq alignments, including 69 samples with support for all annotated introns"; product "piggyBac transposable element derived 2"; exon_number "3";
NC_041754.1     Gnomon  CDS     13491   13507   .       -       0       gene_id "PGBD2"; transcript_id "XM_028848769.1"; db_xref "GeneID:114678393"; gbkey "CDS"; gene "PGBD2"; product "piggyBac transposable element-derived protein 2"; protein_id "XP_028704602.1"; exon_number "2";
NC_041754.1     Gnomon  CDS     9005    10763   .       -       1       gene_id "PGBD2"; transcript_id "XM_028848769.1"; db_xref "GeneID:114678393"; gbkey "CDS"; gene "PGBD2"; product "piggyBac transposable element-derived protein 2"; protein_id "XP_028704602.1"; exon_number "3";

I am aware from this guidance (https://mblab.wustl.edu/GTF2.html) and this Biostars question (GFF3 to GTF conversion - 9th column) that gene_id and transcript_id must be at the start of the 9th column. I tried to correct my .gtf file based on the recommendations in the Biostars post, but this hasn't solved the issue.

I have another annotation downloaded from Ensembl and it runs fine however it's not the annotation I want to use.

Further clarification on this would be greatly appreciated. Thank you for your time reading this.

I was having the same error message. I tried to manipulate my gtf file in many ways. I read someone suggesting to go to an older version of subread and that worked for me. We happened to have subread/1.5.1. Good luck.

You can try AGAT; it might fix your problem.

4.2 years ago
Chris S. ▴ 340

featureCounts does not allow empty values in the gene_id field, so you need to remove or update them. See this answer from the developer https://groups.google.com/g/subread/c/xs7mw38Bc6g.

grep 'gene_id ""' GCF_003254395.2_Amel_HAv3.1_genomic.gtf 
grep -v 'gene_id ""' GCF_003254395.2_Amel_HAv3.1_genomic.gtf > Amel_HAv3.1_FIXED.gtf
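Before dropping those lines, it may be worth checking which kinds of records you would lose. A minimal sketch, assuming a tab-separated GTF with the feature type in column 3; the function name count_empty_gene_id is made up for illustration.

```shell
# count_empty_gene_id FILE
# Tally records whose gene_id value is empty (gene_id ""), grouped by the
# feature type in column 3. Hypothetical helper, not part of Subread.
count_empty_gene_id() {
    awk -F '\t' '
        !/^#/ && $9 ~ /gene_id ""/ { n[$3]++ }   # same pattern as the grep above
        END { for (f in n) print f "\t" n[f] }
    ' "$1" | sort
}
```

If the output lists only a handful of exon or CDS records, removing them with grep -v as above should cost little. If it lists many, updating the empty values is probably the safer route.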

Thanks, Chris S. It seemed to work for me. Since I am not well versed in grep, could you please explain to me (us) how these commands fix the GTF file? And where could we learn more about handling this type of issue? Thanks a lot.

3.4 years ago
rependo ▴ 40

To anyone finding this thread after getting a related "gene_id" error in featureCounts or subread-align, I had this same issue but it was not resolved by clearing empty values following "gene_id" field in column nine (my gtf did not have empty values).

I was, however, able to resolve the issue and get subread to function (subread-align) by removing "exon_number" fields from column 9 that were present in addition to "transcript_id" and "gene_id". So if you're like me and still trying to get your .gtf to play nicely with subread, I'd suggest looking for anything in column 9 that isn't transcript_id or gene_id, then removing it.
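For reference, a rough sketch of one way to do that with awk. It assumes a tab-separated GTF whose values are double-quoted, and it keeps only the first gene_id and transcript_id it finds in column 9. The function name strip_extra_attrs is made up for illustration.

```shell
# strip_extra_attrs FILE
# Rewrite column 9 so that it holds only gene_id and transcript_id.
# Hypothetical helper, not part of Subread; output goes to stdout.
strip_extra_attrs() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }                 # pass header comments through unchanged
        {
            keep = ""
            if (match($9, /gene_id "[^"]*"/))
                keep = substr($9, RSTART, RLENGTH) ";"
            if (match($9, /transcript_id "[^"]*"/))
                keep = keep (keep == "" ? "" : " ") substr($9, RSTART, RLENGTH) ";"
            $9 = keep                        # all other attributes are dropped
            print
        }
    ' "$1"
}
```

Redirect the output to a new file rather than overwriting the original annotation, since the dropped attributes (gene names, products and so on) cannot be recovered afterwards.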

THANK YOU. This is the only thing that worked for me. For anyone else who may need it, I used the command

sed 's/\(gene_id "[^"]*\).*/\1"/' original_annotation.gtf > fixed_new_annotation.gtf

to remove everything that followed the gene_id, and now featureCounts runs for me.

I confirm that this fixed the issue also for me. Remove everything besides gene_id and transcript_id

3.4 years ago
onkar ▴ 10

The best way for me was converting the file into SAF format http://bioinf.wehi.edu.au/featureCounts/

GeneID  Chr Start   End Strand
497097  chr1    3204563 3207049 -
497097  chr1    3411783 3411982 +
497097  chr1    3660633 3661579 -
..

After two days of struggling with this, I found that it was happening because the 9th column didn't have the word "GeneID" in the gene ID section. featureCounts is very particular about the word and the format.

I converted the file into SAF format and hurray!! it ran perfectly. I converted to reflect only gene features, you can use according to your requirement.

grep 'gene' annot.gff | cut -d ';' -f1 | tr -d ' ' | sed 's/ID=//g' | awk -v OFS='\t' '{print $9,$1,$4,$5,$7}' > annot.gff.SAF
featureCounts -T 20 -F SAF -a annot.gff.SAF -o FeatureCounts.out 1.bam 2.bam
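A slightly stricter variant of the same conversion, offered only as a sketch. grep 'gene' matches any line that mentions "gene" anywhere, including mRNA and exon records that carry a gene= attribute, so this version filters on column 3 instead. It assumes an NCBI-style GFF3 in which ID= is the first attribute in column 9, and it writes the header row shown in the SAF example above. The function name gff_genes_to_saf is made up for illustration.

```shell
# gff_genes_to_saf FILE
# Print one SAF row per "gene" feature of a GFF3 file, preceded by the SAF header.
# Hypothetical helper, not part of Subread; assumes ID= is the first attribute.
gff_genes_to_saf() {
    awk 'BEGIN { FS = OFS = "\t"; print "GeneID", "Chr", "Start", "End", "Strand" }
        /^#/ || $3 != "gene" { next }        # keep gene features only
        {
            id = $9
            sub(/;.*/, "", id)               # keep the first attribute only
            sub(/^ID=/, "", id)              # drop the "ID=" prefix
            print id, $1, $4, $5, $7
        }
    ' "$1"
}
```

The result can be redirected to a file and passed to featureCounts with -F SAF, as in the command above. To count over exons rather than whole genes you would need to filter on exon records and take the gene identifier from a different attribute, which this sketch does not attempt.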
10 months ago
BioinfGuru ★ 2.1k

Edit: Problem solved: RNA-seq - Creating SAF from NCBI gff for Subread featureCounts - keep 'gene' or 'exon'

I just ran into this issue, and I think the answer by onkar is the one that most helps in understanding the underlying problem. According to the Subread featureCounts documentation, the annotation file should be either GTF (not NCBI GFF!!) or SAF (as in the example provided by onkar).

The expected 9th column GTF format is as follows:

gene_id "Em:U62317.C22.6.mRNA"; transcript_id "Em:U62317.C22.6.mRNA"; exon_number 1

It seems featureCounts accepts only either:

a) GTF: column 9 contains 'gene_id' ANYWHERE in the column (not just at the start). The following example column 9 format from biostars RNA-seq by example works fine:

  • gene_name=AAA-750000-UP-4; gene_id=AAA-750000-UP-4; transcript_id=AAA-750000-UP-4-T; exon_number=1;

b) SAF: column 1 is GeneID

Clearly column 9 from my NCBI GFF was never going to work:

ID=geneLOC100125545;Dbxref=GeneID:100125545;Name=LOC100125545;gbkey=Gene;gene=LOC100125545;gene_biotype=protein_coding

Other possible solutions:

a) Use the featureCounts -g option to tell featureCounts explicitly to look for "Dbxref=GeneID:" instead of "gene_id". I don't know whether some variation of that will work, but if it does, it avoids having to write a complex one-off fix to parse an NCBI GFF, which is just a nuisance. (Edit: I tried -g Dbxref, which got the program to run, but the counts.txt output needs a lot of parsing.)

b) Just get the annotation GTF from Ensembl which has the required format as suggested by @hannepainter. (edit: best option)
