Hello, I'm trying to run featurecounts on my .bam files, but the resulting file yields only 0s in every row and column. Here are the steps I have taken so far:
- (de novo) Assembled 40 transcripts from RNASeq data
- Used bowtie2 to align my transcripts against a reference + generate .bam files from the fastq files. Note: The reference transcriptome does not come from the same organism, but a closely related one.
Ran featurecounts with the command:
featureCounts -a ~/NAS/new_canna_out/cannabis_reference/cannabis_ref.gff3 -g gene_id -T 10 -o ~/NAS/new_canna_out/sp_fc *.bam
This gives me the following error:
ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'Parent=transcript:evm.model.01.1;Name=evm.model.01.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=evm.model.01.1-E1;rank=1'
To counter this, I have tried giving the following command instead:
featureCounts -a ~/NAS/new_canna_out/cannabis_reference/cannabis_ref.gff3 -g gene_id -t exon_id -T 10 -o ~/NAS/new_canna_out/sp_fc *.bam
as well as:
featureCounts -a ~/NAS/new_canna_out/cannabis_reference/cannabis_ref.gff3 -g Name -T 10 -o ~/NAS/new_canna_out/sp_fc *.bam
The first option (using -t exon_id) gives the error that no features were loaded in the format GTF. The second (using -g Name) does produce a result, but with only 0s
I have checked that the reference and annotation contain the same gene IDs. I expected the -t flag to accept exon IDs (as seen in the featurecounts guide) but that doesn't work either.
What could the problem be? Thanks!
Show us a few lines from
cannabis_ref.gff3
. As error says your file likely containsParent=transcript:evm.model.01.1;Name=evm.model.01.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=evm.model.01.1-E1;rank=1'
which deos not include the fieldgene_id
.In fact only
Name
(orParent
depending what that is pointing to) may be useful.Thanks for your response! Actually, yes, some lines contain a gene_id and some contain an exon_id, as seen in this example:
I have tried running featureCounts with -g Name, but that is what yielded a result file with all 0s.
You can try using
-g ID
. That should generate counts. Since you haveexon
entries in your GTF hopefully they are tied to parentgene
entries otherwise you may get counts that span entiregene
region which will include introns.No luck, -g ID just yielded the same error, stating that featurecounts failed to find the gene identifier attirbute in the 9th column of the provided GTF file.
You can try following:
-g ID
and-t ID
? or-g Parent
and-t exon
Since you posted 3 lines of gtf, it is difficult to check other features.
Thanks for your response! I tried both options 2 and 3. Running it with -g, -t ID gives me the 'no features were loaded' error. -g Parent, -t exon gives me the same output file with no assigned alignments.
Why the backup of the gtf file, btw?
I forgot to type the third option. It seems
gene_id
field is in your gtf, but not present at required place to parse. You can try reshuffling gene_id information to place at the correct position. For that reason, I suggested to take a back up of gtf.In addition, check if gtf has same chromosome/contig information as reference (for eg chr1 vs 1) and strand information (you can get this from either core or inferexperiments.py from RSeqC). Please load the bam into IGV or any visualizer and check the coverage at place/gene of interest.
Btw, where did you get the gtf from?
I see! Yes, I'll try to make a copy of the gff with the ID replaced with gene_id. I've checked the annotation with the reference transcriptome, and they have matching gene IDs.
And I got the gff from ENSEMBL, here: http://ftp.ensemblgenomes.org/pub/plants/release-51/
I was talking chromosome notation in reference genome (genome.fa) and gtf (first column). I hope you checked if they are same. Please do not change ID just like that. There are other features starting with ID in gtf.