Question

Annotation problem while using Rsubread

1

Entering edit mode

6.0 years ago

gokberk ▴ 90

Hi all,

I've been trying to read some bam files using Rsubread. As for the annotation file, I converted a gff file to gtf using gffread. Here is how my gtf file looks like:

NC_012670.1 RefSeq  exon    543 614 .   +   .   transcript_id "rna76575";
NC_012670.1 RefSeq  exon    615 1562    .   +   .   transcript_id "rna76576";
NC_012670.1 RefSeq  exon    1563    1631    .   +   .   transcript_id "rna76577";
NC_012670.1 RefSeq  exon    1632    3191    .   +   .   transcript_id "rna76578";
NC_012670.1 RefSeq  exon    3192    3266    .   +   .   transcript_id "rna76579";
NC_012670.1 RefSeq  CDS 3269    4223    .   +   0   transcript_id "gene33355"; gene_id "gene33355"; gene_name "ND1";
NC_012670.1 RefSeq  exon    4224    4291    .   +   .   transcript_id "rna76580";
NC_012670.1 RefSeq  exon    4289    4360    .   -   .   transcript_id "rna76581";
NC_012670.1 RefSeq  exon    4362    4429    .   +   .   transcript_id "rna76582";
NC_012670.1 RefSeq  CDS 4430    5471    .   +   0   transcript_id "gene33356"; gene_id "gene33356"; gene_name "ND2";
NC_012670.1 RefSeq  exon    5472    5538    .   +   .   transcript_id "rna76583";
NC_012670.1 RefSeq  exon    5546    5614    .   -   .   transcript_id "rna76584";
NC_012670.1 RefSeq  exon    5616    5688    .   -   .   transcript_id "rna76585";
NC_012670.1 RefSeq  exon    5721    5782    .   -   .   transcript_id "rna76586";
NC_012670.1 RefSeq  exon    5783    5853    .   -   .   transcript_id "rna76587";
NC_012670.1 RefSeq  CDS 5858    7396    .   +   0   transcript_id "gene33357"; gene_id "gene33357"; gene_name "COX1";
NC_012670.1 RefSeq  exon    7398    7469    .   -   .   transcript_id "rna76588";
NC_012670.1 RefSeq  exon    7471    7538    .   +   .   transcript_id "rna76589";
NC_012670.1 RefSeq  CDS 7540    8223    .   +   0   transcript_id "gene33358"; gene_id "gene33358"; gene_name "COX2";
NC_012670.1 RefSeq  exon    8297    8362    .   +   .   transcript_id "rna76590";
NC_012670.1 RefSeq  CDS 8364    8570    .   +   0   transcript_id "gene33359"; gene_id "gene33359"; gene_name "ATP8";
NC_012670.1 RefSeq  CDS 8525    9205    .   +   0   transcript_id "gene33360"; gene_id "gene33360"; gene_name "ATP6";
NC_012670.1 RefSeq  CDS 9205    9988    .   +   0   transcript_id "gene33361"; gene_id "gene33361"; gene_name "COX3";

So, I run the following command in R:

rsubread.counts<-featureCounts(files=E08_bam,
                               annot.ext="/media/gokberk/DATA/Thesis/Data/Annotations/Macaca_fascicularis_5.0_genomic.gtf",
                               isGTFAnnotationFile = TRUE,
                               GTF.featureType="exon",
                               GTF.attrType="gene_id",
                               useMetaFeatures = TRUE,
                               allowMultiOverlap = FALSE)

And receive the following error:

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Macaca_fascicularis_5.0_genomic.gtf ...               ||

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id' 
An example of attributes included in your GTF annotation is 'transcript_id "id177006"; gene_name "LOC102131469";' 
The program has to terminate.

Error in featureCounts(files = E08_bam, annot.ext = "/media/gokberk/DATA/Thesis/Data/Annotations/Macaca_fascicularis_5.0_genomic.gtf",  : 
  No counts were generated.

I was wondering what exactly is the problem with my gtf file (apparently gene_id is missing for some rows, but is that it) and how could I fix it. I'd be more than happy if you could help me.

Cheers, Gökberk

annotation gtf gff • 2.5k views

ADD COMMENT • link 6.0 years ago by gokberk ▴ 90

1

Entering edit mode

Dear Goekberg, your GTF doesn't seem to have gene annotation at all - only CDS and exons (as seen in col 3). Besides the exons not having gene_ids, the exons don't match the CDS ranges at all. This seems very odd. In addition, exons and CDS are second or third level annotation - a gene should be at the lowest level. I'd suggest you review your gff file, either the conversion was wrong or the GFF is invalid in the first place.

ADD REPLY • link 6.0 years ago by Carambakaracho ★ 3.3k