Hi everyone, I am programming an RNA-Seq Pipeline in R to process raw read data. I need to use multiple reference genomes from different organisms. For the mapping step I use STAR and everything worked great until the counting step with featureCounts (Rsubread). For one bacterial genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/009/445/GCA_000009445.1_ASM944v1) featureCounts is not able to assign any reads.
When I look at the count-matrix I see that there are no gene names, so I guess the problem is that featureCounts is not able to read the annotation file correctly.
I have tried to convert the .gff annotation file into a .gtf file with gffread and to use the .gtf file for counting, but I still do not get any assigned reads. (Also the .gtf file is much smaller (795 kB) than the .gff file (2.2 MB) and I don't know whether I lose important information here.)
As already mentioned in this post the problem may be the origin (NCBI) of the genome files and the ability of featureCounts to process them, but I do not use the build-in annotation ("mm10").
What do you guys think? Do I have to use a different program for the counting step for bacterial genomes?
tmp2 <- featureCounts(filesToCount, annot.ext = annGenbac, isGTFAnnotationFile = TRUE, genome = refGenbac, isPairedEnd = TRUE)
with .gff file:
Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
The attributes included in your GTF annotation are 'ID=id1;Parent=rna0;Note=ileT_1%2C len: 74 nt. First of two copies equivalent to ileT%2C len: 74 nt%2C from Mycobacterium tuberculosis strain H37Rv%2C (100.0%25 identity in 74 nt overlap). tRNA-Ile%2C anticodon gat;anticodon=(pos:10921..10923);gbkey=tRNA;gene=tRNA-Ile;product=tRNA-Ile'
|| Features : 52 ||
|| Meta-features : 1 ||
|| Chromosomes/contigs : 1 ||
|| ||
|| Loading FASTA contigs : /data/mathias/dualRNA-Seq/M_tuber/Genomes/Myco ... ||
|| 1 contigs were loaded ||
|| ||
|| Process BAM file /data/mathias/dualRNA-Seq/M_tuber/Output/Read_filteri ... ||
|| Single-end reads are included. ||
|| Assign reads to features... ||
|| Total reads : 160572 ||
|| Successfully assigned reads : 0 (0.0%) ||
|| Running time : 0.01 minutes
with .gtf file:
Features : 52 ||
|| Meta-features : 52 ||
|| Chromosomes/contigs : 1 ||
|| ||
|| Loading FASTA contigs : /data/mathias/dualRNA-Seq/M_tuber/Genomes/Myco ... ||
|| 1 contigs were loaded ||
|| ||
|| Process BAM file /data/mathias/dualRNA-Seq/M_tuber/Output/Read_filteri ... ||
|| Single-end reads are included. ||
|| Assign reads to features... ||
|| Total reads : 160572 ||
|| Successfully assigned reads : 0 (0.0%) ||
|| Running time : 0.00 minutes
Thank you for your answer! Yes, I have tried to change the
-g
option (GTF.featureType
inR
) to "gene", "Name" and "ID". I still do not get any assigned reads. The headers of my alignment files (.bam) look like that:Do you mean I have to change "
ENA|AM408590|AM408590.1
" to "AM408590.1
"?That is correct. That identifier has to match exactly the one in annotation file. You can edit the header of the SAM file (if that is what you are using to count).
Yes, that was the problem!! With changed BAM file headers...
...I finally got some assigned reads.
Thank you very much for your help!