FeatureCounts output does not contain all the genes in the annotation file
2
1
Entering edit mode
4.6 years ago
devbt15 ▴ 30

Dear all, I use Subread in Bash for mapping and FeatureCounts for count quantification for my RNASeq data. I have used the same pipeline for other RNASeq datasets in the past and FeatureCounts gives count output for all the genes in the annotation file.

Recently I performed RNASeq 150 bp paired-ends reads in rice (Illumina). However, this time, after running featureCounts the count output file contained only 24000 genes out of total 42000 genes in the annotation file. I cannot understand why is it so? Does it suggest that mapping failed for the 18000 ignored genes? Does it suggest PCR amplification-based duplicated reads?

Can someone explain why is it so?

Regards.

FeatureCounts RNASeq Rice Bash • 5.4k views
ADD COMMENT
1
Entering edit mode

Hi, Did you finally find out why it is so? I am running into the same problem with my own data and I can't find where this comes from...

ADD REPLY
0
Entering edit mode

do you mean that those genes are completely missing from the output or that they have 0 count ?

ADD REPLY
1
Entering edit mode

Thank you for the response. They are completely missing from the output. I also ran HTSeq-counts and it also did not provide the count data for all the genes.

ADD REPLY
0
Entering edit mode

Please provide code.

ADD REPLY
0
Entering edit mode

Here it is:

Subread index: subread-buildindex -F -o subreadindex ricegen.fa

Annotation file: gffread -E riceann.gff3 -T -o riceann.gtf

Subread mapping: subread-align -a riceann.gtf -F GTF -T 8 --sortReadsByCoordinates --multiMapping -i subreadindex -t 0 -r Q1read1.fq.gz -R Q1read2.fq.gz -o Q1.bam

FeatureCounts: featureCounts -p -T 8 -G ricegen.fa -t exon -g gene_id -a riceann.gtf -F GTF -M -O --fraction -o STRcounts.txt Q1.bam Q2.bam Q3.bam V1.bam V2.bam V3.bam

HTSeq-counts: htseq-count -t exon -i gene_id -f bam -m intersection-nonempty Q1.bam Q2.bam Q3.bam V1.bam V2.bam V3.bam riceann.gtf > result_file.txt

ricegen.fa and riceann.gtf are the genome and annotation files used.

ADD REPLY
1
Entering edit mode

Did you check the chromosome names in reference vs GTF?

ADD REPLY
0
Entering edit mode

Yes, the names are the same. Anyways it would have been highlighted while running subread if it was not the same.

ADD REPLY
0
Entering edit mode

is RNAseq library prep strand specific?

ADD REPLY
0
Entering edit mode

Yes, it is strand-specific.

ADD REPLY
0
Entering edit mode

How is read coverage for missing genes per exon (for eg take one and check)? Please also post featurecounts summary, if possible.

ADD REPLY
0
Entering edit mode

Thanks, I will check the bigwig on the genome browser and let you know.

ADD REPLY
0
Entering edit mode

Here is the FeatureCounts output screenshot image for the 6 bam files as input in the code above: https://ibb.co/PFf9WxQ

I also checked the read coverage for gene IDs missing from the FeatureCounts output vs annotation file. Here is the screenshot image: https://ibb.co/SmD3cK2

ADD REPLY
0
Entering edit mode

Can you also post the snapshot of the gtf file?

ADD REPLY
0
Entering edit mode

Please find it here

https://ibb.co/KK1kwb3

ADD REPLY
0
Entering edit mode

Hi viDub, Unfortunately not. Since I have little time left for this work, I shifted my analysis from SubRead to Salmon based reference transcriptome analysis where I do not need the FeatureCounts or HTSeqCount. I am sorry, but I cannot help you anymore at this point in time. Regards.

ADD REPLY
1
Entering edit mode
3.9 years ago
el.dansch ▴ 10

Did you check if the missing genes have a CDS annotated, or do they maybe have missing gene_ids (or whatever you specified to count) in the GTF file?

ADD COMMENT
0
Entering edit mode
2.2 years ago
Beáta • 0

I ran into the same problem. Apparently the missing genes are all single-exon genes, represented in the gtf file by a single row, as "gene" only, no exons. Since featurecounts in default mode specifies "exon" as the feature for counting, these genes were left out of the counts file. If the feature is specified as "gene", these "no-exon-specified" genes are also counted (metafeature can be left as "gene_id"). I would recommend counting in two steps: first, by default setting (exon as feature), second, by "gene" for feature. Extract the counts for the "no-exon-specified" genes from the second counts file, and concatenate with the first counts file.

ADD COMMENT
0
Entering edit mode

Not immediately convinced this is a valid approach.

in my book, counting on 'gene' will also include reads that map in introns for instance (because the gene is from start gene to stop gene). As a consequence you will over-count, as typically you do not want to include reads in introns as expression signal.

In any way it's clear this is a GFF/GTF issue as even those single exon genes should have an exon (and CDS) annotated & included in the GFF file

Perhaps try to run something like AGAT on it , it may be able fix this.

ADD REPLY
0
Entering edit mode

That is a poor GTF file then...

ADD REPLY

Login before adding your answer.

Traffic: 3871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6