Question

FeatureCounts output does not contain all the genes in the annotation file

1

Entering edit mode

4.5 years ago

devbt15 ▴ 30

Dear all, I use Subread in Bash for mapping and FeatureCounts for count quantification for my RNASeq data. I have used the same pipeline for other RNASeq datasets in the past and FeatureCounts gives count output for all the genes in the annotation file.

Recently I performed RNASeq 150 bp paired-ends reads in rice (Illumina). However, this time, after running featureCounts the count output file contained only 24000 genes out of total 42000 genes in the annotation file. I cannot understand why is it so? Does it suggest that mapping failed for the 18000 ignored genes? Does it suggest PCR amplification-based duplicated reads?

Can someone explain why is it so?

Regards.

FeatureCounts RNASeq Rice Bash • 5.2k views

ADD COMMENT • link updated 24 months ago by ATpoint 85k • written 4.5 years ago by devbt15 ▴ 30

1

Entering edit mode

Hi, Did you finally find out why it is so? I am running into the same problem with my own data and I can't find where this comes from...

ADD REPLY • link 4.4 years ago by viDub ▴ 10

0

Entering edit mode

do you mean that those genes are completely missing from the output or that they have 0 count ?

ADD REPLY • link 4.5 years ago by lieven.sterck 15k

1

Entering edit mode

Thank you for the response. They are completely missing from the output. I also ran HTSeq-counts and it also did not provide the count data for all the genes.

ADD REPLY • link 4.5 years ago by devbt15 ▴ 30

0

Entering edit mode

Please provide code.

ADD REPLY • link 4.5 years ago by ATpoint 85k

0

Entering edit mode

Here it is:

Subread index: subread-buildindex -F -o subreadindex ricegen.fa

Annotation file: gffread -E riceann.gff3 -T -o riceann.gtf

Subread mapping: subread-align -a riceann.gtf -F GTF -T 8 --sortReadsByCoordinates --multiMapping -i subreadindex -t 0 -r Q1read1.fq.gz -R Q1read2.fq.gz -o Q1.bam

FeatureCounts: featureCounts -p -T 8 -G ricegen.fa -t exon -g gene_id -a riceann.gtf -F GTF -M -O --fraction -o STRcounts.txt Q1.bam Q2.bam Q3.bam V1.bam V2.bam V3.bam

HTSeq-counts: htseq-count -t exon -i gene_id -f bam -m intersection-nonempty Q1.bam Q2.bam Q3.bam V1.bam V2.bam V3.bam riceann.gtf > result_file.txt

ricegen.fa and riceann.gtf are the genome and annotation files used.

ADD REPLY • link 4.5 years ago by devbt15 ▴ 30

1

Entering edit mode

Did you check the chromosome names in reference vs GTF?

ADD REPLY • link 4.5 years ago by cpad0112 21k

0

Entering edit mode

Yes, the names are the same. Anyways it would have been highlighted while running subread if it was not the same.

ADD REPLY • link 4.5 years ago by devbt15 ▴ 30

0

Entering edit mode

is RNAseq library prep strand specific?

ADD REPLY • link 4.5 years ago by cpad0112 21k

0

Entering edit mode

Yes, it is strand-specific.

ADD REPLY • link 4.5 years ago by devbt15 ▴ 30

0

Entering edit mode

How is read coverage for missing genes per exon (for eg take one and check)? Please also post featurecounts summary, if possible.

ADD REPLY • link 4.5 years ago by cpad0112 21k

0

Entering edit mode

Thanks, I will check the bigwig on the genome browser and let you know.

ADD REPLY • link 4.5 years ago by devbt15 ▴ 30

0

Entering edit mode

Here is the FeatureCounts output screenshot image for the 6 bam files as input in the code above: https://ibb.co/PFf9WxQ

I also checked the read coverage for gene IDs missing from the FeatureCounts output vs annotation file. Here is the screenshot image: https://ibb.co/SmD3cK2

ADD REPLY • link 4.5 years ago by devbt15 ▴ 30

0

Entering edit mode

Can you also post the snapshot of the gtf file?

ADD REPLY • link 4.5 years ago by Hyper_Odin ▴ 320

0

Entering edit mode

Please find it here

https://ibb.co/KK1kwb3

ADD REPLY • link 4.5 years ago by devbt15 ▴ 30

0

Entering edit mode

Hi viDub, Unfortunately not. Since I have little time left for this work, I shifted my analysis from SubRead to Salmon based reference transcriptome analysis where I do not need the FeatureCounts or HTSeqCount. I am sorry, but I cannot help you anymore at this point in time. Regards.

ADD REPLY • link 4.4 years ago by devbt15 ▴ 30

score 1 · Answer 1 · 2021-03-08

1

Entering edit mode

3.7 years ago

el.dansch ▴ 10

Did you check if the missing genes have a CDS annotated, or do they maybe have missing gene_ids (or whatever you specified to count) in the GTF file?

ADD COMMENT • link 3.7 years ago by el.dansch ▴ 10

score 0 · Answer 2 · 2022-11-25

0

Entering edit mode

2.0 years ago

Beáta • 0

I ran into the same problem. Apparently the missing genes are all single-exon genes, represented in the gtf file by a single row, as "gene" only, no exons. Since featurecounts in default mode specifies "exon" as the feature for counting, these genes were left out of the counts file. If the feature is specified as "gene", these "no-exon-specified" genes are also counted (metafeature can be left as "gene_id"). I would recommend counting in two steps: first, by default setting (exon as feature), second, by "gene" for feature. Extract the counts for the "no-exon-specified" genes from the second counts file, and concatenate with the first counts file.

ADD COMMENT • link 2.0 years ago by Beáta • 0

0

Entering edit mode

Not immediately convinced this is a valid approach.

in my book, counting on 'gene' will also include reads that map in introns for instance (because the gene is from start gene to stop gene). As a consequence you will over-count, as typically you do not want to include reads in introns as expression signal.

In any way it's clear this is a GFF/GTF issue as even those single exon genes should have an exon (and CDS) annotated & included in the GFF file

Perhaps try to run something like AGAT on it , it may be able fix this.

ADD REPLY • link 24 months ago by lieven.sterck 15k

0

Entering edit mode

That is a poor GTF file then...

ADD REPLY • link 24 months ago by ATpoint 85k