Question

featureCounts exon counting problem

1

Entering edit mode

9.5 years ago

Nicolas Rosewick 11k

Hi,

I try to count the number of reads at the exon level using featurecounts. When I try this on a bam file aligned using STAR:

featureCounts -f -C -s 1 -p -t exon -g gene_id -a ensembl.gtf -o exon_count.txt align.bam

it gives me strange results. For example in IGV I can clearly see read alignment on BRCC3 but 0 read count for every BRCC3 exons in featureCounts output.. Is my command correct ?

exon featurecounts • 13k views

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 9.5 years ago by Nicolas Rosewick 11k

2

Entering edit mode

I wonder if the fact that there are duplicate exon entries is what's causing this. Try adding the -R option and redirect stdout to a file to track how alignments are getting treated.

ADD REPLY • link 9.5 years ago by Devon Ryan 105k

Ram · Answer 1 · 2015-12-07

Hey NicoBxl,

I'm also doing per exon counts using featureCounts. This is the command I ran

featureCounts -T 14  -a my_gtf_file.gft  -o my_output.txt -p -s 2 -g exon_id  my_bam_file.bam

featureCounts v1.4.6-p2

And it worked for me.. (at least I think it did) I then manually collapsed all exons into genes and compared the counts with featureCounts per gene (default) and (looking by eye) gene counts from collapsed exons counts had slightly fewer reads compare to default featureCounts counts, but we reasoned that if the read is ambiguous (for example spans exon-intron junction) featureCounts discards it, therefore less reads when counting per exon.

From my understanding:

-t option is for counting against your particular feature of interest e.g three_prime_utr. That option really sets the coordinates rage. "Is your read in that coordinates range..?" Most often we are interested in "Is my read in the exon coordinates range..?" hence the default

-g option is the crucial one, it sets your bins. "How do you want to bin your reads..?" In exons please.., but you are saying bin them into genes.

You can grab any line from GTF file (I'm only going to speak for GTF file) and in every line you'll have gene_id attribute, in most you'll have transcript_id and in some you'll have exon_id. Obviously exons go into transcript and transcripts go into gene and so you can only group reads into the same level or higher, but not lower. In other words if you specify -t transcript you can only group by -g transcript_id or -g gene_id (default) but you cannot group by -g exon_id. On the other hand if you specify -t exon (default) you can group by -g exon_id or -g transcript_id or -g gene_id. And with three_prime_utr they only have transcript_id and gene_id attributes (make sense, because utr belongs to the transcript).

-f option (I think, haven't tested it out yet) is for some internal featureCounts use. The meta-feature term is something featureCounts people came up themselves. From the docs

We define a meta-feature to be a set of features representing a biological construct of interest. For example, features often correspond to exons and meta-features to genes.

I didn't use it when I was counting reads per exon, but I'm pretty sure now that I should have. In my case I left -f off so featureCounts must have counted reads per meta-feature i.e gene and then some how broken them into exon_id groups.. I guess if I'd have specified -f if would have counted right from the star.. Although I'm a little confused about -f option, because given that you want to group by exons it wouldn't make sense to summarise against genes.. would it..? I'd like to more about -f option myself.

Cheers,

score 3 · Answer 2 · 2016-07-22

Hi,

I know I'm a bit late to answer, but I had the same/similar problem, largely answered by Devon's comment:

I wonder if the fact that there are duplicate exon entries is what's causing this. Try adding the -R option and redirect stdout to a file to track how alignments are getting treated.

did you try using the -O option ?

Assign reads to all their overlapping meta-features (or features if -f is specified).

In my case, this option restored ~71.6% assigned, compared to 27.8% assigned without the option.

Just showing the relevant options:

Overlap exons, assign genes: /home/kruealbr/bin/featureCounts -s 1 -T 12 -J --primary --ignoreDup -p
- 69.7% fragments assigned
Overlap exons, assign exons : /home/kruealbr/bin/featureCounts -f -s 1 -T 12 -J --primary --ignoreDup -p
- 27.8%
Overlap exons, assign all exons: /home/kruealbr/bin/featureCounts -f -O -s 1 -T 12 -J -G /work/kruealbr/genomes/grch37/84/fasta/Homo_sapiens.GRCh37.dna.primary_assembly.fa --primary --ignoreDup -p
- 71.6%

Of course, given the description, one has to be careful how the count data is then processed, as each read may be counted multiple times.

Best wishes, Kevin