I am somewhat new to RNA-seq data and have been using featureCounts from the Subread package to summarize reads/fragments across genomic features (genes, transcripts).
In particular, I am curious which parameter choice you use regarding overlap. By default, featureCounts does not count reads that overlap more than one feature. However, when summarizing at the isoform level (e.g. UCSCid), this default ignores all reads mapping to exons shared between isoforms and leads to very low read counts. At the isoform level it seems better to use the non-default option -O, which counts a read overlapping several features once for each of them. In a subsequent step one could choose the highest expressed isoform to represent a given gene.
At the gene level I think the default setting makes more sense. Here, however, reads are summed across all isoforms of a gene, which inflates the count relative to that of any "true" single isoform or RNA species.
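To make the difference concrete, here is a toy sketch in Python of the two counting rules at the isoform level. It is only a simplified model of the behavior described above, not featureCounts itself: the isoform names, exon coordinates, reads, and the `count_reads` helper are all made up for illustration.

```python
# Hypothetical toy annotation: two isoforms of one gene that share their first exon.
ISOFORM_EXONS = {
    "iso_A": [(100, 200), (300, 400)],
    "iso_B": [(100, 200), (500, 600)],
}

# Hypothetical reads as (start, end) intervals; the first three fall in the shared exon.
READS = [(110, 160), (120, 170), (150, 199), (310, 360), (510, 560), (520, 570)]


def overlaps(read, exon):
    """True if two closed intervals share at least one position."""
    return read[0] <= exon[1] and exon[0] <= read[1]


def count_reads(reads, isoform_exons, allow_multi_overlap):
    """Count reads per isoform.

    allow_multi_overlap=False mimics the default rule (skip reads hitting
    more than one isoform); True mimics the -O style rule (count the read
    once for every isoform it overlaps).
    """
    counts = {name: 0 for name in isoform_exons}
    for read in reads:
        hits = [
            name
            for name, exons in isoform_exons.items()
            if any(overlaps(read, exon) for exon in exons)
        ]
        if len(hits) == 1 or (allow_multi_overlap and hits):
            for name in hits:
                counts[name] += 1
    return counts


default_counts = count_reads(READS, ISOFORM_EXONS, allow_multi_overlap=False)
multi_counts = count_reads(READS, ISOFORM_EXONS, allow_multi_overlap=True)
print("default:", default_counts)
print("multi-overlap:", multi_counts)
```

In this toy case the default rule drops the three shared-exon reads entirely, while the multi-overlap rule counts each of them twice, so the per-isoform totals add up to more than the six reads that exist.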
So what option do you usually use?
- Summarize reads across isoforms -> choose highest expressed to represent gene
- Summarize reads across genes
I just wanted to get a feeling for what others are doing regarding this choice.
The only way to get meaningful counts for isoforms is with an expectation-maximization method (e.g., eXpress or Flux Capacitor). There's no way around that.
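For a feel of what expectation-maximization does with ambiguous reads, here is a minimal Python sketch. It is a toy under strong assumptions (every read is just a set of compatible isoforms, and transcript length and sequencing bias are ignored), not how any particular tool is implemented. The isoform names, the read sets, and the `em_isoform_counts` function are hypothetical.

```python
def em_isoform_counts(compat_sets, isoforms, n_iter=200):
    """Toy EM: split each read among its compatible isoforms.

    compat_sets: one set of isoform names per read.
    Returns expected (fractional) read counts per isoform.
    Assumption: no length or bias correction, which real tools may model.
    """
    # Start from equal relative abundances.
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    expected = {iso: 0.0 for iso in isoforms}
    for _ in range(n_iter):
        # E-step: assign each read fractionally, proportional to current abundances.
        expected = {iso: 0.0 for iso in isoforms}
        for compat in compat_sets:
            total = sum(theta[iso] for iso in compat)
            for iso in compat:
                expected[iso] += theta[iso] / total
        # M-step: re-estimate abundances from the expected counts.
        n_reads = sum(expected.values())
        theta = {iso: expected[iso] / n_reads for iso in isoforms}
    return expected


# Hypothetical data: 3 reads in a shared exon, 1 unique to iso_A, 2 unique to iso_B.
compat_sets = [{"iso_A", "iso_B"}] * 3 + [{"iso_A"}] + [{"iso_B"}] * 2
estimates = em_isoform_counts(compat_sets, ["iso_A", "iso_B"])
print(estimates)
```

The point of the sketch is that the shared-exon reads are neither thrown away nor counted twice: they are divided between the isoforms according to the evidence from the uniquely assignable reads, and the estimated counts still sum to the number of reads.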