Recently I thought I was finally settling on a pipeline for human & mouse RNA-Seq using STAR for alignment, Gencode as an annotation source, FeatureCounts for read counting, and RSEM for transcript quantification. However, just yesterday we noticed that a common marker gene, SOX10, was showing up as having zero counts in all human samples but was well above zero when looking at the RSEM TPM, and definitely had mapped reads (which were not multimapped) when looking at the BAM files in IGV and some of the reads in the SAM files. It turns out that the Gencode annotation for hg38 (at least v21 and v22, "Comprehensive gene annotation" GTF) has an overlapping putative not-well-annotated feature that overlaps the entire length of the SOX10 gene, preventing reads from being counted to SOX10 when using default settings in HTSeq-Count and FeatureCounts. This is kind of scary, since we don't know how many other "real" genes this might be happening to.
What are some suggestions for how to do deal with this? I imagine default settings for the counting programs have a very good reason for being set to exclude counting ambiguous reads by default, so I'm on the fence about just enabling counting of overlapping features and running with it. I could also take a subset of the GTF and remove this one overlapping feature that's causing a problem, or remove all the "overlapping features" that don't appear to be well supported.
Thanks!
Do you not have a stranded dataset? The only really overlapping entry is on the opposite strand.
Not stranded to my knowledge, for a while it cost more so our core did non-stranded libraries as standard. I'll double check on that. It seems then that a better practice would be to use a "sparser" annotation set than Gencode when using a non-strand-specific library?
For this particular case, the UCSC annotation would work a bit better. Having said that, UCSC annotations are typically problematic and should be avoided. The long-term solution is to use stranded datasets, the cost difference is really small these days. In the short term, the only reliable way around losing counts in cases like this is with an EM algorithm like RSEM or eXpress and then summing the expected counts across transcripts in a gene. You'd then use those in limma::voom or edgeR (DESeq2 won't accept them).
Thanks, I was going to try to avoid remapping but I guess I might have to. Running infer_experiment from reseqc to try to check on the strand specificity (or lack thereof).
Edit:
Settles that one.
This would be an easier choice if we hadn't already spent a lot of time with the existing differential expression results.
For what it's worth, this sort of thing should only lead to false-negatives and not an appreciable number of false-positives.* I know it's still not ideal, but at least the time/resources you've spent haven't really been wasted.
*In theory at least, but we don't live in a world of spherical cows...