Hi all,
I've been browsing the literature for RNA-Seq QC recommendations and have largely concluded that I should avoid read trimming and/or quality filtering, or else I risk introducing bias into transcript expression estimates, especially as my data is pretty good. However, there doesn't seem to be a clear consensus on a good minimum read length, beyond the point that overly short reads can cause spurious alignments. How would you define an overly short read? My original read length is 100 bp, so would a minimum of 50 bp be a good length? I do have a few shorter reads from adaptor removal and PhiX filtering, and while I don't think there are many of them, I want to make sure I'm minimizing any introduced bias.
Thank you!
I am not sure that is the consensus.
I've read in a few papers, such as Williams et al. (2016) and MacManes (2014), that doing anything more than a gentle trim could introduce bias into gene expression estimates, although this can be mitigated somewhat with read length filtering. My data is generally good, but I will be working with data from a range of sources, so I am still on the fence about whether to do a gentle quality trim.
The Williams paper uses a Q40 cutoff and a minimum length of 1. Although technically possible, I would not consider those cutoffs reasonable.
The MacManes paper removed 25% of the dataset with trimming, which is fairly aggressive. However, adapter trimming was performed regardless of the PHRED cutoff.
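If it helps to see how the two knobs discussed here interact, below is a minimal sketch of a 3' quality trim followed by a minimum-length filter. It is not the method from either paper or from any particular trimming tool. The function names are hypothetical, the defaults (PHRED 5 and 50 bp) are only illustrative values, and it assumes Phred+33 quality encoding with reads already parsed into (name, sequence, quality) tuples.

```python
def trim_3prime(seq, qual, min_q=5, offset=33):
    """Remove bases from the 3' end while their PHRED score is below min_q.

    Assumes Phred+33 encoding (offset=33); min_q=5 is an illustrative default.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=5, min_len=50):
    """Yield (name, seq, qual) for reads still at least min_len long after trimming."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual


def fraction_kept(records, min_q=5, min_len=50):
    """Fraction of reads surviving the trim plus length filter (0.0 if no reads)."""
    records = list(records)
    if not records:
        return 0.0
    kept = sum(1 for _ in filter_reads(records, min_q, min_len))
    return kept / len(records)
```

Running `fraction_kept` over a sample of reads with a few different `min_q` and `min_len` settings is a quick way to see how much of a dataset a given combination would discard before committing to it.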