I have a standard RNA-seq dataset (125bp PE Illumina) from a model organism. I am only doing adapter trimming and no quality trimming, since the quality is excellent all the way through. The trimming software has an option to set the minimum read length to keep. I was wondering what a good length would be, and why.
My thoughts are along these lines.
Set min length around 10-12: Would it help to keep short non-coding RNAs, if at all? I use ribosome depletion and not polyA capture.
Set min length around 60: Might reduce mapping time and potentially reduce multiple mapping of very short reads.
Set min length close to max length, i.e. around 100 to 120: Depending on the sequence length distribution after trimming, I could potentially lose a lot of reads. Would it help further downstream DGE analysis to keep the read length distribution in a tighter range?
I could be wrong about all of these, so feel free to correct me. Suggestions for a good min length are also welcome.
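To weigh the three options against each other, it may help to count how many reads each cutoff would actually keep. The following is only a minimal sketch: the function name `fraction_retained` and the toy length list are made up for illustration, and in practice the lengths would come from the trimmed FASTQ files.

```python
def fraction_retained(lengths, min_length):
    """Fraction of reads whose post-trimming length is >= min_length."""
    if not lengths:
        return 0.0
    kept = sum(1 for n in lengths if n >= min_length)
    return kept / len(lengths)

# Hypothetical post-trimming read lengths (toy data, not from a real run).
trimmed_lengths = [126] * 90 + [110] * 4 + [78] * 3 + [40] * 2 + [8] * 1

# Candidate cutoffs taken from the three options discussed above.
for cutoff in (12, 60, 100):
    kept = fraction_retained(trimmed_lengths, cutoff)
    print(f"min length {cutoff}: {kept:.0%} of reads kept")
```

With real data, the same tally shows directly how much a cutoff near the maximum length would cost compared with a cutoff of 12 or 60.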
Shouldn't 125bp PE Illumina give you exactly 125 base reads every time? Anything less is a technical artifact or chemical problem. When my fragments are shorter than the read, the read continues into adapter sequence which must be trimmed, ultimately resulting in shorter reads, but that's got nothing to do with filtering by size up front. EDIT: your first figure shows this phenomenon, where some reads have adapter sequence on the end. So trimming everything to 80 would take care of that, at a loss of good data from the majority of the reads. Depending on your usage this may or may not be acceptable.
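The read-through effect described here comes down to simple arithmetic, sketched below. The helper `split_read` and the fragment sizes are hypothetical, chosen only to illustrate the point; 126 is the raw read length reported in this thread.

```python
def split_read(read_len, fragment_len):
    """Return (insert_bases, adapter_bases) for one read of a fragment.

    insert_bases:  bases that come from the fragment itself
    adapter_bases: bases from the adapter start onward (removed by trimming)
    """
    insert_bases = min(read_len, fragment_len)
    adapter_bases = read_len - insert_bases
    return insert_bases, adapter_bases

READ_LEN = 126  # raw read length reported in this thread

for fragment_len in (300, 126, 77, 30):  # hypothetical fragment sizes
    insert_bases, adapter_bases = split_read(READ_LEN, fragment_len)
    print(f"fragment {fragment_len}: {insert_bases} insert bases, "
          f"{adapter_bases} bases to trim")
```

Only fragments shorter than the read length produce anything to trim, and the trimmed read can then be no longer than the fragment itself.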
Actually, all my raw reads are 126bp. The read length distribution I mentioned was after trimming. Trimming off the 3' adapter results in reads of varying lengths. I can then choose to discard reads below a certain length. My question is what this length should be.
Looking at the first figure, it would seem that the adapters start from base 78, so after trimming I should have reads of at least 78 bases. But in reality I get all sorts of lengths. The plot probably samples only a small subset of reads. Also, the y-axis is a percentage, and 1-2% is hard to see. Even 1% of 30 million reads is quite a few. Perhaps I should also mention that I am not doing a hard clip at any position. The software compares the reads against the adapter sequences that I provide and removes only the part of the read that matches, which is why the resulting lengths are variable.
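To show why matching-based trimming gives variable lengths while a hard clip does not, here is a rough sketch. It is not how any particular trimming tool works: it uses exact matching only, whereas real trimmers usually tolerate mismatches. The function name, the `min_overlap` parameter, the toy reads and the adapter string (the commonly used Illumina adapter prefix, assumed here since the actual adapters are not given) are all assumptions for illustration.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix, not taken from this thread

def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter by exact matching and return the trimmed read."""
    # Case 1: the full adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter fits at the end of the read.
    # Try the longest adapter prefix first, down to min_overlap bases.
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # Case 3: no adapter found, keep the read as it is.
    return read

# Toy reads (hypothetical): full adapter, partial adapter, no adapter.
reads = [
    "ACGTACGTAC" + ADAPTER + "TTTT",
    "ACGTACGTACGGCC" + ADAPTER[:5],
    "ACGTACGTACGTTTTT",
]

trimmed = [trim_3prime_adapter(r, ADAPTER) for r in reads]
print([len(t) for t in trimmed])  # lengths differ from read to read

# A minimum-length filter is then a separate step applied after trimming.
MIN_LENGTH = 12  # hypothetical cutoff
kept = [t for t in trimmed if len(t) >= MIN_LENGTH]
print(len(kept), "of", len(trimmed), "reads kept")
```

A hard clip would instead cut every read at the same position regardless of content, so all reads would end up the same length and no length filter would be needed.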