Hi,
I came across an RNA-seq analysis, and the trimming step makes me wonder a bit. We would usually trim the first 10-15 base pairs, given the graphic. However, with this trimming, RNA STAR uniquely maps only ~60% of the reads. If we skip trimming the first 10-15 base pairs, more than 90% of the reads are uniquely mapped! Even if we trim just 1 base pair, the fraction of uniquely mapped reads drops to 64%. Can someone explain to me why removing even 1 base pair changes the result that much? Given how mappers work, with k-mers and suffix arrays, I am surprised that it makes such a difference.
Also, why do we see such a non-uniform distribution in the first 10-15 base pairs of the reads? What makes them so special that they are actually very important?
Thanks a lot!
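For anyone who wants to look at the non-uniform start of the reads without the graphic, here is a minimal Python sketch that tabulates the base fractions at each of the first few read positions. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function name `base_composition` and the toy reads are made up for illustration and are not from the data in this thread.

```python
from collections import Counter
import io


def base_composition(fastq_handle, n_positions=15):
    """Return per-position base fractions for the first n_positions of each read.

    Assumes 4-line FASTQ records, so the sequence is every 2nd line of 4.
    """
    counts = [Counter() for _ in range(n_positions)]
    for i, line in enumerate(fastq_handle):
        if i % 4 != 1:  # skip header, '+' and quality lines
            continue
        seq = line.strip().upper()
        for pos, base in enumerate(seq[:n_positions]):
            counts[pos][base] += 1

    fractions = []
    for c in counts:
        total = sum(c.values())
        # Empty dict if no read was long enough to cover this position
        fractions.append({b: c[b] / total for b in "ACGTN"} if total else {})
    return fractions


# Toy input (hypothetical reads, only to show the call)
toy_fastq = io.StringIO(
    "@r1\nACGTAC\n+\nIIIIII\n"
    "@r2\nACCTAG\n+\nIIIIII\n"
)
comp = base_composition(toy_fastq, n_positions=3)
for pos, frac in enumerate(comp, start=1):
    print(pos, frac)
```

On a real file you would pass an open file handle (for example from `gzip.open(path, "rt")` for compressed FASTQ) instead of the toy `StringIO`.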
It could be helpful if you provided the STAR output summary (e.g. after trimming, what do these reads then get classified as?).
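To compare the summaries side by side, a rough Python sketch like the one below may help. It assumes the usual STAR `Log.final.out` layout, where each statistic is a `name | value` line; the helper names `parse_star_log` and `diff_stats` are hypothetical, and the numbers in the example are placeholders, not results from this thread.

```python
def parse_star_log(text):
    """Parse the 'name | value' lines of a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:  # section headers and blank lines have no '|'
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def diff_stats(a, b):
    """Return the entries whose values differ between two parsed logs."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


# PLACEHOLDER values, only to show the call; not taken from this thread
untrimmed = parse_star_log(
    "UNIQUE READS:\n"
    "        Uniquely mapped reads % |\t11.11%\n"
    "  % of reads unmapped: too short |\t22.22%\n"
)
trimmed = parse_star_log(
    "UNIQUE READS:\n"
    "        Uniquely mapped reads % |\t33.33%\n"
    "  % of reads unmapped: too short |\t22.22%\n"
)
for name, (before, after) in diff_stats(untrimmed, trimmed).items():
    print(f"{name}: {before} -> {after}")
```

Seeing which category the lost unique reads move into (multi-mapped, unmapped as too short, and so on) is usually more informative than the unique-mapping percentage alone.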
See my other comment, but here are the full statistics as well. STAR output when removing no base pairs with fastp (-f 0 and -F 0):
STAR output when removing 1 base pair with fastp (-f 1 and -F 1):
STAR output when removing 12 base pairs with fastp (-f 12 and -F 12):
TopHat2 output with no base pairs removed:
TopHat2 output with 12 base pairs removed: