Why 50bp Illumina run produces 51bp reads? Do I have to remove one extra base before starting the analysis?
Why 50bp Illumina run produces 51bp reads? Do I have to remove one extra base before starting the analysis?
I highly recommend removing the extra base. In all of my tests, the extra base (101 on a 100bp run, 151 on a 150bp run, etc) had an extremely high error rate, typically 5x that of the previous base - and the quality score of the last base is not very accurate, so quality-trimming will not reliably remove it - and fastQC will not reliably indicate whether or not it needs to be trimmed, either. For that reason, I put a feature in BBDuk, forcetrimmodulo
, just to remove that last base. For example:
bbduk.sh in=reads.fq out=trimmed.fq forcetrimmodulo=5
That trims all of the reads so that the length is equal to zero modulo 5, meaning that 51bp reads will be trimmed to 50bp; 76 will be trimmed to 75; 101 will be trimmed to 100; etc. If the reads are already (for example) 100bp long, then nothing will happen. In the processing order internal to bbduk, this forcetrimmodulo
(or ftm for short) trimming happens before adapter trimming (if that is enabled), which is important because trimming the last extra base makes it easier to identify adapter sequence due to the lower error rate. As a result I always enable ftm when doing adapter-trimming.
I've posted various graphs showing the huge drop in quality for the extra base, but you can generate them yourself if you want, like this:
bbmap.sh in=reads.fq ref=reference.fa mhist=mhist.txt qist=qhist.txt reads=10m
Then plot the mhist and qhist to see what the true accuracy of the bases is, by location.
Edit: Actually, I should have just followed Daniel's link :)
It depends on the base mask used. See: http://seqanswers.com/forums/showthread.php?t=61997 for a good, but brief discussion on why you might want to drop the n+1th base
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
reviving an old post to quickly add a "For future reference/users" message:
If your data has been processed with bcf2fastq (with adapter trimming activated, which for some reason seems to be quite common recently) , you will get reads of all different lengths so not only eg 151.
To compensate this (and for me personally the more general approach) I prefer to use the
ftr=<read-length>
in stead offtm=5
. The latter will for instance also trim back reads of 78 to 75 and then you unnecessarily trim those back as well. Usingftr
will trim back to the specified length but leave the shorter ones untouched.Why is this +1 cycle even necessary?