Why 50bp Illumina run produces 51bp long sequencs?
2
4
Entering edit mode
9.2 years ago
Tori ▴ 90

Why 50bp Illumina run produces 51bp reads? Do I have to remove one extra base before starting the analysis?

illumina next-gen • 7.1k views
ADD COMMENT
9
Entering edit mode
9.2 years ago

I highly recommend removing the extra base. In all of my tests, the extra base (101 on a 100bp run, 151 on a 150bp run, etc) had an extremely high error rate, typically 5x that of the previous base - and the quality score of the last base is not very accurate, so quality-trimming will not reliably remove it - and fastQC will not reliably indicate whether or not it needs to be trimmed, either. For that reason, I put a feature in BBDuk, forcetrimmodulo, just to remove that last base. For example:

bbduk.sh in=reads.fq out=trimmed.fq forcetrimmodulo=5

That trims all of the reads so that the length is equal to zero modulo 5, meaning that 51bp reads will be trimmed to 50bp; 76 will be trimmed to 75; 101 will be trimmed to 100; etc. If the reads are already (for example) 100bp long, then nothing will happen. In the processing order internal to bbduk, this forcetrimmodulo (or ftm for short) trimming happens before adapter trimming (if that is enabled), which is important because trimming the last extra base makes it easier to identify adapter sequence due to the lower error rate. As a result I always enable ftm when doing adapter-trimming.

I've posted various graphs showing the huge drop in quality for the extra base, but you can generate them yourself if you want, like this:

bbmap.sh in=reads.fq ref=reference.fa mhist=mhist.txt qist=qhist.txt reads=10m

Then plot the mhist and qhist to see what the true accuracy of the bases is, by location.

Edit: Actually, I should have just followed Daniel's link :)

ADD COMMENT
2
Entering edit mode

reviving an old post to quickly add a "For future reference/users" message:

If your data has been processed with bcf2fastq (with adapter trimming activated, which for some reason seems to be quite common recently) , you will get reads of all different lengths so not only eg 151.

To compensate this (and for me personally the more general approach) I prefer to use the ftr=<read-length> in stead of ftm=5 . The latter will for instance also trim back reads of 78 to 75 and then you unnecessarily trim those back as well. Using ftr will trim back to the specified length but leave the shorter ones untouched.

ADD REPLY
0
Entering edit mode

Why is this +1 cycle even necessary?

ADD REPLY
7
Entering edit mode
9.2 years ago
User 59 13k

It depends on the base mask used. See: http://seqanswers.com/forums/showthread.php?t=61997 for a good, but brief discussion on why you might want to drop the n+1th base

ADD COMMENT

Login before adding your answer.

Traffic: 2786 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6