I have downloaded fastq files from two completely unrelated projects that are using 16s amplicon sequencing.
Both have a weird read length distribution. Forward reads are mostly 301 bp long, as seen from awk 'NR%4==2{print length}' fw_1.fastq | sort | uniq -c | less
Reverse reads are mostly 300bp long, as seen from awk 'NR%4==2{print length}' rv_1.fastq | sort | uniq -c | less
Why would that be so? I want to analyse the datasets with DADA2 and FIGARO (FIGARO detects optimal trim parameters for DADA2) and FIGARO requires that all reads are of the same length. If I filter reads based on their length and decide to keep only reads and their pair that are, for instance, 300 bp, I will loose the vast majority of reads.
I am at a complete loss as to why would the forward and reverse reads be of mostly unequal length.
Hi,
I do not recall why the Illumina add one base at the end of the read but you could use the Force-trim modulo (
ftm
) of bbduk to remove the very last base. Here is an explanation of how this works:source
Thanks for that, it would be interesting to know why this happens, but if it's just a low quality base then I think I'll simply trim it as you say and move forward from there.
Force-trim modulo (ftm) will work if all reads are 300bp or 300bp, but if there are shorter reads (e.g. of 299bp), then ftm alone won't work, as a 299bp read will be trimmed to 295bp (ftm=5) or 290bp (ftm=10), and so on. Using the minlength (ml) parameter would help then:
ftm=10 ml=300
.Another option is to use the force-trim right parameter (zero-based), again in conjunction with ml:
ftr=299 ml=300
.Maybe it is worth opening an issue at the FIGARO repository. DADA2 doesn't have this read length limitation, it would make more sense if FIGARO didn't have too.
Thanks, I might try to do something similar with cutadapt. Actually they are already working for variable read length support in FIGARO, I think there was especially request for that feature to be able to run reads from ITS sequencing. I need to trim though for now, since it is unknown when that feature branch is going to be ready.