I have some metagenomic whole-genome sequencing paired-end data obtained on an Illumina NovaSeq machine with a fixed read length of 150 bp. The sequencing was done by a private company and the data came already trimmed and filtered, but it does not pass all of the FastQC tests, so I performed some additional cleaning steps myself. However, after all of this, I end up with some reads longer than 150 bp and now I am confused. Is this OK?
Here is what I have done, step by step:
Trim NEXTERA sequences with Trimmomatic
for FILE in $(ls *_R1.fastq | sed 's/_R1.fastq//'); do trimmomatic PE -phred33 ${FILE}_R1.fastq ${FILE}_R2.fastq path/${FILE}_trimmed_1.fastq path/${FILE}_unpaired_1.fastq path/${FILE}_trimmed_2.fastq path/${FILE}_unpaired_2.fastq ILLUMINACLIP:path/nextera.fa.txt:2:30:10; done
Remove overrepresented G-polymers with bbduk
for FILE in $(ls *_trimmed_1.fastq | sed 's/_trimmed_1.fastq//'); do bbduk.sh in1=${FILE}_trimmed_1.fastq in2=${FILE}_trimmed_2.fastq out1=${FILE}_bbduk_1.fastq out2=${FILE}_bbduk_2.fastq entropy=0.5 entropywindow=50 entropyk=5; done
Remove duplicate reads with seqkit
for FILE in $(ls *.fastq | sed 's/.fastq//'); do seqkit rmdup -s ${FILE}.fastq > ${FILE}.unique.fastq; done
Merge paired ends
for FILE in $(ls *_1.unique.fastq | sed 's/_1\.unique\.fastq$//'); do pear -j 4 -n 30 -f ${FILE}_1.unique.fastq -r ${FILE}_2.unique.fastq -o ${FILE}.merged.unique.fastq; done
Any help with theory/code/tools would be much appreciated.
Below is the plot of the read lengths.
There is no rule that says all FastQC tests need to pass before you can move on. The limits in FastQC are set for normal genomic sequence, and it is normal for one or more tests to fail. So always keep the context of your experiment in mind as you look at FastQC results. If your data is indeed trimmed and cleaned (make sure the reads are in sync across the R1/R2 files; if they are not, you will need to ask the company for the unprocessed data), go on with your metagenomic analysis.
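If you want to check the "in sync" part yourself, something like the following might do as a rough sketch. It assumes plain 4-line FASTQ records (no wrapped sequence lines), and `check_sync` is just a name I made up for the helper, not part of any tool. It walks both files in lockstep and compares read names, ignoring anything after the first whitespace and a trailing /1 or /2. Note that any step run on R1 and R2 separately, such as per-file deduplication, can drop a read from one file but not its mate, so it may be worth re-checking after each step.

```shell
# Sketch only: returns 0 if both FASTQ files contain the same read names
# in the same order, 1 otherwise. Assumes 4-line FASTQ records.
# check_sync is a hypothetical helper name.
check_sync() {
    awk -v mate="$2" '
        # Reduce a header line to the bare read name.
        function read_name(h) {
            sub(/[ \t].*$/, "", h)   # drop the comment after the first whitespace
            sub(/\/[12]$/, "", h)    # drop a trailing /1 or /2
            return h
        }
        {
            # Read the corresponding line from the mate file.
            if ((getline other < mate) <= 0) { bad = 1; exit }
            # Header lines are lines 1, 5, 9, ... of each file.
            if (NR % 4 == 1 && read_name($0) != read_name(other)) { bad = 1; exit }
        }
        END {
            # Leftover lines in the mate file also mean the pair is out of sync.
            if (!bad && (getline other < mate) > 0) bad = 1
            exit bad + 0
        }
    ' "$1"
}

# Usage (file names are placeholders):
#   check_sync sample_R1.fastq sample_R2.fastq && echo "in sync" || echo "NOT in sync"
```

For gzipped files you would need to decompress first; this sketch does not handle that.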
Instead of hopping around between different programs, much of what you are doing can be done within the BBTools suite.
You merge two 150 bp reads and are surprised they are longer than 150 bp? Do you understand what merging is?
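To put numbers on it: when two mates overlap, the merged read is as long as the original fragment, i.e. the two read lengths minus the overlap. A tiny sketch of that arithmetic is below; `merged_length` is a made-up helper and the overlap values are only illustrative.

```shell
# merged_length R1_LEN R2_LEN OVERLAP
# Length of the sequence rebuilt from two overlapping mates (hypothetical helper).
merged_length() {
    echo $(( $1 + $2 - $3 ))
}

merged_length 150 150 100   # prints 200
merged_length 150 150 10    # prints 290
```

So for 2 x 150 bp data, merged reads anywhere between 150 bp and just under 300 bp are expected, depending on the insert size.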
Yes, my bad, that is very obvious. Thanks for pointing it out to me.