Question

Help with the reads filtering process

2.5 years ago

valentinavan ▴ 50

Hi,

I have metagenomic whole-genome sequencing data: paired-end reads with a fixed length of 150 bp, generated on an Illumina NovaSeq. The sequencing was done by a private company and the data came already trimmed and filtered, but they do not pass all of the FastQC tests, so I performed some additional cleaning steps myself. However, after all of this I end up with some reads longer than 150 bp, and now I am confused. Is this OK?

Here is what I have done, step by step:

Trim Nextera adapter sequences with Trimmomatic

for f in *_R1.fastq; do FILE="${f%_R1.fastq}"; trimmomatic PE -phred33 "${FILE}_R1.fastq" "${FILE}_R2.fastq" "path/${FILE}_trimmed_1.fastq" "path/${FILE}_unpaired_1.fastq" "path/${FILE}_trimmed_2.fastq" "path/${FILE}_unpaired_2.fastq" ILLUMINACLIP:path/nextera.fa.txt:2:30:10; done

Remove overrepresented poly-G sequences with bbduk

for f in *_trimmed_1.fastq; do FILE="${f%_trimmed_1.fastq}"; bbduk.sh in1="${FILE}_trimmed_1.fastq" in2="${FILE}_trimmed_2.fastq" out1="${FILE}_bbduk_1.fastq" out2="${FILE}_bbduk_2.fastq" entropy=0.5 entropywindow=50 entropyk=5; done

Remove duplicate reads with seqkit

for f in *.fastq; do seqkit rmdup -s "$f" > "${f%.fastq}.unique.fastq"; done

Merge paired-end reads with PEAR

for f in *_1.unique.fastq; do FILE="${f%_1.unique.fastq}"; pear -j 4 -n 30 -f "${FILE}_1.unique.fastq" -r "${FILE}_2.unique.fastq" -o "${FILE}.merged.unique.fastq"; done

Any help with theory, code, or tools would be much appreciated.

Thanks

Below is the plot of the read lengths:

[Figure: read length distribution]

"but they do not pass all of the fastQC tests"

There is no rule that says all FastQC tests need to pass before you can move on. The limits in FastQC are set for normal genomic sequence, and it is normal for one or more tests to fail. So always keep the context of your experiment in mind as you look at FastQC results.

If your data is indeed trimmed and cleaned, go on with your metagenomic analysis. Just make sure the reads are in sync across the R1/R2 files; if they are not, you will need to ask the company for the unprocessed data.
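One way to check whether R1 and R2 are in sync is to compare the read names in both files, record by record. The sketch below is only an illustration: it assumes plain (uncompressed) 4-line FASTQ records and read names that match once a trailing /1 or /2 and anything after the first space are dropped. The function names and file names are made up for this example.

```shell
# Sketch: report whether two FASTQ files list the same reads in the same order.
# Assumption: 4-line FASTQ records, uncompressed input.

read_names() {
    # Print the read name from each header line (line 1 of every 4-line record),
    # keeping only the first whitespace-separated field and dropping a trailing /1 or /2.
    awk 'NR % 4 == 1 { name = $1; sub(/\/[12]$/, "", name); print name }' "$1"
}

pairs_in_sync() {
    # Exit status 0 only if both files yield identical name lists.
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 2
    read_names "$1" > "$tmp1"
    read_names "$2" > "$tmp2"
    cmp -s "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}

# Hypothetical usage:
# if pairs_in_sync sample_R1.fastq sample_R2.fastq; then echo "in sync"; else echo "out of sync"; fi
```

Any step that processes R1 and R2 separately (for example, deduplicating each file on its own) can drop a read from one file but not the other, so it is worth running a check like this before merging.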

Instead of hopping around between different programs, much of what you are doing can be done within the BBTools suite using bbmerge.sh, bbduk.sh and clumpify.sh.
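As a rough sketch of what a BBTools-only version might look like, the function below prints one command per step for a single sample rather than running anything, so the plan can be reviewed first. The sample prefix, the output file names, and the adapter FASTA argument are all hypothetical; the entropy settings are copied from the question, and the adapter-trimming parameters are commonly used BBDuk settings that should be checked against the BBTools documentation for your version.

```shell
# Sketch only: print the BBTools commands for one sample instead of running them.
# Assumption: input files are named <sample>_R1.fastq and <sample>_R2.fastq.

bbtools_plan() {
    sample=$1      # hypothetical sample prefix, e.g. "sampleA"
    adapters=$2    # adapter FASTA to trim against (path is an assumption)

    # 1. Adapter trimming on both mates together, so pairs stay in sync.
    echo "bbduk.sh in1=${sample}_R1.fastq in2=${sample}_R2.fastq out1=${sample}_trim_1.fastq out2=${sample}_trim_2.fastq ref=${adapters} ktrim=r k=23 mink=11 hdist=1 tpe tbo"

    # 2. Low-complexity (e.g. poly-G) filtering, settings taken from the question.
    echo "bbduk.sh in1=${sample}_trim_1.fastq in2=${sample}_trim_2.fastq out1=${sample}_filt_1.fastq out2=${sample}_filt_2.fastq entropy=0.5 entropywindow=50 entropyk=5"

    # 3. Duplicate removal on pairs rather than on each file separately.
    echo "clumpify.sh in1=${sample}_filt_1.fastq in2=${sample}_filt_2.fastq out1=${sample}_dedup_1.fastq out2=${sample}_dedup_2.fastq dedupe"

    # 4. Merge overlapping pairs; pairs that do not merge go to outu1/outu2.
    echo "bbmerge.sh in1=${sample}_dedup_1.fastq in2=${sample}_dedup_2.fastq out=${sample}_merged.fastq outu1=${sample}_unmerged_1.fastq outu2=${sample}_unmerged_2.fastq"
}

# Hypothetical usage: review the plan, then run the lines by hand or pipe them to sh.
# bbtools_plan sampleA nextera.fa
```

Because every step takes both mates as in1/in2, the pairing is handled by the tools themselves instead of relying on separately processed files lining up.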

2.5 years ago by GenoMax 147k

You merge two 150bp reads and are surprised they are longer than 150bp? Do you understand what merging is?
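To make the arithmetic concrete: when two mates overlap and are merged into one sequence, the merged length is the sum of the two read lengths minus the overlap. The sketch below assumes a simple overlap in which neither read extends past the start of the other; the function names are invented for this example, and the second function just tabulates sequence lengths from an uncompressed 4-line FASTQ file.

```shell
# Sketch: expected length of a merged pair, and a length table for a FASTQ file.

merged_length() {
    # merged length = length of R1 + length of R2 - overlap
    echo $(( $1 + $2 - $3 ))
}

length_histogram() {
    # Count sequence lengths (line 2 of every 4-line record); output "length<TAB>count".
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}

# Hypothetical usage:
# merged_length 150 150 30      # two 150 bp reads overlapping by 30 bp
# length_histogram merged.fastq
```

With 150 bp reads, an overlap of 30 bp gives a 270 bp merged read, so lengths above 150 bp are exactly what merging is expected to produce.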