Help with the reads filtering process
2.5 years ago
valentinavan ▴ 50

Hi,

I have some metagenomic whole-genome sequencing paired-end data obtained with an Illumina NovaSeq machine, with a fixed read length of 150 bp. The sequencing was done by a private company and the data came already trimmed and filtered, but they do not pass all of the fastQC tests, so I performed some additional cleaning steps myself. However, after all of this, I end up with some reads longer than 150 bp and now I am confused. Is this OK?

Here is what I have done, step by step:

Trim NEXTERA sequences with Trimmomatic

for FILE in $(ls *_R1.fastq | sed 's/_R1.fastq//'); do trimmomatic PE -phred33 ${FILE}_R1.fastq ${FILE}_R2.fastq path/${FILE}_trimmed_1.fastq path/${FILE}_unpaired_1.fastq path/${FILE}_trimmed_2.fastq path/${FILE}_unpaired_2.fastq ILLUMINACLIP:path/nextera.fa.txt:2:30:10; done  

Remove overrepresented G-polymers with bbduk

for FILE in $(ls *_trimmed_1.fastq | sed 's/_trimmed_1.fastq//'); do bbduk.sh in1=${FILE}_trimmed_1.fastq in2=${FILE}_trimmed_2.fastq out1=${FILE}_bbduk_1.fastq out2=${FILE}_bbduk_2.fastq entropy=0.5 entropywindow=50 entropyk=5; done 

Remove duplicate reads with seqkit

for FILE in $(ls *.fastq | sed 's/.fastq//'); do seqkit rmdup -s ${FILE}.fastq > ${FILE}.unique.fastq; done

Merge pair ends

for FILE in $(ls *_1.unique.fastq | sed 's/_1\.unique\.fastq$//'); do pear -j 4 -n 30 -f ${FILE}_1.unique.fastq -r ${FILE}_2.unique.fastq -o ${FILE}.merged.unique.fastq; done

Any help with theory, code, or tools would be much appreciated.

Thanks

Below is the plot of the read lengths:

[Figure: read length distribution]

filtering trim • 890 views

but they do not pass all of the fastQC tests

There is no rule that says all FastQC tests need to pass before you can move on. The limits in FastQC are set for normal genomic sequence, and it is normal for one or more tests to fail. So always keep the context of your experiment in mind as you look at FastQC results.

If your data is indeed trimmed and cleaned, go on with your metagenomic analysis. First make sure the reads are in sync across the R1/R2 files; if they are not, you will need to ask the company for the unprocessed data.
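One way to check the sync is sketched below. It assumes plain 4-line FASTQ records and headers whose mate suffix is either `/1` and `/2` or a space-separated description; the function name `pairs_in_sync` and the script name are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: check that two FASTQ files list the same reads in the same order.
# Assumes uncompressed 4-line FASTQ records.

# Succeeds (exit status 0) only when R1 and R2 have the same number of lines
# and every header line names the same read in both files.
pairs_in_sync() {
    awk -v r2="$2" '
        # Strip the description and a trailing /1 or /2 from a FASTQ header.
        function read_id(h) { sub(/[ \t].*$/, "", h); sub(/\/[12]$/, "", h); return h }
        BEGIN { bad = 0 }
        {
            # Read the matching line from R2; fail if R2 ends early.
            if ((getline mate < r2) <= 0) { bad = 1; exit }
            # Header lines are lines 1, 5, 9, ... of each file.
            if (NR % 4 == 1 && read_id($0) != read_id(mate)) { bad = 1; exit }
        }
        END {
            # Fail if R2 still has lines left after R1 ended.
            if (!bad && (getline mate < r2) > 0) bad = 1
            exit bad
        }
    ' "$1"
}

# Usage: ./check_sync.sh sample_R1.fastq sample_R2.fastq
if [ "$#" -eq 2 ]; then
    if pairs_in_sync "$1" "$2"; then
        echo "in sync"
    else
        echo "OUT OF SYNC"
    fi
fi
```

Note that running a deduplication tool on R1 and R2 separately can drop different reads from each file, which is exactly the situation this check would flag.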

Instead of hopping around between different programs, much of what you are doing can be done within the BBTools suite using bbmerge.sh, bbduk.sh and clumpify.sh.
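A rough sketch of what that could look like for one sample is below. The file naming (`<sample>_R1.fastq`, `<sample>_R2.fastq`), the `clean_sample` and `run` helpers, and the `DRY_RUN` switch are assumptions made for this example. The adapter-trimming settings (`ktrim=r k=23 mink=11 hdist=1 tpe tbo`) are commonly used BBDuk values rather than anything from the original post, while the entropy settings and the adapter file path are copied from the question. Check each tool's help output before relying on these.

```shell
#!/usr/bin/env bash
# Sketch of a BBTools-only version of the cleaning steps for one sample.
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to run them.

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

clean_sample() {
    sample=$1
    # 1. Adapter trimming and low-entropy filtering, both mates handled together
    #    so the pairs stay in sync.
    run bbduk.sh in1="${sample}_R1.fastq" in2="${sample}_R2.fastq" \
        out1="${sample}_bbduk_1.fastq" out2="${sample}_bbduk_2.fastq" \
        ref=path/nextera.fa.txt ktrim=r k=23 mink=11 hdist=1 tpe tbo \
        entropy=0.5 entropywindow=50 entropyk=5
    # 2. Duplicate removal on read pairs rather than on each file separately.
    run clumpify.sh in1="${sample}_bbduk_1.fastq" in2="${sample}_bbduk_2.fastq" \
        out1="${sample}_dedup_1.fastq" out2="${sample}_dedup_2.fastq" dedupe
    # 3. Merge overlapping pairs; pairs that do not merge go to outu1/outu2.
    run bbmerge.sh in1="${sample}_dedup_1.fastq" in2="${sample}_dedup_2.fastq" \
        out="${sample}_merged.fastq" \
        outu1="${sample}_unmerged_1.fastq" outu2="${sample}_unmerged_2.fastq"
}

# Usage: ./bbtools_clean.sh sampleA sampleB ...
for s in "$@"; do
    clean_sample "$s"
done
```

Because every step takes both mates at once, the R1/R2 files should stay in sync throughout, which is not guaranteed when each file is deduplicated on its own.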


You merge two 150bp reads and are surprised they are longer than 150bp? Do you understand what merging is?
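To spell out the arithmetic: a merged read covers both mates, counting the overlapping bases only once. A tiny sketch, where `merged_length` is a made-up helper and the 30 bp overlap is just an illustrative number:

```shell
#!/usr/bin/env bash
# Length of a merged read = forward length + reverse length - overlap.
merged_length() {
    echo $(( $1 + $2 - $3 ))
}

merged_length 150 150 30    # prints 270
merged_length 150 150 150   # prints 150 (mates overlap completely)
```

So anything between 150 bp and just under 300 bp is expected after merging two 150 bp mates.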


Yes, my bad, that is very obvious. Thanks for pointing it out to me.
