Filtering Fastq based on GC content
12 weeks ago
hpapoli ▴ 150

Hello,

In my Illumina NovaSeq data, I have many reads consisting of G and C homopolymers. I tried the fastp --trim_poly_g option.

However, this option detects reads with at least 10 Gs at the end and trims those 10 Gs. If the whole read is made up of Gs, it stays in the output and is only 10 base pairs shorter. In addition, if G homopolymers appear in the middle of a read, this option does not remove them.

I could easily write a Python script to filter reads based on GC%, but given that I have 300 million reads, it would probably take forever to finish the job.
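For concreteness, the kind of script meant here might look roughly like the sketch below. It assumes plain 4-line FASTQ records, and the function names (`gc_fraction`, `longest_run`, `keep_read`, `filter_fastq`) and the thresholds are hypothetical choices for illustration, not taken from fastp or any other tool. It drops a read when its GC fraction is too high or when it contains a long run of G or C anywhere in the sequence.

```python
import io
from itertools import groupby


def gc_fraction(seq):
    """Fraction of G/C bases in seq (0.0 for an empty read)."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def longest_run(seq, bases="GC"):
    """Length of the longest homopolymer run made of a base in `bases`."""
    runs = (len(list(group)) for base, group in groupby(seq.upper()) if base in bases)
    return max(runs, default=0)


def keep_read(seq, max_gc=0.8, max_run=20):
    """Illustrative rule: reject GC-rich reads and reads with a long G/C run."""
    return gc_fraction(seq) <= max_gc and longest_run(seq) < max_run


def filter_fastq(lines, out, **thresholds):
    """Stream 4-line FASTQ records from `lines`, writing kept records to `out`.

    Assumes every record is exactly four lines; a truncated final record
    is silently dropped.
    """
    kept = total = 0
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        total += 1
        if keep_read(seq.strip(), **thresholds):
            out.write(header + seq + plus + qual)
            kept += 1
    return kept, total


# Tiny in-memory demo: one mixed read and one all-G read.
demo = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nGGGGGGGGGG\n+\nIIIIIIIIII\n"
)
out = io.StringIO()
kept, total = filter_fastq(demo, out, max_gc=0.8, max_run=8)
print(kept, total)  # 1 2
```

For real data the same function could be fed `gzip.open(path, "rt")` as `lines` and an open output file as `out`. It streams one record at a time, so memory is not the issue; the concern is that a pure-Python loop over 300 million reads is slow.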

Is there any way you would suggest for doing this filtering in an efficient way?

[Images: example before filtering from sample 1; example after filtering from sample 1]

12 weeks ago
GenoMax 148k

Use polyfilter.sh, a new tool available in the BBMap suite. See the post "New Illumina error mode, new BBTools release (39.09) to deal with it".
