Hello,
I have lots of polyG in my data. Please see the figure attached:
I did a series of filtering using BBTools as follows:
rule adapter_trim_bbduk:
    input:
        fastq1="data/fastq/{sample}_R1_001.fastq.gz",
        fastq2="data/fastq/{sample}_R2_001.fastq.gz"
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.adapters.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.adapters.fastq.gz",
        stats="data/fastq_bbtools_filtered/{sample}.stats.txt"
    params:
        adapter_fasta="data/adapters/adapters.fasta",
        mem="-Xmx36g"
    resources:
        threads=6,
        runtime="5h",
        mem_mb=36000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        bbduk.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2} \
            minlen=140 \
            threads={resources.threads} \
            ref={params.adapter_fasta} \
            stats={output.stats}
        """
rule filterbytile:
    input:
        fastq1="data/fastq_bbtools_filtered/{sample}_R1_001.adapters.fastq.gz",
        fastq2="data/fastq_bbtools_filtered/{sample}_R2_001.adapters.fastq.gz"
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.filterbytile.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.filterbytile.fastq.gz"
    params:
        mem="-Xmx55g"
    resources:
        threads=10,
        runtime="5h",
        mem_mb=60000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        filterbytile.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2}
        """
rule polyfilter:
    input:
        fastq1="data/fastq_bbtools_filtered/{sample}_R1_001.filterbytile.fastq.gz",
        fastq2="data/fastq_bbtools_filtered/{sample}_R2_001.filterbytile.fastq.gz",
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.polyfilter.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.polyfilter.fastq.gz",
        outb="data/fastq_bbtools_filtered/{sample}.homopolymers.fastq.gz"
    params:
        mem="-Xmx55g"
    resources:
        threads=10,
        runtime="6h",
        mem_mb=60000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        polyfilter.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2} \
            outb={output.outb} \
            polymers=GC
        """
rule bbduk_polyg:
    input:
        fastq1="data/fastq_bbtools_filtered/{sample}_R1_001.polyfilter.fastq.gz",
        fastq2="data/fastq_bbtools_filtered/{sample}_R2_001.polyfilter.fastq.gz",
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.bbdukpoly.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.bbdukpoly.fastq.gz",
        stats="data/fastq_bbtools_filtered/{sample}.bbdukpolyGstats"
    params:
        mem="-Xmx55g"
    resources:
        threads=10,
        runtime="6h",
        mem_mb=60000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        bbduk.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2} \
            minlen=140 \
            threads={resources.threads} \
            literal="polyG,polyC,polyGC,polyGA,polyGT" \
            trimpolya=20 \
            trimpolyg=20 \
            trimpolyc=20 \
            stats={output.stats}
        """
After filtering, the GC-content graph still looks very similar.
I really expected the bump around 65% to disappear. Is there something I'm missing in the filtering?
This is the stdout from polyfilter.sh for this sample:
Filtering Time: 7326.605 seconds.
Time: 14705.665 seconds.
Reads Processed: 609m 41.45k reads/sec
Bases Processed: 92048m 6.26m bases/sec
Reads Out: 598m 98.11%
Bases Out: 90308m 98.11%
Merged Reads: 13332444 2.1871%
Poly-G Reads: 4185084 0.6865%
Poly-C Reads: 1674134 0.2746%
Low Entropy Reads: 174464 0.0286%
Low Quality Reads: 0 0.0000%
Total Discards: 11525808 1.8907%
Unique 31-mers out: 2862959804
[Tue Oct 1 03:58:39 2024]
Finished job 0.
Thanks very much!
Particularly if you are looking at NovaSeqX data, please look at this thread:
New Illumina error mode, new BBTools release (39.09) to deal with it
It varies from run to run, but on NovaSeqX, 1.5% of R2 and 0.5% of R1 being affected by artifactual poly-Gs is common. They are independent, so the net rate of affected pairs can be around 2%. These poly-Gs are not necessarily terminal (at the end of a read, where trimming would be easy) nor pure (they can have occasional A, C, or T bases in the middle, making them harder to detect). However, on any 2-dye Illumina platform where G is the dark base, it's important to do adapter-trimming prior to poly-G detection, because short-insert reads will always have terminal poly-Gs: once there is no more DNA to sequence, the signal goes dark. That's a totally different and unrelated error mode.
Also, I don't know why, but at least our NovaSeqX produces poly-Cs too, but at a much lower rate. Maybe 1% of the rate of poly-Gs, while poly-A and poly-T are close to zero. They still happen though.
On the data I've been playing around with for a 70% GC organism, BBDuk with k=29 hdist=2 literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG will exclusively remove artifactual reads and no reads that map to the genome with zero mismatches (incidentally, in the currently unreleased BBTools 39.11, you can just say "literal=polyg" because typing all those Gs got annoying for me). k=25 hdist=2 does remove a handful of reads that map to the genome allowing a couple mismatches. So to be perfectly safe I think k=29 hdist=2 is a better option (for removing artifactual reads that can't possibly be genomic).

Also, in BBTools 39.10, I improved the poly-G trimming in BBDuk so that it is more tolerant of intermittent non-G bases. Also, I wrote PolyFilter, which only removes reads that have poly-Gs and also some other problem (low quality, low kmer depth, or low entropy).

So what I currently recommend (for NovaSeqX, but it's designed to remove only artifactual sequence, so it should be safe on any Illumina platform) is this:
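As a standalone command, the k=29 hdist=2 removal described above might look like the sketch below. This is an illustration, not a quoted command from the thread: the input/output file names are placeholders, and outm/outm2 are just there to capture the discarded reads for inspection.

```shell
# Remove read pairs matching a pure poly-G 29-mer with up to 2 mismatches.
# File names are placeholders; the literal is 32 Gs as described above.
bbduk.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz \
    out1=nog_R1.fq.gz out2=nog_R2.fq.gz \
    outm=polyg_R1.fq.gz outm2=polyg_R2.fq.gz \
    literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG \
    k=29 hdist=2
```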
1) Adapter-trim. You must do this first, or else the poly-Gs you detect will mostly be from short inserts, which do not need to be discarded. E.g.
2) Polyfilter works better on reads that have recalibrated quality scores, but that's kind of complicated, so... just run it. It will still work fine as long as you have an isolate. For a metagenome/rna-seq you really need correct quality scores, though, since depth is not a useful metric there.
3) I'm still improving polyfilter but for BBTools 39.10 you still need a post-filtering operation to do trimming (it's integrated with polyfilter in 39.11).
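The three steps above could be sketched as a pipeline like the following. This is a hedged illustration, not a command set quoted from the thread: file names and trimming thresholds are placeholders, and the adapter-trimming flags follow the common BBDuk recipe (ktrim=r k=23 mink=11 hdist=1 tbo tpe), which may need adjusting for a particular library.

```shell
# 1) Adapter-trim first, so terminal poly-Gs from short inserts are not
#    mistaken for the NovaSeqX artifact.
bbduk.sh in1=raw_R1.fq.gz in2=raw_R2.fq.gz \
    out1=trim_R1.fq.gz out2=trim_R2.fq.gz \
    ktrim=r k=23 mink=11 hdist=1 ref=adapters.fa tbo tpe

# 2) Remove reads that have poly-Gs AND some other problem
#    (low quality, low kmer depth, or low entropy).
polyfilter.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz \
    out1=pf_R1.fq.gz out2=pf_R2.fq.gz

# 3) In 39.10, trim remaining poly-G/poly-C tails in a separate BBDuk pass
#    (integrated into polyfilter in 39.11). Thresholds are placeholders.
bbduk.sh in1=pf_R1.fq.gz in2=pf_R2.fq.gz \
    out1=clean_R1.fq.gz out2=clean_R2.fq.gz \
    trimpolyg=6 trimpolyc=6 minlen=140
```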
Anyway, polyfilter is depth-sensitive, so it's fine on isolates but could remove some important low-depth things on metagenomes and so forth. For NovaSeqX, run it on everything, because any poly-G it produces is most likely artifactual. But on other platforms I would only advise using it with default settings on libraries where you expect a decent (>20x) depth. It's very robust against removing real genomic poly-Gs, but it is still possible in samples where the depth is very low (like 1x) and there are real runs of >19 Gs in the genome (rare, but it can happen, especially in high-GC organisms).

So if you expect low depth, you need to increase some of the thresholds, and you probably need to test the effects (for example, map the removed reads to a genome assembly and see if any of them map with no mismatches; if they do, your thresholds need to be tightened).

Note that polyfilter removes BOTH reads in a pair if either fails, so if you do this testing, you need to do it on unpaired reads (int=f for interleaved files, or each file individually for paired files), because typically the mate of a removed read will map just fine.
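The threshold check described above could be sketched like this. It is an illustration under assumptions: it assumes you have an assembly (assembly.fa is a placeholder), and it uses polyfilter's outb= stream (as in the rules earlier in this thread) to capture the discarded reads, which are then mapped unpaired.

```shell
# Capture what polyfilter discards into a separate file.
polyfilter.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz \
    out1=pf_R1.fq.gz out2=pf_R2.fq.gz \
    outb=removed.fq.gz

# Map the removed reads individually (int=f, not as pairs) and keep only
# perfect, mismatch-free alignments. Any reads in the output suggest the
# thresholds are removing real genomic sequence and should be tightened.
bbmap.sh ref=assembly.fa in=removed.fq.gz int=f \
    perfectmode=t outm=removed_perfect.sam
```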