Hello,
I have lots of polyG in my data. Please see the figure attached:
I did a series of filtering using BBTools as follows:
rule adapter_trim_bbduk:
    input:
        fastq1="data/fastq/{sample}_R1_001.fastq.gz",
        fastq2="data/fastq/{sample}_R2_001.fastq.gz"
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.adapters.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.adapters.fastq.gz",
        stats="data/fastq_bbtools_filtered/{sample}.stats.txt"
    params:
        adapter_fasta="data/adapters/adapters.fasta",
        mem="-Xmx36g"
    resources:
        threads=6,
        runtime="5h",
        mem_mb=36000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        bbduk.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2} \
            minlen=140 \
            threads={resources.threads} \
            ref={params.adapter_fasta} \
            stats={output.stats}
        """
rule filterbytile:
    input:
        fastq1="data/fastq_bbtools_filtered/{sample}_R1_001.adapters.fastq.gz",
        fastq2="data/fastq_bbtools_filtered/{sample}_R2_001.adapters.fastq.gz"
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.filterbytile.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.filterbytile.fastq.gz"
    params:
        mem="-Xmx55g"
    resources:
        threads=10,
        runtime="5h",
        mem_mb=60000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        filterbytile.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2}
        """
rule polyfilter:
    input:
        fastq1="data/fastq_bbtools_filtered/{sample}_R1_001.filterbytile.fastq.gz",
        fastq2="data/fastq_bbtools_filtered/{sample}_R2_001.filterbytile.fastq.gz",
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.polyfilter.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.polyfilter.fastq.gz",
        outb="data/fastq_bbtools_filtered/{sample}.homopolymers.fastq.gz"
    params:
        mem="-Xmx55g"
    resources:
        threads=10,
        runtime="6h",
        mem_mb=60000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        polyfilter.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2} \
            outb={output.outb} \
            polymers=GC
        """
rule bbduk_polyg:
    input:
        fastq1="data/fastq_bbtools_filtered/{sample}_R1_001.polyfilter.fastq.gz",
        fastq2="data/fastq_bbtools_filtered/{sample}_R2_001.polyfilter.fastq.gz",
    output:
        fastq_out1="data/fastq_bbtools_filtered/{sample}_R1_001.bbdukpoly.fastq.gz",
        fastq_out2="data/fastq_bbtools_filtered/{sample}_R2_001.bbdukpoly.fastq.gz",
        stats="data/fastq_bbtools_filtered/{sample}.bbdukpolyGstats"
    params:
        mem="-Xmx55g"
    resources:
        threads=10,
        runtime="6h",
        mem_mb=60000
    singularity:
        "containers/bbtools_39.10.sif"
    shell:
        """
        bbduk.sh {params.mem} \
            in1={input.fastq1} \
            in2={input.fastq2} \
            out1={output.fastq_out1} \
            out2={output.fastq_out2} \
            minlen=140 \
            threads={resources.threads} \
            literal="polyG,polyC,polyGC,polyGA,polyGT" \
            trimpolya=20 \
            trimpolyg=20 \
            trimpolyc=20 \
            stats={output.stats}
        """
After filtering, the GC-content graph still looks very similar.
I really expected the bump around 65% to disappear. Is there something I'm missing in the filtering?
This is the stdout from polyfilter.sh for this sample:
Filtering Time: 7326.605 seconds.
Time: 14705.665 seconds.
Reads Processed: 609m 41.45k reads/sec
Bases Processed: 92048m 6.26m bases/sec
Reads Out: 598m 98.11%
Bases Out: 90308m 98.11%
Merged Reads: 13332444 2.1871%
Poly-G Reads: 4185084 0.6865%
Poly-C Reads: 1674134 0.2746%
Low Entropy Reads: 174464 0.0286%
Low Quality Reads: 0 0.0000%
Total Discards: 11525808 1.8907%
Unique 31-mers out: 2862959804
[Tue Oct 1 03:58:39 2024]
Finished job 0.
Thanks very much!
Particularly if you are looking at NovaSeqX data, please look at this thread:
New Illumina error mode, new BBTools release (39.09) to deal with it
It varies from run to run, but on NovaSeqX, 1.5% of R2 and 0.5% of R1 being affected by artifactual poly-Gs is common. They are independent, so the net rate of affected pairs can be around 2%. These poly-Gs are not necessarily terminal (at the end of a read, where trimming would be easy) nor pure (they can have occasional A, C, or T bases in the middle, making them harder to detect). However, on any 2-dye Illumina platform where G is the dark base, it's important to do adapter-trimming prior to poly-G detection, because short-insert reads will always have terminal poly-Gs: once there is no more DNA to sequence, the signal goes dark. That's a totally different and unrelated error mode.
Also, I don't know why, but at least our NovaSeqX produces poly-Cs too, but at a much lower rate. Maybe 1% of the rate of poly-Gs, while poly-A and poly-T are close to zero. They still happen though.
On the data I've been playing around with for a 70% GC organism, BBDuk with k=29 hdist=2 literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG will exclusively remove artifactual reads and no reads that map to the genome with zero mismatches (incidentally, in the currently unreleased BBTools 39.11, you can just say "literal=polyg" because typing all those Gs got annoying for me). k=25 hdist=2 does remove a handful of reads that map to the genome allowing a couple mismatches. So to be perfectly safe I think k=29 hdist=2 is a better option (for removing artifactual reads that can't possibly be genomic).

Also, in BBTools 39.10, I improved the poly-G trimming in BBDuk so that it is more tolerant of intermittent non-G bases. Also, I wrote PolyFilter, which only removes reads that have poly-Gs and also some other problem (low quality, low kmer depth, or low entropy).

So what I currently recommend (for NovaSeqX, but it's designed to remove only artifactual sequence, so it should be safe on any Illumina platform) is this:
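As a standalone command, the k=29 hdist=2 removal described above might look like the sketch below. This is an illustration, not a quoted command from the thread: the input/output file names are placeholders, and outm/outm2 are just there to capture the discarded reads for inspection.

```shell
# Remove read pairs matching a pure poly-G 29-mer with up to 2 mismatches.
# File names are placeholders; the literal is 32 Gs as described above.
bbduk.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz \
    out1=nog_R1.fq.gz out2=nog_R2.fq.gz \
    outm=polyg_R1.fq.gz outm2=polyg_R2.fq.gz \
    literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG \
    k=29 hdist=2
```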
1) Adapter-trim. You must do this first, or else the poly-Gs you detect will mostly be from short inserts, which do not need to be discarded. E.g.
2) Polyfilter works better on reads that have recalibrated quality scores, but that's kind of complicated, so... just run it. It will still work fine as long as you have an isolate. For a metagenome/rna-seq you really need correct quality scores, though, since depth is not a useful metric there.
3) I'm still improving polyfilter but for BBTools 39.10 you still need a post-filtering operation to do trimming (it's integrated with polyfilter in 39.11).
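The three steps above could be sketched as a pipeline like the following. This is a hedged illustration, not a command set quoted from the thread: file names and trimming thresholds are placeholders, and the adapter-trimming flags follow the common BBDuk recipe (ktrim=r k=23 mink=11 hdist=1 tbo tpe), which may need adjusting for a particular library.

```shell
# 1) Adapter-trim first, so terminal poly-Gs from short inserts are not
#    mistaken for the NovaSeqX artifact.
bbduk.sh in1=raw_R1.fq.gz in2=raw_R2.fq.gz \
    out1=trim_R1.fq.gz out2=trim_R2.fq.gz \
    ktrim=r k=23 mink=11 hdist=1 ref=adapters.fa tbo tpe

# 2) Remove reads that have poly-Gs AND some other problem
#    (low quality, low kmer depth, or low entropy).
polyfilter.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz \
    out1=pf_R1.fq.gz out2=pf_R2.fq.gz

# 3) In 39.10, trim remaining poly-G/poly-C tails in a separate BBDuk pass
#    (integrated into polyfilter in 39.11). Thresholds are placeholders.
bbduk.sh in1=pf_R1.fq.gz in2=pf_R2.fq.gz \
    out1=clean_R1.fq.gz out2=clean_R2.fq.gz \
    trimpolyg=6 trimpolyc=6 minlen=140
```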
Anyway, polyfilter is depth-sensitive, so it's fine on isolates but could remove some important low-depth things on metagenomes and so forth. For NovaSeqX, run it on everything, because any poly-G it produces is most likely artifactual. But on other platforms I would only advise using it with default settings on libraries where you expect a decent (>20x) depth. It's very robust against removing real genomic poly-Gs, but it is still possible in samples where the depth is very low (like 1x) and there are real runs of >19 Gs in the genome (rare, but it can happen, especially in high-GC organisms).

So if you expect low depth, you need to increase some of the thresholds, and you probably need to test the effects (for example, map the removed reads to a genome assembly and see if any of them map with no mismatches; if they do, your thresholds need to be tightened).

Note that polyfilter removes BOTH reads in a pair if either fails, so if you do this testing, you need to do it on unpaired reads (int=f for interleaved files, or each file individually for paired files), because typically the mate of a removed read will map just fine.
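The threshold check described above could be sketched like this. It is an illustration under assumptions: it assumes you have an assembly (assembly.fa is a placeholder), and it uses polyfilter's outb= stream (as in the rules earlier in this thread) to capture the discarded reads, which are then mapped unpaired.

```shell
# Capture what polyfilter discards into a separate file.
polyfilter.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz \
    out1=pf_R1.fq.gz out2=pf_R2.fq.gz \
    outb=removed.fq.gz

# Map the removed reads individually (int=f, not as pairs) and keep only
# perfect, mismatch-free alignments. Any reads in the output suggest the
# thresholds are removing real genomic sequence and should be tightened.
bbmap.sh ref=assembly.fa in=removed.fq.gz int=f \
    perfectmode=t outm=removed_perfect.sam
```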