Increasing per base G content in QC sequencing files
9 months ago
robertsr • 0

Hello,

I conducted some metagenomic sequencing as follows:

  • Metagenomic sequencing from human stool samples
  • PCR-free library prep using NEB kit (450bp insert)
  • Illumina NovaSeq X Plus sequencing (150 bp paired-end), 96 samples multiplexed on 1 lane

I have just got back the QC results and the Q30 scores look good (87-90% for all samples). However, the per-base content along the reads looks strange for some samples: the G content begins to increase towards the end of both reads, whilst the C content (and sometimes the A/T content) begins to decrease. See the attached photos from a mixture of different samples. This occurs only in some samples; the others remain relatively stable in GC/AT content.

Should I be worried about this? Any advice would be helpful.

Thanks!

[7 attached images not shown: base content plots for a mixture of different samples]

FastQC GC-content QC NovaSeq • 747 views

Should you be worried? Likely not. Keep it in the back of your mind and proceed with the rest of the analysis. If something turns out to be amiss, backtrack to figure it out.

9 months ago

Mind that with the Illumina NovaSeq X Plus chemistry, "absence of signal" is base-called as G. The increasing G signal towards the end of the reads is therefore presumably indicative of a biological or technical issue that caused clusters to be lost to detection. After n cycles (~100), these clusters likely dropped out entirely, resulting in reads that exhibit G-homopolymers at the 3' end.
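
To get a rough feel for how many reads are affected before reaching for a dedicated tool, the poly-G tails described above can be counted directly. This is a minimal sketch, not a tested pipeline: the function name `count_polyg_tails` and the default threshold of 20 trailing Gs are illustrative assumptions, and it assumes a plain 4-line-per-record FASTQ on standard input.

```shell
# count_polyg_tails [MIN_RUN]
# Hypothetical helper: reads an uncompressed 4-line FASTQ on stdin and prints
# "<reads ending in at least MIN_RUN Gs> <total reads>".
# MIN_RUN defaults to 20, an arbitrary illustrative threshold.
count_polyg_tails() {
    awk -v n="${1:-20}" '
        NR % 4 == 2 {                      # sequence line of each record
            total++
            # match() sets RLENGTH to the length of the trailing run of Gs
            if (match($0, /G+$/) && RLENGTH >= n) hits++
        }
        END { printf "%d %d\n", hits + 0, total + 0 }
    '
}
```

For a gzipped file this could be run as `gzip -dc read1.fastq.gz | count_polyg_tails 20`. Comparing the fraction between affected and unaffected samples (and between read 1 and read 2) may help show whether the problem tracks with particular libraries.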

This could be due to adapter dimers in your library (wild speculation), particularly if low-input samples are predominantly affected. To better understand what the issue might be, I think you should look specifically at the affected reads and at the sequences preceding the dropout. Not tested, but BBDuk should be able to extract those:

bbduk.sh in="read1.fastq.gz" in2="read2.fastq.gz" outm="read1_polyG.fastq.gz" outm2="read2_polyG.fastq.gz" \
                        stats="sample.stats" \
                        k=23 \
                        literal="GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG" \
                        hammingdistance=1 \
                        removeifeitherbad=t \
                        pratio=G,C \
                        plen=30

Afterwards, you can use clumpify.sh to order them by similarity for easier inspection.
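
As a complement to eyeballing the extracted reads, the bases immediately upstream of each poly-G tail can be tallied; if one sequence dominates, it is worth checking whether it matches the adapter. The sketch below is untested on real data and makes several assumptions: `polyg_prefixes` is a made-up name, the defaults (tail of at least 20 Gs, 12 bases of upstream context) are arbitrary, and the input is a plain 4-line FASTQ on standard input.

```shell
# polyg_prefixes [MIN_RUN] [PREFIX_LEN]
# Hypothetical helper: for every read ending in at least MIN_RUN Gs, take the
# PREFIX_LEN bases directly upstream of the G run and print a tab-separated
# "count<TAB>prefix" table, most frequent prefix first.
polyg_prefixes() {
    awk -v n="${1:-20}" -v k="${2:-12}" '
        # RSTART is where the trailing G run begins; require k bases before it
        NR % 4 == 2 && match($0, /G+$/) && RLENGTH >= n && RSTART > k {
            count[substr($0, RSTART - k, k)]++
        }
        END { for (p in count) printf "%d\t%s\n", count[p], p }
    ' | sort -k1,1nr -k2,2
}
```

For example, `gzip -dc read1_polyG.fastq.gz | polyg_prefixes 20 12 | head` would list the most common 12-mers seen right before the dropout.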

You can also use BBDuk to filter/trim/mask the affected reads before proceeding to downstream analysis. Just use out= instead of outm= to save the cleaned reads instead of the affected ones.
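
To make the trimming idea concrete, here is a plain awk stand-in that clips a trailing G run from each read. It is not BBDuk's implementation and is only a sketch: `trim_polyg_tails` and its defaults (clip runs of at least 20 Gs, drop reads shorter than 30 bases afterwards) are assumptions for illustration. It also treats reads as single-end, so dropping a short read would break pairing; for paired data a pair-aware trimmer is the safer choice.

```shell
# trim_polyg_tails [MIN_RUN] [MIN_LEN]
# Hypothetical helper: reads a 4-line FASTQ on stdin, removes a trailing run of
# at least MIN_RUN Gs (and the matching quality values), and prints only reads
# that are still at least MIN_LEN bases long.
trim_polyg_tails() {
    awk -v n="${1:-20}" -v minlen="${2:-30}" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            qual = $0
            if (match(seq, /G+$/) && RLENGTH >= n) {
                # keep everything before the start of the G run
                seq  = substr(seq,  1, RSTART - 1)
                qual = substr(qual, 1, RSTART - 1)
            }
            if (length(seq) >= minlen)
                print header "\n" seq "\n" plus "\n" qual
        }
    '
}
```

Something like `gzip -dc read1.fastq.gz | trim_polyg_tails 20 30 | gzip > read1_trimmed.fastq.gz` would then write the clipped reads.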


I second the short insert suggestion. These are 150 bp reads and, considering this is metagenomic sequencing, it is possible that you are simply looking at the poly-Gs that show up once the adapter is read through. If that is true, normal trimming of the data should also get rid of them.

