Question

Per Sequence GC Content fail

0

6 months ago

Eduardo ▴ 20

Hello, I am new to NGS quality control and my data fail the Per Sequence GC Content check. Is it a contamination problem? The data are exome sequencing from one patient (the same patient on two lanes, paired-end), generated on an Illumina NovaSeq 6000 with the Illumina DNA Prep with Enrichment kit. The quality control tools used were Falco and MultiQC.

I ran a QC step with fastp, removing the first 15 bases at the 5' end of the reads and the last three bases at the 3' end. After QC, some modules improved: Per base sequence content and Overrepresented sequences. However, Per Sequence GC Content is still failing, and I am worried it might be contamination. Would this problem hinder screening this patient for pathogenic variants? Can anyone help me?
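The trimming described above can be sketched as follows. This is only an illustration: the actual fastp command was not posted in the thread, so the file names are hypothetical and the option set is an assumption. The flags shown (`--trim_front1`, `--trim_tail1`, `--trim_front2`, `--trim_tail2`) are fastp's fixed-length trimming options for read 1 and read 2. The `trim_read` helper shows the same operation on a single record in plain Python.

```python
def trim_read(seq, qual, front=15, tail=3):
    """Drop `front` bases from the 5' end and `tail` bases from the 3' end.

    The quality string is trimmed identically so the two stay aligned.
    Reads shorter than front + tail come back empty.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = max(len(seq) - tail, front)
    return seq[front:end], qual[front:end]


def fastp_command(r1_in, r2_in, r1_out, r2_out, front=15, tail=3):
    """Build a paired-end fastp call with fixed 5'/3' trimming.

    Assumption: same trim lengths on both mates; file names are placeholders.
    """
    return [
        "fastp",
        "-i", r1_in, "-I", r2_in,      # paired input FASTQ files
        "-o", r1_out, "-O", r2_out,    # paired output FASTQ files
        "--trim_front1", str(front), "--trim_tail1", str(tail),
        "--trim_front2", str(front), "--trim_tail2", str(tail),
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(fastp_command(
        "sample_L001_R1.fastq.gz", "sample_L001_R2.fastq.gz",
        "sample_L001_R1.fastp.fastq.gz", "sample_L001_R2.fastp.fastq.gz",
    )))
```

Fixed-length trimming of this kind changes the per-base composition at the read ends, which is why the Per base sequence content module can improve while the per-read GC distribution stays essentially the same.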

fastqc • 1.7k views

ADD COMMENT • link 4 months ago by Eduardo ▴ 20

1

I think it is too much to expect that anyone will follow this Google Drive link, download the file, unzip it and go through it just so they can help you. You can help us help you by posting relevant images here. For example, I have a FastQC report where I right-clicked and saved a GC-content image, and here it is:

GC-quality

ADD REPLY • link 6 months ago by Mensur Dlakic ★ 29k

0

Hello, sorry for not posting the images. I will post the GC plot before and after QC respectively.

Sequence GC Content before QC

Sequence GC Content after QC

ADD REPLY • link 6 months ago by Eduardo ▴ 20

1

Hopefully it is clear that your GC-plot is similar to the one I posted. Nothing was wrong with my dataset and it is likely that your dataset is fine, as GenoMax suggested.

ADD REPLY • link 6 months ago by Mensur Dlakic ★ 29k

0

Hello, I am deeply grateful for the help. I analyzed the FASTQ files with FastQ Screen and could not find any contamination. I also tried keeping only the reads that hit the human genome (as a test), and even then the GC content check still fails. I think something might have gone wrong in the library preparation, the tagmentation, or another step. What do you think?

MultiQC report of the FastQ Screen results for the samples (L001 = lane 1, L002 = lane 2, R1 = forward, R2 = reverse) before and after QC (FASTQ files with the "fastp" suffix have undergone fastp QC):

Falco GC content report for a FASTQ file (after QC) filtered to keep only human genome hits:

Falco GC content report

ADD REPLY • link 5 months ago by Eduardo ▴ 20

1

"the GC content is still failing."

As already indicated, please don't get hung up on tests "failing" in FastQC. Unless you see a clear problem with your final results, these tests should be treated as checkpoints along the way.

ADD REPLY • link 5 months ago by GenoMax 153k

0

Thanks, I ran a germline pipeline with Nextflow (Sarek).

ADD REPLY • link 5 months ago by Eduardo ▴ 20

0

I do not know what a good value is for the fraction of the reference (%) covered in an exome. I tried a test alignment and got this plot from Qualimap:

Qualimap plot

Sorry for asking so many questions. It is my first time doing a whole bioinformatics pipeline alone.

ADD REPLY • link 5 months ago by Eduardo ▴ 20

1

This plot is not informative since it is likely computed over the entire genome. You should use the BED file for the exome kit you used and then calculate coverage as shown here --> How to calculate the coverage of list of genes in whole exome data
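As a rough illustration of coverage restricted to the kit's targets, the sketch below computes the mean depth and the fraction of target bases at or above a depth threshold. It assumes a three-column BED file (0-based, half-open intervals) and per-base depth rows in the tab-separated `chrom`, 1-based `pos`, `depth` layout that `samtools depth` writes. The 20x threshold and the function names are assumptions for the example, not values taken from the thread or the linked post.

```python
from collections import defaultdict


def parse_bed(lines):
    """Return (chrom, start, end) tuples from BED lines (0-based, half-open)."""
    targets = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        targets.append((chrom, int(start), int(end)))
    return targets


def parse_depth(lines):
    """Return {chrom: {1-based position: depth}} from tab-separated depth rows."""
    depth = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, d = line.rstrip("\n").split("\t")[:3]
        depth[chrom][int(pos)] = int(d)
    return depth


def target_coverage(targets, depth, min_depth=20):
    """Summarise depth over target bases only.

    Positions missing from `depth` count as depth 0. Overlapping BED
    intervals are counted once per interval, so merge the BED file first
    if the kit's targets overlap.
    """
    total = covered = depth_sum = 0
    for chrom, start, end in targets:
        per_base = depth.get(chrom, {})
        for pos in range(start + 1, end + 1):  # BED 0-based -> 1-based positions
            d = per_base.get(pos, 0)
            total += 1
            depth_sum += d
            covered += d >= min_depth
    if total == 0:
        raise ValueError("the BED file contains no target bases")
    return {
        "target_bases": total,
        "mean_depth": depth_sum / total,
        "fraction_at_min_depth": covered / total,
    }


if __name__ == "__main__":
    # Toy data: one 5 bp target, three positions with reported depth.
    bed = ["chr1\t100\t105"]
    rows = ["chr1\t101\t30", "chr1\t102\t25", "chr1\t103\t10"]
    print(target_coverage(parse_bed(bed), parse_depth(rows)))
```

Holding every position in a dictionary is fine for a toy example but memory-hungry for a real exome; dedicated tools that accept a BED file scale much better. The point here is only that the denominator is the target bases, not the whole genome.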

ADD REPLY • link 5 months ago by GenoMax 153k

0

Thanks for showing me another way to see coverage. I am going to try this too. I ran a Nextflow pipeline (Sarek), including the BED file from the exome kit, and obtained a better result. I also gave the BED file (intervals) to Qualimap to see the coverage and the fraction of the reference covered. I got this:

Genome coverage

What do you think? Is this good coverage?

ADD REPLY • link 4 months ago by Eduardo ▴ 20

score 4 · Accepted Answer · 2025-02-13

One or more tests "failing" in FastQC is not an immediate reason to stop moving forward with the analysis. Those results need to be considered in the context of the kind of data one is using. The QC intervals used by FastQC are set up for plain genomic sequencing, and any other type of data (e.g. exome) could lead to observations like yours.

Move forward with the rest of the analysis and circle back if you notice any major issues in your results.
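To make the point about thresholds concrete, here is a minimal sketch of the kind of comparison behind a per-sequence GC check: bin reads by GC percentage, fit a normal curve to the observed distribution, and report how much of the data deviates from that curve. It is loosely modelled on the documented behaviour of the FastQC module and is not the actual FastQC or Falco code. In particular, it centres the curve on the mean, which is an assumption of this example. Enriched libraries such as exomes often give a skewed or shifted curve, so a large deviation can appear without any contamination.

```python
import math


def gc_percent(seq):
    """GC percentage of one read, ignoring N and other non-ACGT symbols."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads):
    """Count reads per integer GC percentage (bins 0..100)."""
    hist = [0] * 101
    for seq in reads:
        gc = gc_percent(seq)
        if gc is not None:
            hist[int(round(gc))] += 1
    return hist


def deviation_from_normal(hist):
    """Sum of |observed - expected| over all bins, as a percentage of reads.

    `expected` is a normal curve with the histogram's own mean and variance,
    scaled to the total read count.
    """
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    mean = sum(i * c for i, c in enumerate(hist)) / total
    var = sum(c * (i - mean) ** 2 for i, c in enumerate(hist)) / total
    if var == 0:
        raise ValueError("all reads fall in one GC bin; no curve can be fitted")
    norm = total / math.sqrt(2 * math.pi * var)
    expected = [norm * math.exp(-((i - mean) ** 2) / (2 * var)) for i in range(101)]
    return 100.0 * sum(abs(e - o) for e, o in zip(expected, hist)) / total


if __name__ == "__main__":
    # Toy reads only; real input would be the sequence lines of a FASTQ file.
    reads = ["GGCCGGCC", "ATATATAT", "ACGTACGT", "GGCCATAT", "NNNNNNNN"]
    print(deviation_from_normal(gc_histogram(reads)))
```

The FastQC documentation describes a warning when this kind of summed deviation exceeds 15% of the reads and a failure above 30%. A "fail" therefore says the distribution is not bell-shaped around a single centre, which is a prompt to look at the plot rather than a verdict on the sample.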