Hello, I am new to NGS quality control and am facing a Per Sequence GC Content failure. Is it a contamination problem? The data is exome sequencing from one patient (the same patient on two lanes, paired-end), generated on an Illumina NovaSeq 6000 with the Illumina DNA Prep with Enrichment kit. The quality control software used was Falco and MultiQC.
I ran QC with fastp, trimming the first 15 bases at the 5' end of the reads and the last three bases at the 3' end. After QC, I noticed some modules improved: Per base sequence content and Overrepresented sequences. However, Per Sequence GC Content is still failing, and I am worried it might be contamination. Could this problem hinder pathogenic variant screening for this patient? Can anyone help me?
I think it is too much to expect that anyone will follow this Google Drive link, download the file, unzip it and go through it just so they can help you. You can help us help you by posting relevant images here. For example, I have a FastQC report where I right-clicked and saved a GC-content image, and here it is:
Hello, sorry for not posting the images. I will post the GC plot before and after QC respectively.
Hopefully it is clear that your GC-plot is similar to the one I posted. Nothing was wrong with my dataset and it is likely that your dataset is fine, as GenoMax suggested.
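For anyone who wants to see what the Per Sequence GC Content module is summarizing, here is a minimal sketch that builds a per-read GC histogram from a FASTQ file. It is not the FastQC or Falco implementation: the function names are made up for illustration, ignoring N bases is an assumption, and it assumes plain four-line FASTQ records. FastQC compares a distribution like this one against a modelled normal distribution, which is why a shifted or broad curve can be flagged even when nothing is contaminated.

```python
import gzip
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of one read, ignoring N bases (an assumption)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(lines):
    """Count reads per rounded GC percentage from an iterable of FASTQ lines."""
    hist = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the sequence is the second line of each 4-line record
            hist[round(gc_percent(line.strip()))] += 1
    return hist


def gc_histogram_from_file(path):
    """Open a plain or gzipped FASTQ file (hypothetical path) and build the histogram."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return gc_histogram(handle)
```

Running `gc_histogram_from_file` on the same file before and after trimming gives two histograms that can be compared directly, which shows whether trimming changed the shape of the curve at all.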
Hello, I am deeply grateful for the help. I analyzed the FASTQ files with FastQ Screen, and I could not find any contamination. I also tried filtering for only human genome hits (as a test), and even after this, the GC content is still failing. I think some issue might have happened in the library prep, tagmentation, or another step. What do you think?
MultiQC report for FastQ Screen with the samples (L001 = lane 1, L002 = lane 2, R1 = forward, R2 = reverse) before and after QC (FASTQ files with the "fastp" suffix have undergone fastp QC):
Falco GC content report for a FASTQ file (after QC) filtered for human genome hits only:
As already indicated, please don't get hung up on tests "failing" in FastQC. Unless you see a clear problem with your final results, these tests should be used as a point along the way.
Thanks, I ran a germline pipeline with Nextflow Sarek.
I do not know what a good value is for the fraction of reference (%) covered in the exome. I tried an alignment test and got this plot from Qualimap:
Sorry for asking so many questions. It is my first time doing a whole bioinformatics pipeline alone.
This plot is not informative since it likely is using the entire genome. You should use a BED file for the exome kit you used and then calculate coverage as shown here --> How to calculate the coverage of list of genes in whole exome data
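As a rough illustration of target-restricted coverage (not necessarily the method in the linked post), the sketch below summarizes per-base depth over the exome targets. It assumes you have already produced three-column, tab-separated text (chromosome, position, depth), for example with `samtools depth -a -b targets.bed sample.bam`, where `targets.bed` and `sample.bam` are placeholder names. The function name and the depth thresholds are hypothetical choices, not recommended cutoffs.

```python
def summarize_depth(lines, thresholds=(1, 10, 20, 30)):
    """Summarize per-base depth lines of the form 'chrom<TAB>pos<TAB>depth'.

    Returns the number of positions, the mean depth, and the fraction of
    positions covered at or above each threshold.
    """
    total = 0
    depth_sum = 0
    at_least = {t: 0 for t in thresholds}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        total += 1
        depth_sum += depth
        for t in thresholds:
            if depth >= t:
                at_least[t] += 1
    if total == 0:
        raise ValueError("no depth positions found")
    return {
        "positions": total,
        "mean_depth": depth_sum / total,
        "fraction_at_least": {t: at_least[t] / total for t in thresholds},
    }
```

With the depth output saved to a file, `summarize_depth(open("depth.txt"))` (again a placeholder file name) reports the mean depth and the fraction of target bases at each threshold. Because only positions inside the BED intervals are counted, the result is not diluted by the rest of the genome.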
Thanks for showing me another way to see coverage. I am going to try this too. I ran a Nextflow pipeline (Sarek) that included the BED file from the exome kit and obtained a better result. I also supplied the BED file (intervals) to Qualimap to see the coverage and fraction of reference covered. I got this:
What do you think? Is this a good coverage?