Hello everyone,
I am working with human WGS data at 30X depth and preprocessing the FASTQ files before alignment. The sequencing platform is NovaSeq X+ and the adapters are TruSeq. After some research, I found that Fastp (https://github.com/OpenGene/fastp) looks like a convenient tool for this task.
What I Did: I tested Fastp on one sample for trimming and quality filtering of reads. Below are the raw FASTQC results before preprocessing:
**FASTQC Summary for Raw Reads:** \
Read 1:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
PASS Sequence Length Distribution\
PASS Sequence Duplication Levels\
PASS Overrepresented sequences\
PASS Adapter Content\
Read 2:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
PASS Sequence Length Distribution\
PASS Sequence Duplication Levels\
WARN Overrepresented sequences\
PASS Adapter Content\
**FASTQC Summary After Preprocessing:** \
Processed Read 1:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
WARN Sequence Length Distribution\
PASS Sequence Duplication Levels\
PASS Overrepresented sequences\
Processed Read 2:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
WARN Sequence Length Distribution\
PASS Sequence Duplication Levels\
PASS Overrepresented sequences\
PASS Adapter Content\
**Fastp Command Used**:\
fastp \
-i /path/to/R1.fastq.gz \
-I /path/to/R2.fastq.gz \
-o /path/to/out1.fastq.gz \
-O /path/to/out2.fastq.gz \
--adapter_fasta TruSeq_adapter.fasta \
-g \
--poly_g_min_len 8 \
--qualified_quality_phred 15 \
--unqualified_percent_limit 30 \
--n_base_limit 3 \
--average_qual 20 \
--thread 8
The Overrepresented sequences warning for Read 2 was resolved after preprocessing. However, a Sequence Length Distribution warning appeared for both reads. FastQC raises this warning whenever the reads are not all the same length, so I believe it is a side effect of trimming (poly-G in particular) and is expected with the parameters I used.
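To check whether the length warning really comes from a modest amount of trimming, the read lengths in the processed file can be tabulated directly. The following is only a minimal sketch: `read_length_hist` is a hypothetical helper name, and it assumes a gzip-compressed FASTQ with exactly four lines per record (the usual Illumina layout).

```shell
# Hypothetical helper: print a "length<TAB>count" table for a gzipped FASTQ.
# Assumes 4-line FASTQ records, so the sequence is line 2 of every group of 4.
read_length_hist() {
    gzip -dc "$1" \
        | awk 'NR % 4 == 2 { n[length($0)]++ }
               END { for (len in n) printf "%d\t%d\n", len, n[len] }' \
        | sort -n
}

# Example call (path taken from the command above):
# read_length_hist /path/to/out1.fastq.gz
```

If nearly all reads remain at full length and only a small tail is shorter, that would be consistent with the warning being a side effect of trimming rather than a problem with the data.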
Questions:
- Should I keep the current Fastp options, or do you recommend adjustments?
- Will the trimming of poly-G regions or the Sequence Length Distribution warning affect downstream steps like duplicate marking?
PS: The final goal of this work is to search for germline variants.
Thank you for your guidance!
It is possible that your data has the phantom-call problem described in the thread "New Illumina error mode, new BBTools release (39.09) to deal with it". Please read that thread in full. You may want to try the tool it describes on your data.