Question

Fastp Trimming for WGS

23 days ago

j.k3096 • 0

Hello everyone,

I am working with human WGS data at 30X depth and preprocessing the FASTQ files for alignment. The sequencing platform is NovaSeq X+, and the adapters are TruSeq. After some research, I found that fastp (https://github.com/OpenGene/fastp) is a convenient tool for this task.

What I Did: I tested Fastp on one sample for trimming and quality filtering of reads. Below are the raw FASTQC results before preprocessing:

**FASTQC Summary for Raw Reads:** \
Read 1:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
PASS Sequence Length Distribution\
PASS Sequence Duplication Levels\
PASS Overrepresented sequences\
PASS Adapter Content

Read 2:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
PASS Sequence Length Distribution\
PASS Sequence Duplication Levels\
WARN Overrepresented sequences\
PASS Adapter Content


**FASTQC Summary After Preprocessing:** \
Processed Read 1:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
WARN Sequence Length Distribution\
PASS Sequence Duplication Levels\
PASS Overrepresented sequences

Processed Read 2:\
PASS Basic Statistics\
PASS Per base sequence quality\
WARN Per tile sequence quality\
PASS Per sequence quality scores\
PASS Per base sequence content\
WARN Per sequence GC content\
PASS Per base N content\
WARN Sequence Length Distribution\
PASS Sequence Duplication Levels\
PASS Overrepresented sequences\
PASS Adapter Content

**Fastp Command Used**:\
fastp \
  -i /path/to/R1.fastq.gz \
  -I /path/to/R2.fastq.gz \
  -o /path/to/out1.fastq.gz \
  -O /path/to/out2.fastq.gz \
  --adapter_fasta TruSeq_adapter.fasta \
  -g \
  --poly_g_min_len 8 \
  --qualified_quality_phred 15 \
  --unqualified_percent_limit 30 \
  --n_base_limit 3 \
  --average_qual 20 \
  --thread 8

The overrepresented sequences warning for Read 2 was resolved after preprocessing. However, a Sequence Length Distribution warning appeared in both reads after preprocessing. I believe this is due to poly-G trimming, which is expected with the parameters I used.
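One rough way to check the poly-G explanation is to count how many raw reads end in a run of at least 8 Gs, matching the `--poly_g_min_len 8` setting above. The sketch below is only an approximation: the function name `count_polyg_tails` is hypothetical, it assumes standard 4-line FASTQ records on stdin, and it uses an exact match, whereas fastp's own poly-G detection may behave differently.

```shell
#!/bin/sh
# Sketch: count reads in a FASTQ stream (stdin) whose 3' end is a run of >= 8 G.
# Assumes 4-line FASTQ records; exact-match approximation, not fastp's algorithm.
count_polyg_tails() {
    awk 'NR % 4 == 2 { total++; if ($0 ~ /GGGGGGGG$/) polyg++ }
         END { printf "%d\t%d\n", polyg, total }'
}

# Tiny inline example: one read with an 8 bp G tail, one read without.
printf '@r1\nACGTGGGGGGGG\n+\nIIIIIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' | count_polyg_tails
```

The output is two tab-separated numbers: reads with a poly-G tail, then total reads. On real data this could be run as `gzip -dc /path/to/R2.fastq.gz | count_polyg_tails`.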

Questions:

1. Should I keep the current Fastp options, or do you recommend adjustments?
2. Will the trimming of poly-G regions or the Sequence Length Distribution warning affect downstream steps like duplicate marking?

PS: The final goal of this work is to search for germline variants.

Thank you for your guidance!

updated 22 days ago by GenoMax 148k • written 23 days ago by j.k3096 • 0

> The sequencing platform used is NovaSeq X+

Your data may be affected by the phantom-call problem described in this thread: "New Illumina error mode, new BBTools release (39.09) to deal with it". Please read the thread in full. You may want to try the tool discussed there on your data.

22 days ago by GenoMax 148k

score 2 · Accepted Answer · 2024-11-29

For trimming paired-end reads, I prefer fastp's default parameters, which work well. I don't even specify a FASTA file with adapters, because for paired-end reads fastp can detect adapters without knowing their sequence. The parameters you used are also fine, provided that TruSeq_adapter.fasta contains the correct adapter sequences.
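As a minimal sketch of the default-parameter approach, the command reduces to the input and output files plus a thread count, with no `--adapter_fasta` and no explicit filtering thresholds. The wrapper function `fastp_defaults`, the `DRY_RUN` switch, and the sample file names below are hypothetical and exist only so the command can be inspected before it is run.

```shell
#!/bin/sh
# Sketch: run fastp on one read pair with default trimming and filtering.
# fastp_defaults, DRY_RUN, and the file names are illustrative assumptions.
fastp_defaults() {
    # $1, $2: input R1/R2 FASTQ; $3, $4: output R1/R2 FASTQ
    set -- fastp -i "$1" -I "$2" -o "$3" -O "$4" --thread 8
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # print the command instead of running it
    else
        "$@"         # run fastp with default parameters
    fi
}

# Print the command for one sample without running it.
DRY_RUN=1
fastp_defaults sample_R1.fastq.gz sample_R2.fastq.gz out_R1.fastq.gz out_R2.fastq.gz
```

Setting `DRY_RUN=0` (or leaving it unset) runs fastp itself, which requires fastp to be installed and on the `PATH`.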

FastQC's "Sequence Length Distribution" module raises a warning if the reads are not all the same length (see https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/7%20Sequence%20Length%20Distribution.html). This is expected for trimmed reads, even without poly-G trimming, so it is nothing to worry about.
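To see what the warning reflects, the read lengths in a trimmed file can be tabulated directly. This is a rough sketch, not part of FastQC or fastp: the function name `read_length_hist` is hypothetical, and it assumes standard 4-line FASTQ records arriving uncompressed on stdin.

```shell
#!/bin/sh
# Sketch: tabulate read lengths from an uncompressed FASTQ stream on stdin.
# Assumes 4-line FASTQ records, so the sequence is line 2 of every 4 lines.
read_length_hist() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) printf "%d\t%d\n", len, count[len] }' |
        sort -n
}

# Tiny inline example: two 4 bp reads and one 3 bp read.
printf '@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n@r3\nACGT\n+\nIIII\n' | read_length_hist
```

Each output line is a read length and the number of reads with that length, separated by a tab. For a trimmed file this could be run as `gzip -dc /path/to/out1.fastq.gz | read_length_hist`; more than one output line means the reads are no longer all the same length, which is what triggers the warning.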