Nanopore data filtering using fastp
0
0
Entering edit mode
4 weeks ago

Hello, I'm running the following command in my terminal:

fastp -i user/nanopore/inputs/ -o user/nanopore/outputs/ -A -G --qualified_quality_phred 10 --reads_to_process 5000 --thread 8

Although I'm specifying the number of reads to keep as 5000, after running my pipeline, I find that my fastq files have different numbers of reads such as 4965, 4778, and 4653, but never exactly 5000. Does anyone have an idea on how to fix this problem, please?

fastp nanopore • 762 views
ADD COMMENT
0
Entering edit mode

How many reads do you have in each file to start with? And do you know how many reads pass your filter threshold? Maybe no more reads pass the filter than are being emitted.

You could remove the --reads_to_process flag and see how many reads you get, to check whether this function is indeed broken on your system. Then downsample with another tool such as seqkit.

ADD REPLY
0
Entering edit mode

Here is an example of the summary files for my different fastq files. As you can see, I have more than 5000 reads with --qualified_quality_phred > 10. [screenshot of summary table]

ADD REPLY
2
Entering edit mode

My understanding is that you want to downsample your sample and output 5000 reads that meet the filtering criteria, or the longest 5000 reads that meet the criteria. However, the '--reads_to_process 5000' parameter of fastp will only read the first 5000 reads, filter them, and output those that meet the filtering criteria.
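The difference can be illustrated with a toy sketch in Python (hypothetical read list and helper names; the `--reads_to_process` behaviour is paraphrased here, not fastp's actual code):

```python
import random

# Toy reads: (name, mean_quality). 20 reads, half above Q10.
reads = [(f"read{i}", 15 if i % 2 == 0 else 5) for i in range(20)]

def head_then_filter(reads, n, min_q=10):
    """What --reads_to_process does: take the FIRST n reads, THEN filter them."""
    return [r for r in reads[:n] if r[1] >= min_q]

def filter_then_sample(reads, n, min_q=10, seed=100):
    """What the poster wants: filter first, then randomly sample n survivors."""
    passed = [r for r in reads if r[1] >= min_q]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(passed, min(n, len(passed)))

print(len(head_then_filter(reads, 10)))   # only the half that pass the filter: 5
print(len(filter_then_sample(reads, 5)))  # exactly 5, since enough reads pass
```

So with `--reads_to_process 5000`, the output can only ever be 5000 reads *minus* whatever the filter discards, which matches the 4965/4778/4653 counts observed.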

ADD REPLY
0
Entering edit mode

What I want is to randomly retrieve 5000 reads from each Fastq file. At the end of this process, I aim to have 11 Fastq files, each containing 5000 reads with a quality score greater than 10.

Do you know how to do that, please? Which tool should I use?

ADD REPLY
0
Entering edit mode

My suggestion would be to filter the fastqs without the --reads_to_process flag. Then you can downsample afterwards with a tool like seqkit, using the split2 command.

ADD REPLY
0
Entering edit mode

Thank you for this advice, but reading the documentation of split2, I found that it generates multiple output files from a single fastq file. My goal is to select 5000 random reads from each fastq file, producing one output file per input file, not several.

ADD REPLY
0
Entering edit mode

Ah, sorry, I was thinking of the wrong command. You can either use seqtk sample or seqkit sample. I would recommend the former if your files are particularly large.
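Conceptually, both tools just pick random records out of the file; a minimal Python sketch of the idea (hypothetical `sample_fastq` helper, plain-text FASTQ only, assuming well-formed 4-line records):

```python
import random

def sample_fastq(lines, n, seed=100):
    """Randomly pick n records from a plain-text FASTQ (4 lines per record)."""
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    rng = random.Random(seed)  # fixed seed -> reproducible, like seqtk's -s
    return [line for rec in rng.sample(records, n) for line in rec]

# Tiny made-up FASTQ with 3 reads
fastq = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
).splitlines()
sampled = sample_fastq(fastq, 2)
print(len(sampled) // 4)  # 2 records kept
```

seqtk does this in a streaming fashion without holding records in memory, which is why it is the better choice for large files.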

ADD REPLY
0
Entering edit mode

I tested seqtk sample but it generates a weird result. Can I share my code and results with you, please?

ADD REPLY
0
Entering edit mode

Share code and an example please.

ADD REPLY
0
Entering edit mode

Thank you, this is my code:

    import sys
    import os
    import subprocess

    class FastpFiltering:
        def __init__(self, input_folder, fastp_output_folder, seqtk_output_folder,
                     num_threads, qualified_quality_phred, reads_to_process):
            self.input_folder = input_folder
            self.fastp_output_folder = fastp_output_folder
            self.seqtk_output_folder = seqtk_output_folder
            self.num_threads = num_threads
            self.qualified_quality_phred = qualified_quality_phred
            self.reads_to_process = reads_to_process

        def filter_reads(self):
            # Create the output folders if they don't exist
            os.makedirs(self.fastp_output_folder, exist_ok=True)
            os.makedirs(self.seqtk_output_folder, exist_ok=True)

            # Get a list of all FASTQ files in the input folder
            fastq_files = [f for f in os.listdir(self.input_folder)
                           if f.endswith('.fastq') or f.endswith('.fastq.gz')]

            # Run fastp and seqtk for each FASTQ file
            for fastq_file in fastq_files:
                input_path = os.path.join(self.input_folder, fastq_file)
                fastp_output_path = os.path.join(self.fastp_output_folder, fastq_file)
                # The seqtk output is gzipped below, so make sure its name ends in .gz
                seqtk_output_path = os.path.join(self.seqtk_output_folder, fastq_file)
                if not seqtk_output_path.endswith('.gz'):
                    seqtk_output_path += '.gz'

                # Run fastp (quality filtering only; adapter and polyG trimming disabled)
                fastp_cmd = (f'fastp -i {input_path} -o {fastp_output_path} -A -G '
                             f'--qualified_quality_phred {self.qualified_quality_phred} '
                             f'--thread {self.num_threads}')
                subprocess.run(fastp_cmd, shell=True, check=True)

                # Run seqtk sample to randomly retrieve reads
                seqtk_cmd = (f'seqtk sample -s100 {fastp_output_path} '
                             f'{self.reads_to_process} | gzip > {seqtk_output_path}')
                subprocess.run(seqtk_cmd, shell=True, check=True)

    # Example usage
    input_folder = sys.argv[1]
    fastp_output_folder = sys.argv[2]   # Output folder for results generated by fastp
    seqtk_output_folder = sys.argv[3]   # Output folder for results generated by seqtk
    num_threads = int(sys.argv[4])
    qualified_quality_phred = int(sys.argv[5])
    reads_to_process = int(sys.argv[6])

    fastp_filtering = FastpFiltering(input_folder, fastp_output_folder,
                                     seqtk_output_folder, num_threads,
                                     qualified_quality_phred, reads_to_process)
    fastp_filtering.filter_reads()

This is how the summary statistics look for the fastq files generated by the fastp command: [screenshot of fastp summary table]

And this is how they look after the seqtk sample -s100 command, which takes the fastp output as its input. I chose 10 as the quality threshold and 5000 as the number of reads.

  • If we take the example of the first fastq file, SRRxxxxxxx8: the number of reads that came out of fastp with quality > 10 is 19693. Logically, I would expect the second table to show 5000 reads with quality > 10, which is not the case. Is that normal? NB: I used NanoPlot to generate these statistics.

[screenshot of NanoPlot summary table]

ADD REPLY
0
Entering edit mode

You can also use reformat.sh from BBMap suite. Look at sampling options.

ADD REPLY
0
Entering edit mode

Thank you!

ADD REPLY