emilydolivo97 · 8 months ago
Hello, I'm running the following command in my terminal:
fastp -i user/nanopore/inputs/ -o user/nanopore/outputs/ -A -G --qualified_quality_phred 10 --reads_to_process 5000 --thread 8
Although I'm specifying the number of reads to keep as 5000, after running my pipeline, I find that my fastq files have different numbers of reads such as 4965, 4778, and 4653, but never exactly 5000. Does anyone have an idea on how to fix this problem, please?
How many reads do you have in each file to start with? And do you know how many reads pass your filter threshold? Maybe you don't have more reads passing the filter than are being written out.
You could remove the --reads_to_process flag and see how many reads you get, to check whether this option is indeed broken on your system. Then downsample with another tool like seqkit.

Here is an example of the summary file for my different fastq files. As you can see, I have more than 5000 reads above the --qualified_quality_phred threshold of 10.
My understanding is that you want to downsample your sample and output 5000 reads that meet the filtering criteria, or the longest 5000 reads that meet the criteria. However, the '--reads_to_process 5000' parameter of fastp will only read the first 5000 reads, filter them, and output those that meet the filtering criteria.
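In other words, with a command like this (file names here are just placeholders):

fastp -i input.fastq -o output.fastq -A -G --qualified_quality_phred 10 --reads_to_process 5000

fastp takes only the first 5000 reads of input.fastq and then discards the ones that fail the quality filter, so the output ends up with slightly fewer than 5000 reads. To get exactly 5000 reads that pass the filter, you need to filter the whole file first and subsample afterwards.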
What I want is to randomly retrieve 5000 reads from each Fastq file. At the end of this process, I aim to have 11 Fastq files, each containing 5000 reads with a quality score greater than 10.
Do you know how to do it, please? Which tool should I use?
My suggestion would be to filter the fastqs without the --reads_to_process flag. Then you can downsample afterwards with a tool like seqkit, using the split2 command.

Thank you for this advice, but when reading the documentation of split2 I found that it generates multiple files from a single fastq file. My goal is to select 5000 random reads from one fastq file, not to turn one fastq file into several files.
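For example, if I read the seqkit documentation correctly (the file name is just a placeholder), a command like

seqkit split2 --by-size 5000 filtered.fastq

would write filtered.part_001.fastq, filtered.part_002.fastq, and so on, that is, several chunks of 5000 reads each rather than one random subset of 5000 reads.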
Ah, sorry, I was thinking of the wrong command. You can use either seqtk sample or seqkit sample. I would recommend the former if your files are particularly large.

I tested seqtk sample but I get a weird result. Can I share my code and results with you, please?
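(For reference, my understanding from the seqtk documentation is that the call should look something like this, with placeholder file names:

seqtk sample -s100 filtered.fastq 5000 > sampled_5000.fastq

where -s sets the random seed, so rerunning with the same seed reproduces the same subset.)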
Share code and an example please.
Thank you, this is my code:
import sys
import os
import subprocess

class FastpFiltering:
    def __init__(self, input_folder, fastp_output_folder, seqtk_output_folder,
                 num_threads, qualified_quality_phred, reads_to_process):
        self.input_folder = input_folder
        self.fastp_output_folder = fastp_output_folder
        self.seqtk_output_folder = seqtk_output_folder
        self.num_threads = num_threads
        self.qualified_quality_phred = qualified_quality_phred
        self.reads_to_process = reads_to_process

    def filter_reads(self):
        # NOTE: this method body was not shown in the post; the lines below are a
        # reconstruction of the filter-then-subsample approach discussed in this thread
        # (fastp for quality filtering, then seqtk sample for random downsampling).
        for fastq in os.listdir(self.input_folder):
            filtered = os.path.join(self.fastp_output_folder, fastq)
            sampled = os.path.join(self.seqtk_output_folder, fastq)
            # Quality-filter the whole file with fastp (no --reads_to_process).
            subprocess.run(["fastp", "-i", os.path.join(self.input_folder, fastq),
                            "-o", filtered, "-A", "-G",
                            "--qualified_quality_phred", str(self.qualified_quality_phred),
                            "--thread", str(self.num_threads)], check=True)
            # Randomly subsample the filtered reads with seqtk (fixed seed for reproducibility).
            with open(sampled, "w") as out:
                subprocess.run(["seqtk", "sample", "-s100", filtered,
                                str(self.reads_to_process)], stdout=out, check=True)

# Example usage
input_folder = sys.argv[1]
fastp_output_folder = sys.argv[2]   # Output folder for results generated by Fastp
seqtk_output_folder = sys.argv[3]   # Output folder for results generated by Seqtk
num_threads = int(sys.argv[4])
qualified_quality_phred = int(sys.argv[5])
reads_to_process = int(sys.argv[6])

fastp_filtering = FastpFiltering(input_folder, fastp_output_folder, seqtk_output_folder,
                                 num_threads, qualified_quality_phred, reads_to_process)
fastp_filtering.filter_reads()
This is how the summary stats look for the fastq files generated by the fastp command:

This is how they look after the seqtk sample -s100 command. As you can see, seqtk sample -s100 takes as input the results generated by fastp. I chose 10 as the quality threshold and 5000 as the number of reads.
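(A quick way to double-check the read counts of the sampled files, assuming seqkit is installed and using a placeholder folder name, is:

seqkit stats sampled/*.fastq

which reports the number of reads and basic length statistics per file.)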
You can also use reformat.sh from the BBMap suite. Look at the sampling options.

Thank you!!
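For reference, a reformat.sh sampling call would look something like this (untested here; file names are placeholders):

reformat.sh in=filtered.fastq out=sampled_5000.fastq samplereadstarget=5000 sampleseed=100

samplereadstarget= requests an exact number of output reads and sampleseed= makes the sampling reproducible.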