Entering edit mode
7 months ago
marco.barr
▴
150
Hi everyone, I'm attempting to downsample a fastq file to retain only 20% of the reads. I'm using seqtk with the command: seqtk sample -s 11000 file.fastq 0.2 > downsample_file.fastq
. However, it seems to be doing the opposite, filtering out 20% instead. Did I make an error in the command? Should I use 0.8 instead? Thank you for your assistance
It should not do that. Wild suggestion, but try using something like
100
for the seed value - maybe the large seed value is causing some sort of unexpected bug. I know it doesn't make sense but give it a shot.I followed your advice and it seems that I'm getting results comparable to what I was getting before. Upon checking with
wc -l
on the original R1.fastq file, I have 298949 lines, while in the downsampled file I even have more lines, 584432. How is this possible? Should I use reformat.sh since it's a paired-end file? Thanks for the adviceThe file being PE should not matter. I'd recommend you open an issue on the seqtk github repo as this is starting to look like some sort of niche bug.
I understood where the problem lies. I discussed with my wet lab colleagues (this is part of a bioinformatician's job...) and by showing them the results, we realized that they had made errors in DNA extraction, which affected everything else despite the fastq appearing 'clean'. Unfortunately, the saying 'garbage in, garbage out' always holds true... Thank you, Ram, for your advice.
maybe try using seqkit instead?
I think the problem might be specifying proportion and fixed numbers in the same command line you used. Instead, please try this-
OR if you want a fixed number of reads
OP is not specifying both. The
11000
is the seed value, not the number of reads.I'm moving this to a comment for now.