marco.barr · 8 months ago
Hi everyone, I'm attempting to downsample a FASTQ file to retain only 20% of the reads. I'm using seqtk with the command:

seqtk sample -s 11000 file.fastq 0.2 > downsample_file.fastq

However, it seems to be doing the opposite, filtering out 20% instead of keeping it. Did I make an error in the command? Should I use 0.8 instead? Thank you for your assistance.
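For what it's worth, `seqtk sample FILE 0.2` is documented to *keep* roughly 20% of the reads, not discard them. A seed-independent way to see what "keep 20%" should look like is a deterministic subsample with awk (a toy sketch, not equivalent to seqtk's random sampling; the file names are made up):

```shell
# Build a tiny toy FASTQ of 100 reads (4 lines per read).
for i in $(seq 1 100); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > toy.fastq

# Keep every 5th read (read index = int((NR-1)/4)), i.e. exactly 20%.
awk 'int((NR - 1) / 4) % 5 == 0' toy.fastq > sub.fastq

echo "input reads=$(($(wc -l < toy.fastq) / 4))"   # input reads=100
echo "kept reads=$(($(wc -l < sub.fastq) / 4))"    # kept reads=20
```

If seqtk were really inverting the fraction, the kept count would be near 80 instead of 20, so comparing read counts this way settles the question quickly.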
It should not do that. Wild suggestion, but try using something like 100 for the seed value - maybe the large seed value is causing some sort of unexpected bug. I know it doesn't make sense, but give it a shot.

I followed your advice, and I'm getting results comparable to what I was getting before. Checking with wc -l on the original R1.fastq file, I count 298949 lines, while the downsampled file has even more lines: 584432. How is this possible? Should I use reformat.sh since it's a paired-end file? Thanks for the advice.

The file being PE should not matter. I'd recommend you open an issue on the seqtk GitHub repo, as this is starting to look like some sort of niche bug.
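As a quick sanity check on those wc -l numbers: a FASTQ record is exactly 4 lines, so the read count is lines / 4, and a nonzero remainder means the file is truncated or not plain 4-line FASTQ. Notably, 298949 is not divisible by 4, which hints that something was off with the original file (or the wrong file was counted) even before sampling:

```shell
# Reads in a FASTQ = lines / 4; a nonzero remainder flags a malformed file.
for lines in 298949 584432; do
  echo "lines=$lines reads=$((lines / 4)) remainder=$((lines % 4))"
done
# prints:
# lines=298949 reads=74737 remainder=1
# lines=584432 reads=146108 remainder=0
```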
I figured out where the problem lies. I discussed it with my wet-lab colleagues (this is part of a bioinformatician's job...) and, on showing them the results, we realized they had made errors during DNA extraction, which affected everything downstream despite the FASTQ appearing 'clean'. Unfortunately, the saying 'garbage in, garbage out' always holds true... Thank you, Ram, for your advice.
Maybe try using seqkit instead?
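For reference, the seqkit equivalent would be something like the following (a sketch; seqkit sample takes -p for the proportion and -s for the seed, and the file names here are made up):

```shell
# Keep ~20% of reads with seqkit; -s sets the RNG seed.
seqkit sample -p 0.2 -s 100 file.fastq -o downsample_file.fastq
```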
I think the problem might be that you specified a proportion and a fixed number of reads in the same command line. Instead, please try this:

OR, if you want a fixed number of reads:
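The code blocks for this suggestion do not appear to have survived; standard seqtk invocations for the two cases would look something like this (a sketch - the seed 100 and the read count 10000 are placeholders, and the file names are made up):

```shell
# Proportion: keep ~20% of reads. Reuse the same -s seed on R1 and R2
# so paired files stay in sync.
seqtk sample -s100 R1.fastq 0.2 > sub_R1.fastq
seqtk sample -s100 R2.fastq 0.2 > sub_R2.fastq

# OR a fixed number of reads (10000 is a placeholder count):
seqtk sample -s100 R1.fastq 10000 > sub_R1.fastq
```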
OP is not specifying both. The 11000 is the seed value, not the number of reads. I'm moving this to a comment for now.