8.3 years ago
BioGeek
I would like to split my paired-end (PE) reads in half and keep both halves in two different files. Is there a quick way to do this?
Is this a random sampling of 50% of the reads, or a 'down-the-middle' split?
If all you want is the first 50% of the reads in the file, without random sampling, you can:
(1) count the number of reads in the FASTQ file:
echo $(( $(wc -l < your.fastq) / 4 ))
(2) divide the number of reads by 2,
(3) multiply this number by 4 to get the number of lines you need,
(4) use head -n #lines to get the first 50% of the sequences, and
(5) use tail -n +N, with N = #lines + 1, to get the bottom 50%.
You can use the seqtk sample function.
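The head/tail split described above can be sketched as a small shell function. This is only a sketch: the function name split_fastq_halves and the file names are made up for illustration, and it assumes every record takes exactly four lines (no wrapped sequences). With an odd number of reads, the extra read goes to the second file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split_fastq_halves INPUT FIRST_HALF SECOND_HALF
# Assumes a 4-line-per-record FASTQ file.
split_fastq_halves() {
    local in="$1" first="$2" second="$3"
    local lines reads head_lines
    lines=$(wc -l < "$in")              # total number of lines
    reads=$(( lines / 4 ))              # 4 lines per read
    head_lines=$(( reads / 2 * 4 ))     # lines that make up the first half
    head -n "$head_lines" "$in" > "$first"
    # tail -n +K starts printing at line K, so this writes everything after the first half
    tail -n +"$(( head_lines + 1 ))" "$in" > "$second"
}
```

For paired-end data, running it once on each mate file, e.g. `split_fastq_halves reads_1.fastq a_1.fastq b_1.fastq` and the same for `reads_2.fastq`, should keep the pairs in sync as long as both files list the reads in the same order.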
If you know Python, you can use HTSeq for subsampling, but to get the other half you would have to follow @genomax2's suggestion of finding, by header, the reads that didn't end up in your subsampled files. Here's the example on seqanswers.
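If a random split is wanted and both halves should be kept, one alternative to subsampling and then looking up the leftover headers is to write both halves in a single pass. The sketch below does this with plain awk rather than HTSeq or seqtk; the function name random_split_fastq, the default seed, and the file names are assumptions for illustration. It assumes 4-line records and picks exactly half of the reads (rounded down) by taking each record with probability (still needed)/(still remaining).

```shell
#!/usr/bin/env bash
# Hypothetical helper: random_split_fastq INPUT PICKED REST [SEED]
# Assumes a 4-line-per-record FASTQ file.
random_split_fastq() {
    local in="$1" picked="$2" rest="$3" seed="${4:-42}"
    local reads
    reads=$(( $(wc -l < "$in") / 4 ))
    awk -v n="$reads" -v k="$(( reads / 2 ))" -v seed="$seed" \
        -v picked="$picked" -v rest="$rest" '
        BEGIN {
            srand(seed)               # fixed seed makes the split reproducible
            printf "" > picked        # create/truncate both outputs up front
            printf "" > rest
        }
        NR % 4 == 1 {
            # Decide once per record, on its header line:
            # k reads are still needed out of the n reads not yet seen.
            take = (rand() * n < k)
            if (take) k--
            n--
        }
        { print > (take ? picked : rest) }
    ' "$in"
}
```

Because the decisions depend only on the seed and the number of reads, calling it with the same seed on both mate files (same awk, same read count and order) should select the same reads from each, which is the same idea as reusing a seed with seqtk.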