Reduce the number of PE reads by half
8.3 years ago
BioGeek ▴ 170

I would like to reduce the number of PE reads by half and keep both halves in two separate sets of files. Is there a quick way to do this?

Assembly NGS PE Reads • 1.7k views

Do you want a random sample of 50% of the reads, or a 'down-the-middle' split?


If all you want is the first 50% of the reads in the file, without random sampling, you can: (1) count the number of reads in the FASTQ file: echo $(( $(wc -l < your.fastq) / 4 )); (2) divide the number of reads by 2; (3) multiply this number by 4 to get the number of lines you need; (4) use head -n <lines> to get the first 50% of the reads; and (5) use tail -n +<lines+1> to get the remaining 50%. Use the same line count for both the R1 and R2 files so that the pairs stay in sync.
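
The counting and head/tail steps above can also be sketched in Python. This is only a minimal illustration, assuming every FASTQ record is exactly 4 lines long; the function names (open_text, count_reads, split_in_half) and the file names in the usage note are made up for the example and do not come from any tool mentioned in this thread.

```python
import gzip
from itertools import islice


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def count_reads(path):
    """Count FASTQ records, assuming exactly 4 lines per record."""
    with open_text(path) as handle:
        return sum(1 for _ in handle) // 4


def split_in_half(path, first_out, second_out):
    """Write the first half of the records to first_out and the rest to second_out.

    Returns the number of records written to first_out.
    """
    n_first = count_reads(path) // 2
    with open_text(path) as src, \
            open_text(first_out, "wt") as first, \
            open_text(second_out, "wt") as second:
        # Equivalent of head: the first n_first records (4 lines each).
        first.writelines(islice(src, n_first * 4))
        # Equivalent of tail: everything that is left in the file.
        second.writelines(src)
    return n_first
```

Calling split_in_half("read1.fq", "half1_R1.fq", "half2_R1.fq") and then the same for read2.fq should keep the mates together, provided both input files hold the same reads in the same order.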


You can use seqtk's sample function. Use the same seed (-s) for both files so that the pairs stay in sync.


If you know Python, you can use HTSeq for subsampling, but to get the other half you would have to follow @genomax2's suggestion of finding, by header, the reads that didn't end up in your subsampled files. Here's the example on SEQanswers.

8.3 years ago
GenoMax 147k

reformat.sh from BBMap.

reformat.sh in1=read1.fq.gz in2=read2.fq.gz out1=new1.fq.gz out2=new2.fq.gz samplerate=0.5

Thanks for your reply. I guess it extracts the reads "randomly". Now, how do I extract the remaining 50%?


I think you will need to grab the IDs of the reads that were selected in the first round and then use filterbyname.sh from BBMap to get the rest in separate files.
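
As an alternative to sampling first and filtering by name afterwards, both complementary subsets can be written in a single pass. The sketch below is not how reformat.sh or filterbyname.sh work internally. It is a hypothetical stand-alone illustration that assumes 4-line FASTQ records and that R1 and R2 list the same reads in the same order. The names split_pairs, records and open_text are invented for the example.

```python
import gzip
import random
from itertools import islice


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes 4-line records)."""
    while True:
        rec = list(islice(handle, 4))
        if not rec:
            return
        yield rec


def split_pairs(r1, r2, out_a, out_b, fraction=0.5, seed=42):
    """Randomly assign each read pair to subset A or subset B.

    out_a and out_b are (R1 path, R2 path) tuples. Each pair goes to A with
    probability `fraction`, so the split is approximately, not exactly, 50/50.
    Returns (pairs written to A, pairs written to B).
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    n_a = n_b = 0
    with open_text(r1) as f1, open_text(r2) as f2, \
            open_text(out_a[0], "wt") as a1, open_text(out_a[1], "wt") as a2, \
            open_text(out_b[0], "wt") as b1, open_text(out_b[1], "wt") as b2:
        # One random draw per pair keeps both mates in the same subset.
        for rec1, rec2 in zip(records(f1), records(f2)):
            if rng.random() < fraction:
                a1.writelines(rec1)
                a2.writelines(rec2)
                n_a += 1
            else:
                b1.writelines(rec1)
                b2.writelines(rec2)
                n_b += 1
    return n_a, n_b
```

For example, split_pairs("read1.fq.gz", "read2.fq.gz", ("a1.fq.gz", "a2.fq.gz"), ("b1.fq.gz", "b2.fq.gz")) would write roughly half of the pairs to the a files and the remainder to the b files. If an exact 50/50 split matters, a different scheme (for instance shuffling record indices) would be needed.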

8.3 years ago
igor 13k

A few options: Selecting Random Pairs From Fastq?
