How to randamly extract reads from a FASTQ file?
5
2
Entering edit mode
9.7 years ago
biolab ★ 1.4k

Hi everyone,

I want to randomly extract some numbers of reads from a fastq file, and save as a new smaller fastq file. Is there any perl script or shell command to achieve this?

Thank you very much!

fastq • 12k views
ADD COMMENT
0
Entering edit mode

srand is a perl function that can be used for this kind of tasks. You can write your own script using this function.

ADD REPLY
9
Entering edit mode
9.7 years ago
rtliu ★ 2.2k

How To Obtain And Install Seqtk

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
ADD COMMENT
0
Entering edit mode

Thank you all for your helpful comments!

ADD REPLY
2
Entering edit mode
9.7 years ago
5heikki 11k
grep "^@HWI" file.fq | shuf -n 10 | grep -A 3 -x -f - file.fq | grep -v "^--$"

Fetch all lines that begin with "$HWI" (you might have to change this), randomly select 10 from these, fetch the selected lines and 3 lines below them (fastq format = header + 3 lines, header + 3 lines, ..), remove "--" lines (artifact of grep).

ADD COMMENT
1
Entering edit mode

Hi- I think this is a nice solution. But is it practical? I suspect shuf reads in memory the whole input first. Also, isn't the second grep going to be very slow for, say, a fastq file of 100M reads and a sample of 1M?

ADD REPLY
1
Entering edit mode

It's practical in the sense that you will have a ton of time to browse biostars while waiting for the command to complete. You're right though, it's very, very slow for large files.

ADD REPLY
0
Entering edit mode

My answer in this thread suggests a tool I wrote (sample) that gets around the memory limitations in shuf. If the scale of your input gives you memory errors with shuf, you might look into sample. See: How to randamly extract reads from a FASTQ file?

The sample utility uses the same sampling technique as shuf, but while shuf reads the entire file into memory, sample stores the locations of starting points for lines (or groups of lines) and samples from those locations, instead.

ADD REPLY
1
Entering edit mode
9.7 years ago

You can do that with the BBMap package:

reformat.sh in=reads.fq out=subsampled.fq samplerate=0.1

It will keep pairs together. If the pairs are in 2 files, use in1 and in2, and optionally, out1 and out2. It's much faster than a perl script.

ADD COMMENT
0
Entering edit mode

Does # work here for two files? as in in=reads_R#.fq

ADD REPLY
2
Entering edit mode

Yes, you could also write this:

reformat.sh in=read_#.fq out=subsampled_#.fq samplerate=0.1

...which would use 2 input files, read_1.fq and read_2.fq, and generate 2 output files.

ADD REPLY
1
Entering edit mode
9.7 years ago

You can sample with or without replacement with sample using a --lines-per-offset value of 4 for FASTQ files:

$ sample --sample-size=${SAMPLE_SIZE} \
         --lines-per-offset=4 \
       [ --sample-without-replacement | --sample-with-replacement ] \
         reads.fq \
       > sample.fq

Add the --preserve-order option, if you want the sample to keep reads in the same order as in the original file.

ADD COMMENT
1
Entering edit mode
9.7 years ago

If your input fastq hasn't been sorted in any particular way (i.e. it comes from the sequencer as is), you can probably assume that the reads in it are already in random order. So just get your desired number of reads straight from the fastq file. Just skip the first few hundreds of thousands of reads since these are usually bad quality:

gunzip -c reads.fq.gz \
| tail -n+1000000 \
| head  -n 4000000 \
| gzip > sampled.fq.gz

Here input and output are gzipped, first 250,000 reads skipped, output 1,000,000 reads. For testing purposes this might be good enough, quick and no dependencies.

ADD COMMENT
1
Entering edit mode

The output of the sequencer is not in a random order: it follows the spatial progression of the imaging system on the slides. As you noted the first sequences are already bad quality (because they are from a corner ?). But bad quality areas also happen in other parts of the slide, in a much less predictable manner. So your method is likely to either over-represent low-quality reads if it hits such an area, or to under-represent them if it misses the area.

ADD REPLY
0
Entering edit mode

Sure, I meant random with respect to the genome sequence or sequence content. For critical analyses I wouldn't do what I suggest. As I said, it's a quick and easy way to get a sample of reads for testing, e.g. if a pipeline works smoothly and gives sensible results.

ADD REPLY

Login before adding your answer.

Traffic: 3766 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6