Selecting Random Pairs From Fastq?
12
30
Entering edit mode
13.7 years ago
Ketil 4.1k

I'd like to select a random subset of Illumina PE reads from a set of files. I realize this is pretty trivial to do, but I thought I'd ask if there's a simple program available to do so?

It's important that it retains both ends of each pair, and in the right order, and that it can work by streaming over the input sequences (obviously they don't fit in RAM).

Or do I just have to write it myself?

random fastq illumina sequence • 92k views
ADD COMMENT
4
Entering edit mode

Not sure I follow - are you saying a random sampling might introduce bias that wasn't there already? This seems to go against the definition of 'random'.

ADD REPLY
0
Entering edit mode

Do you expect your coverage to be even, otherwise you might get in hot water by introducing biases when randomly sampling over the whole genome/region: Downsampling Bam Files

ADD REPLY
63
Entering edit mode
12.6 years ago
lh3 33k

An old question. With seqtk, to sample a fixed number of reads:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

Remember to use the same random seed -s. It uses reservoir sampling, the same as R::yield(), which requires to keep 10000 records in memory in the above example. For sampling a large number of sequences, using a fraction is preferred as this method only keeps 1 record in memory. Here is an example:

seqtk sample -s100 read1.fq 0.1 > sub1.fq
seqtk sample -s100 read2.fq 0.1 > sub2.fq
ADD COMMENT
1
Entering edit mode

Easy to use and effective. Thanks a lot.

ADD REPLY
1
Entering edit mode

is seqtk sample using sampling by replacement or without replacement?

ADD REPLY
0
Entering edit mode

I think it is safe to assume that the sampling is without replacement.

This is because the name of the sequence is not changed moreover most aligners require that sequences that belong to different templates be named differently

ADD REPLY
0
Entering edit mode

I couldn't find much information about the seqtk sample function, and I was wondering if anyone knows what happens when you choose a seed value which is too large for your original number of reads, considering the target size you set for your subsample.

E.g. fastq file contains 15,000,000 reads, I want to sub set a group of 10,000,000 reads from those, using a seed value of 1,000 (because I have too many other files to process together, and some are way bigger, so I thought I could keep the seed value). Do you think it would start reading again from the beginning of the file when it reaches the end, causing duplication of reads?

I tested this and setqk generated the 10,000,000 reads subset, I fastqc'd the subset and the original files and there wasn't strong differences in duplication (even though it would have reached the end of the original file after only 15,000 reads).

ADD REPLY
0
Entering edit mode

Can't say about seqtk but for seqkit I confirm that seqkit does sampling without replacement. I also observed that the when I sample using proportion parameter (say --proportion 0.6) using different seeds for each replicate, the common reads between replicates amounts for up to 60% of total reads sampled.

ADD REPLY
0
Entering edit mode

I am having an issue using seqtk to subsample some illumina reads prior to assembly with MIRA. I got the following message after I tried to subsample 1,000,000 reads from an illumina fastq paired end file:

MIRA found duplicate read names in your data (see log above for more info). 
*                                                                         *
*This should never, never be!

When I subsample with seqtk is there a chance that it will choose the same sequence twice in a subsampling? Just FYI, I am subsampling 1,000,000 reads from an 8,000,000 read paired end file set.

ADD REPLY
0
Entering edit mode

It sounds like you're passing in a file that has interleaved reads. Is that true? In that case, you might get randomly get a forward and reverse pair right next to each other.

If so, you need to use seqtk on both the forward and reverse files independently, then interleave them.

ADD REPLY
0
Entering edit mode

If I want to subsample 10m, 20m and 50m reads from a 70m reads file of fwd and rev, should I use different seed values for different read depts? I want to have same read pairs in fwd and rev, but random read pairs in different read depths. Also does it accept gzipped files?

ADD REPLY
0
Entering edit mode

I'm hoping you've solved your problem in the last four months, but for future reference, I was wondering the same thing and YES you do have to change the seed or else it will generate the same 'random' sample.

ADD REPLY
0
Entering edit mode

are the sub sampled reads the same in file1 and file2? if not, is there a way to take from the first sub sampled file reads, the paired read of the second one?

ADD REPLY
1
Entering edit mode

Yes, as long as you use the same -s.

ADD REPLY
0
Entering edit mode

What is the function of -s?

ADD REPLY
1
Entering edit mode

That's random seed. Same seed ensures same "random" results (useful for reproducibility).

ADD REPLY
0
Entering edit mode

Thank you. But what does the number after -s in command line represent?

ADD REPLY
0
Entering edit mode

he told you that is the random seed. Something that is used in any initial simulation step. In R is set.seed for example before perform simulation..

ADD REPLY
0
Entering edit mode

here set -s is 100, then the random was selected from what rang? If the number larger or smaller , there was difference?

ADD REPLY
0
Entering edit mode

Hi! I have two fastq files, each with 2522553 reads in them. I want to do a sub sampling to have only 46485 reads in each fastq file, so I was thinking of using seqtk with the following commands for each fastq file:

seqtk sample -s100 cp_trim_1P.fq 46485 > cp1.100cov.fq

seqtk sample -s100 cp_trim_2P.fq 46485 > cp2.100cov.fq

My doubt is if the random seed number used (-s 100) is the correct for the number of reads I want to sub sample.

Thanks in advance :)

ADD REPLY
1
Entering edit mode

the seed shouldn't influence how many reads are returned; it influences which reads are selected rather than how many are selected (provided you've stated how many reads you want)

ADD REPLY
1
Entering edit mode

Thank you for your answer. I thought that the seed number have some influence in the number of reads. So using as a seed number 100 (or 4, like my teacher did in an example) should not matter, as long as i use the same seed number for both fastq files?

ADD REPLY
0
Entering edit mode

Yes, that is correct. The seed only affects how the reads are selected.

ADD REPLY
0
Entering edit mode

Thank you ningyuan.sg :)

ADD REPLY
0
Entering edit mode

Is there a way to check if the subsampling has worked? for example, if I write seqtk sample -s100 read1.fq 10000 > sub1.fq , How can I check if the reads have been subsampled to the quantity specified (10000)?

ADD REPLY
0
Entering edit mode

If all you need are the number of sequences count the number of lines and divide by 4

echo $(cat reads.fq | wc -l)/4 | bc 

To actually verify that the sampling was done as claimed, well that would be a much harder task to verify.

ADD REPLY
24
Entering edit mode
13.7 years ago

In one command line:

paste f1.fastq f2.fastq |\ #merge the two fastqs
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |\ #merge by group of 4 lines
shuf  |\ #shuffle
head |\ #only 10 records
sed 's/\t\t/\n/g' |\ #restore the delimiters
awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}' #split in two files.
ADD COMMENT
1
Entering edit mode

never heard of shuf before - turns out I have it (Fedora) - will the wonders ever cease

ADD REPLY
1
Entering edit mode

Hi This script seems to be a good solution. However I had a small problem - The fastq headers of the second file are deleted partially. Any solution ? Regards, P

ADD REPLY
1
Entering edit mode

I believe that I was able to solve your problem, paulorapazote, by making the \t field delimiter explicit for awk in the last line (with the -F option), i.e.:

awk -F '\t' '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}' #split in two files.
ADD REPLY
0
Entering edit mode

If "shuf" standard?

ADD REPLY
0
Entering edit mode

I found it on all the linux systems I used.

ADD REPLY
0
Entering edit mode

Huh. It's not on any of mine, but I do see it's a GNU util. Thanks...learned something new!

ADD REPLY
0
Entering edit mode

Likewise. I wrote my own years ago b/c I understood there not to be such a GNU util.

ADD REPLY
0
Entering edit mode

Hi Pierre, I am curious, how do you know the right pairs are being paired? is that done by the merge function? thanks.

ADD REPLY
0
Entering edit mode

what is the "right" pairs ? in my script , two files are temporarily merged into one file.

ADD REPLY
0
Entering edit mode

well, then the pairs are in the same order in each file, but that makes sense, thanks!

ADD REPLY
16
Entering edit mode
13.7 years ago
Martin Morgan ★ 1.6k

In Bioconductor, the ShortRead package has

library(ShortRead)
f1 <- FastqSampler("file1.fastq", n=1e6)
f2 <- FastqSampler("file2.fastq", n=1e6)

set.seed(123L); p1 <- yield(f1)
set.seed(123L); p2 <- yield(f2)

Assuming reads are ordered identically in each file. FastqSampler subsets the file in a single pass, keeping the ith record if i <= n or a random (0, 1] deviate is less than n / i; in the latter case the ith read replaces a randomly chosen read in the n currently saved (actually, the implementation does this on chunks of reads, in line with how R works best).

ADD COMMENT
8
Entering edit mode
13.7 years ago
brentp 24k

I had some code to do this for single end here: https://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py

You can easily make that work for paired end by iterating over fq1 and fq2 in parallel. Actually, I edited it so it will do as you suggest and paste below.

You call it like:

$ python get_subset.py /path/to/reads_1.fastq /path/to/reads_2.fastq 100

and it will create new files with the same name as your paired files, but ending in .subset and containing 100 random records.

import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """ get N random headers from a fastq file without reading the
    whole thing into memory"""
    records = sum(1 for _ in open(fqa)) / 4
    rand_records = sorted([random.randint(0, records - 1) for _ in xrange(N)])

    fha, fhb = open(fqa),  open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = - 1
    for rr in rand_records:

        while rec_no < rr:
            rec_no += 1       
            for i in range(4): fha.readline()
            for i in range(4): fhb.readline()
        for i in range(4):
            suba.write(fha.readline())
            subb.write(fhb.readline())
        rec_no += 1 # (thanks @anderwo)

    print >>sys.stderr, "wrote to %s, %s" % (suba.name, subb.name)

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)
ADD COMMENT
2
Entering edit mode

I think rand_records = sorted(random.sample(xrange(records), int(N)))

Would do the job.

ADD REPLY
0
Entering edit mode

This bit is slick. Very nice.

records = sum(1 for _ in open(fqa)) / 4
rand_records = sorted([random.randint(0, records) for _ in xrange(N)])
ADD REPLY
0
Entering edit mode

thanks. it works nicely if you're not worried about leaving open file-handles (which i am not).

ADD REPLY
0
Entering edit mode

Thanks for the script. I'm pretty sure the following lines should be altered Otherwise it returns the wrong number of reads

  1. rand_records = sorted([random.randint(1, records) for _ in xrange(N)])
  2. rec_no = 0
  3. for i in range(4):
        rec_no += 1
        suba.write(fha.readline())
        subb.write(fhb.read
    
ADD REPLY
0
Entering edit mode

@aunderwo, I changed it to do random.randint(0, records - 1), I think everything else works as it should.

ADD REPLY
0
Entering edit mode

@brentp, you didn't comment on 2) from @aunderwo. I also agree that rec_no should be initialized 0 instead of -1. Right?

ADD REPLY
0
Entering edit mode

@brentp Thanks for the update. I still think you need the rec_no += 1 after the for i in range(4): otherwise over time it returns less records then specified. I have tested this with a range of N

ADD REPLY
0
Entering edit mode

@anderwo Thanks. You are correct. Updated. BTW, I have been using this which does everything correctly (now that I just adjusted the upper bound to randint). It even checks to make sure it writes the correct number of records.

ADD REPLY
0
Entering edit mode

@brentp: thanks for this. One potential issue when generating the list of record indices is that the same index can appear multiple times. Something like this would enforce uniqueness:

rand_records = []
  while len(rand_records) &lt; N:
    r = random.randint(0,records-1)
    if r not in rand_records: rand_records.append(r)
  rand_records = sorted(rand_records)

but isn't very elegant :(

ADD REPLY
0
Entering edit mode

What about:

rand_records = random.sample(range(0, records), N)
rand_records.sort()

random.sample performs random selection without replacement, so the check you've added above would just be built-in to the call.

ADD REPLY
7
Entering edit mode
13.7 years ago

In the past, I've used the following paste/awk trick to combine the two files into a single line, assign a random number to each pair, sort by the random number, then choose the first N pairs you wish.

One command line, but broken up with comments for clarity.

# Starting FASTQ files
export FQ1=1.fq
export FQ2=2.fq

# The names of the random subsets you wish to create
export FQ1SUBSET=1.rand.fq
export FQ2SUBSET=2.rand.fq

# How many random pairs do we want?
export N=100

# paste the two FASTQ such that the 
# header, seqs, seps, and quals occur "next" to one another
  paste $FQ1 $FQ2 | \
# "linearize" the two mates into a single record.  Add a random number to the front of each line
  awk 'BEGIN{srand()}; {OFS="\t"; \
                        getline seqs; getline sep; getline quals; \
                        print rand(),$0,seqs,sep,quals}' | \
# sort by the random number
  sort -k1,1 | \
# grab the first N records
  head -n $N | \
# Convert the stream back to 2 separate FASTQ files.
  awk '{OFS="\n"; \
        print $2,$4,$6,$8 >> ENVIRON["FQ1SUBSET"]; \
        print $3,$5,$7,$9 >> ENVIRON["FQ2SUBSET"]}'
ADD COMMENT
0
Entering edit mode

Hoopss, I've just realized that my answer was very close to yours

ADD REPLY
0
Entering edit mode

Thanks Aaron!!!

ADD REPLY
4
Entering edit mode
13.7 years ago
Ketil 4.1k

Thanks for all the answers. Being bored, I also wrote my own program for this (http://malde.org/~ketil/biohaskell/biolib/examples/RSelectPE.hs), but the examples above illustrates some nice techniques to achieve this using other tools, so I hope you'll forgive me for also attacking the problem myself.

Most solutions shuffle the files and takes the heads, I expect that brentp's solution of only sorting the chosen indices (if I understand correctly) will be more efficient, even if Python is likely to have a slower sort/random than Unix tools (shuf and sort).

I also like the trick by Martin Morgan of running a random sampling routine twice with the same initial seed. In my own implementation, I also select each read based on a random value, but I work on pairs of files simultaneously, which requires more plumbing.

Again, thanks to everybody who contributed!

ADD COMMENT
0
Entering edit mode

+1 for the open source program

ADD REPLY
4
Entering edit mode
10.3 years ago

You could use sample:

$ sample -k 1000 -l 4 huge_file.fastq > 1k_sample.fq

This uses reservoir sampling on newline offsets to reduce memory overhead. If your system is thrashing for lack of system memory, this will help a great deal by reducing the memory overhead required for sampling as compared with other tools.

Including GNU sort, which will often run out of memory on genome-scale inputs, because, while it offers reservoir sampling, it stores the entire line in memory, for every line of input. For genome-scale work, this is often simply too much data.

Specify -k N for the number N of samples, and -l L for number of lines per record-to-sample (L). In the case of fastq, that would be four lines per record, so -l 4.

Add -d S to specify a seed value S, if desired, or it is drawn from a Mersenne Twister PRNG.

Generating your own random seed and applying that same seed when sampling two fastq files will let you grab the same samples from two files of paired reads:

$ sample -k 1000 -l 4 -s 1234 first_pair.fastq > 1k_first_pair_sample.fq
$ sample -k 1000 -l 4 -s 1234 second_pair.fastq > 1k_second_pair_sample.fq

If you have one interleaved file, then you could specify -l 8 to sample a pair of records from every eight lines:

$ sample -k 1000 -l 8 interleaved.fastq > 1k_paired_sample.fq

You can also add -r to sample with replacement; the default is to sample without replacement. Or add -p to print a sample in the order provided by the original input.

ADD COMMENT
0
Entering edit mode

Thank you for your response, it is helpful for me as it doesn't require a large amount of memory.

ADD REPLY
1
Entering edit mode
12.6 years ago
chris.mit7 ▴ 60

Bit old, but here's my solution

import random
f = open('CD4CD45RO_GATCAG_s_1_1_sequence.txt')
f2 = open('CD4CD45RO_GATCAG_s_1_2_sequence.txt')
o = openf.name+'.sample', 'wb')
o2 = openf2.name+'.sample', 'wb')
row = f.readline()
sampleSize = 1000000
tSize = 66000000
f2pos = []
p = float(sampleSize)/float(tSize)
while row:
    if row[0] == '@':
        if random.random():
            f2pos.append(f.tell()-len(row))
            o.write(row)
            for i in xrange(3):
            o.write(f.readline())
    row = f.readline()
for pos in f2pos:
    f2.seek(pos)
    for i in xrange(4):
        o2.write(f2.readline())

I'm not sure how long the others take, this one is done in like ~10 minutes on 66 million reads.

ADD COMMENT
0
Entering edit mode
13.6 years ago
Paquola • 0

I just saw in the man page that shuf has this option:

   -n, --head-count=COUNT
          output at most COUNT lines

This is potentially much more efficient than permuting the whole file and then | head.

But never mind... Tested it, and shuf puts the whole thing in memory. It's not really different than | head.

ADD COMMENT
0
Entering edit mode
10.3 years ago
mondiraray • 0

http://seqanswers.com/forums/showthread.php?t=12070

((Used the HTSeq tool as recommended by Simon))

I also ran out of memory when I tried Aaron, Pierre, and Martin's recommendations. Hope this helps for anyone else also having memory issues. Thanks.

ADD COMMENT
0
Entering edit mode
8.6 years ago
plijnzaad ▴ 40

Dunno if still relevant, but I remember having written a tool for this that used one pass rejection sampling, see https://github.com/plijnzaad/phtools/blob/master/ngs/fastq-subsample-paired.pl

ADD COMMENT
0
Entering edit mode

Isn't there a neater way of doing this nowadays? with some command from Samtools or a tool like that, maybe? thanks anyways for sharing your script

ADD REPLY
0
Entering edit mode
18 months ago

I believe that the easiest option for paired-end data is to use fastq-sample from fastq-tools:

fastq-sample -n 5000000 pair_R1.fastq.gz pair_R2.fastq.gz -o pair_5M_R
ADD COMMENT

Login before adding your answer.

Traffic: 2761 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6