Downsampling dataset with more than 60 million reads
10
6
10.3 years ago
mike ▴ 90

Hello,

Which tool can downsample an RNA-seq dataset of more than 60 million paired reads in FASTQ format? I have tried seqtk, but it consumes too much memory.

Thanks

next-gen RNA-Seq • 35k views
0

Picard DownsampleSam?

0

Thanks, but I want to downsample the data at the FASTQ level.

1

My apologies, I didn't read the question carefully enough.

11
10.3 years ago
lh3 33k

seqtk can downsample in three ways.

1. Downsample a fraction of reads (useful when you know the total number):

seqtk sample read1.fa.gz 0.2 > sub1.fa
seqtk sample read2.fa.gz 0.2 > sub2.fa

It takes little memory.

2. Downsample a fixed number (streaming mode):

seqtk sample read1.fa 20000 > sub1.fa
seqtk sample read2.fa 20000 > sub2.fa

It uses reservoir sampling on a stream. The peak RAM equals the size of the output, independent of the input size.

3. Downsample a fixed number (2-pass mode):

seqtk sample -2 read1.fa.gz 20000 > sub1.fa
seqtk sample -2 read2.fa.gz 20000 > sub2.fa

It reads the input twice (so it is about twice as slow). In the first pass, it picks the indices of the reads to sample. In the second pass, it outputs the reads at the stored indices. The peak RAM is about 24 bytes times the number of sampled reads, again independent of the input size. You need the latest seqtk for this 2-pass mode.
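
For readers who want to see the 2-pass idea spelled out, here is a minimal Python sketch. It is not seqtk's code and may pick indices differently; it assumes uncompressed FASTQ with exactly 4 lines per record, and the name `two_pass_sample` is made up for this example.

```python
import random

def two_pass_sample(path, n, seed=11):
    """Yield n randomly chosen records (4-line strings) from a FASTQ file.

    Pass 1 only counts records and picks indices; pass 2 streams the file
    again and emits the picked records. Memory grows with n (the index
    set), not with the size of the input.
    """
    rng = random.Random(seed)
    # Pass 1: count records (assumes exactly 4 lines per record).
    with open(path) as fh:
        total = sum(1 for _ in fh) // 4
    # Distinct indices to keep; capped at the number of records available.
    keep = set(rng.sample(range(total), min(n, total)))
    # Pass 2: read record by record and emit only the chosen ones.
    with open(path) as fh:
        for idx in range(total):
            record = "".join(fh.readline() for _ in range(4))
            if idx in keep:
                yield record
```

Calling it with the same seed on read1 and read2 picks the same indices in both files, provided both contain the same number of records.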

0

I think it is essential to set the same seed (-s) for both files of a pair so that the same reads are selected from each.
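
To illustrate why the seed matters, here is a minimal Python sketch. It is not seqtk's code; `sample_fraction` and the toy read names are made up for this example. Two inputs of equal length, filtered with the same seed, keep exactly the same positions, so mates stay paired.

```python
import random

def sample_fraction(records, fraction, seed):
    """Yield each record with probability `fraction`, using a seeded RNG."""
    rng = random.Random(seed)
    for rec in records:
        if rng.random() < fraction:
            yield rec

# Toy stand-ins for the forward and reverse files (hypothetical names).
r1 = ["read%d/1" % i for i in range(1000)]
r2 = ["read%d/2" % i for i in range(1000)]

sub1 = list(sample_fraction(r1, 0.2, seed=100))
sub2 = list(sample_fraction(r2, 0.2, seed=100))

# Same seed and same number of records: the same positions are kept.
in_sync = [a[:-2] for a in sub1] == [b[:-2] for b in sub2]
print(len(sub1), len(sub2), in_sync)
```

With different seeds for the two files, the kept positions would almost certainly differ and the pairing would be lost.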

6
8.4 years ago

reformat.sh from the BBMap package can also do exact subsampling using a small, fixed amount of memory. The normal mode reads the file once and samples at a fixed rate (e.g. samplerate=0.1 will output about 10% of the reads). Exact mode can be used with "samplereadstarget" or "samplebasestarget". Each will read the file twice, and output exactly that number of reads or bases randomly sampled from throughout the file (obviously, the number of bases has granularity of the read length; it's not going to output half a read). It does not require reads be uniform length, and can handle gzipped input, interleaved or paired fastqs, multiline fasta, etc.

To sample 10% of the reads:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1

or more concisely:
reformat.sh in=reads#.fq out=sampled#.fq samplerate=0.1

and for exact sampling:
reformat.sh in=reads#.fq out=sampled#.fq samplereadstarget=100k
1
10.3 years ago
Irsan ★ 7.8k

Put this in a file called sample_N_fastq.py

# Usage: python sample_N_fastq.py forward.fastq reverse.fastq 20000

import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """ get N random records from a pair of fastq files without reading the
    whole thing into memory (records are drawn with replacement)"""
    records = sum(1 for _ in open(fqa)) // 4  # 4 lines per record
    rand_records = sorted(random.randint(0, records - 1) for _ in range(N))

    fha, fhb = open(fqa), open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = -1  # index of the record currently held in rec_a / rec_b
    rec_a = rec_b = None
    for rr in rand_records:
        # read forward, in both files at once, until record rr is in hand
        while rec_no < rr:
            rec_a = [fha.readline() for _ in range(4)]
            rec_b = [fhb.readline() for _ in range(4)]
            rec_no += 1
        # a repeated index writes the same pair again
        suba.writelines(rec_a)
        subb.writelines(rec_b)

    print("wrote to %s, %s" % (suba.name, subb.name), file=sys.stderr)

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)

And use it on unzipped fastq-files like this to get 1M reads:

[your prompt]$ python sample_N_fastq.py forward.fastq reverse.fastq 1000000
2

This will produce a sample with replacement, meaning that the same record may be selected more than once. There are two simpler solutions.

One is to choose a probability of printing each record and iterate with that. This does not guarantee an exact number of records, but it may be just fine for downsampling and produces results of the same type as the above. The code is simpler and more efficient because it does not need to sort the array.

If the exact number of records is desired, a better solution is to shuffle the record numbers, take the first N, sort them, and proceed with the rest of the code above.
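
A minimal sketch of that index selection in Python, assuming the total record count is already known. The helper name `pick_record_indices` is made up here. `random.sample` draws without replacement, which has the same effect as shuffling and slicing without building the full shuffled list.

```python
import random

def pick_record_indices(total, n, seed=None):
    """Return n distinct record indices in ascending order."""
    rng = random.Random(seed)
    # Drawing without replacement guarantees that no index repeats.
    return sorted(rng.sample(range(total), n))

indices = pick_record_indices(total=1000, n=100, seed=1)
print(len(indices), len(set(indices)))
```

The sorted list can then be used in place of rand_records in the script above.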

See also the post: Selecting Random Pairs From Fastq?

0

Thanks for the code Irsan. Will I get downsampled data in fastq format?

0

One little problem with this code: the file handles (fha, fhb, suba, subb) are never closed. Either use with open(fqa) as fha: ... or call close() on each handle at the end of write_random_records.

Also, this implementation allows a single record to be picked multiple times. That means that if you try to subsample to 50% of the original size, it is almost certain that some FASTQ reads will be duplicated.

1
10.3 years ago

I wrote a tool in C called sample that was designed to get past memory limitations in GNU shuf, which can choke on shuffling inputs of the scale dealt with in genomic studies. I haven't compared it with seqtk, but perhaps sample might be of use to you for FASTQ and other large-scale genomic datasets.

The sample program stores an eight-byte integer for every location of a per-line or per-multiple-line element in a text-formatted input file. These locations are shuffled and a uniformly-random sample is taken of the desired sample size.

For a, say, 4 GiB computer that can allocate 3.5 GiB of memory to sample, up to 470M four-line FASTQ records can be sampled without replacement (235M, with replacement):

$ sample --lines-per-offset=4 --sample-size=${N} reads.fastq > sample.fastq

If the FASTQ records are paired, paired samples can be derived by making a FASTQ file where one record is followed immediately by its pair, and then using sample with --lines-per-offset=8.

The increased lines-per-offset parameter allows sampling (without replacement) a paired input file containing up to 940M paired records (470M, with replacement) on a 4 GiB system allocating 3.5 GiB to sample:

$ sample --lines-per-offset=8 --sample-size=${N} paired_reads.fastq > paired_sample.fastq

The file paired_reads.fastq can be created by linearizing the two FASTQ files, using paste to interleave them with a newline delimiter, and then delinearizing the resulting sample.
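
As an alternative to the paste approach, the interleaving and the reverse step can be sketched in Python. This assumes 4-line records and mates in the same order in both files; the function names `interleave` and `deinterleave` are made up for this example and are not part of sample.

```python
from itertools import islice

def interleave(r1_lines, r2_lines):
    """Yield 8-line blocks: one R1 record followed by its R2 mate."""
    r1, r2 = iter(r1_lines), iter(r2_lines)
    while True:
        rec1 = list(islice(r1, 4))
        rec2 = list(islice(r2, 4))
        if not rec1 or not rec2:
            break  # stop at the end of the shorter input
        yield from rec1
        yield from rec2

def deinterleave(lines):
    """Split 8-line blocks back into two lists of lines (R1, R2)."""
    out1, out2 = [], []
    for i, line in enumerate(lines):
        # Lines 0-3 of each block belong to R1, lines 4-7 to R2.
        (out1 if i % 8 < 4 else out2).append(line)
    return out1, out2
```

Both functions work on any iterable of lines, so open file handles can be passed in directly and the interleaved output written to paired_reads.fastq.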

1
10.3 years ago
Fred ▴ 790

Maybe something like that could help:

#!/usr/bin/perl
use strict;
use warnings;
#
# Usage ./sampleFastq.pl <fastq r1> <fastq r2> <outFastq r1> <outFastq r2> <prob of keeping reads>
#
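# Three-argument open with error checks, so a bad path fails loudly
open(FASTQF, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
open(FASTQR, "<", $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
open(FASTQOUTF, ">", $ARGV[2]) or die "Cannot open $ARGV[2]: $!";
open(FASTQOUTR, ">", $ARGV[3]) or die "Cannot open $ARGV[3]: $!";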
my $proba = $ARGV[4];
my $line1;
my $line2;
my $nbLines = 1;
my $random;
my $fqRecord1;
my $fqRecord2;
while($line1=<FASTQF>){
    $line2=<FASTQR>;
    $fqRecord1.=$line1;
    $fqRecord2.=$line2;
    if($nbLines%4==0){
        $random = rand(1);
        if($random <= $proba){
            print FASTQOUTF $fqRecord1;
            print FASTQOUTR $fqRecord2;
        }
        $fqRecord1="";
        $fqRecord2="";
    }
    $nbLines++;
}
close(FASTQOUTR);
close(FASTQOUTF);
close(FASTQR);
close(FASTQF);

It takes the two FASTQ files as input, the names of the output files, and the probability of keeping each read. The probability may be computed as: number of desired reads / total number of reads.

If you want to apply this script on gzipped files:

./sampleFastq.pl <(gunzip -c f.fastq.gz) <(gunzip -c r.fastq.gz) >(gzip -c - > f_sample.fastq.gz) >(gzip -c - > r_sample.fastq.gz) 0.5
1

Yep, nice idea. For 60M reads and a 1M sample, the deviation would be about ±1000 reads.
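
That figure matches the binomial standard deviation. A quick check in Python, assuming each read is kept independently with probability p = 1M / 60M, which is what the Perl script above does:

```python
import math

total_reads = 60_000_000
target = 1_000_000

p = target / total_reads                    # per-read keep probability
expected = total_reads * p                  # mean of the binomial
sd = math.sqrt(total_reads * p * (1 - p))   # one standard deviation

print("expected %.0f reads, standard deviation %.0f" % (expected, sd))
```

One standard deviation comes out at roughly 992 reads, so most runs should land within a couple of thousand reads of the 1M target.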

1
8.4 years ago

If you want a uniform sample over the entire file, and if all the reads are the same length in both files, you can use a tool I've developed that is about as efficient as possible: mdshw5/strandex. It takes the file size, determines offsets within the file to start from, and matches FASTQ entries occurring after those offsets. This way you're reading only slightly more data than you sample, and an exact number of reads can be specified. Memory usage is <1MB and sampling time scales linearly with the number of reads sampled.

1
6.4 years ago
vmicrobio ▴ 290

You could try something in awk:

cat file.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) {printf("\n");} else { printf("\t");} }' | 
awk -v k=8000 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | 
awk -F"\t" '{print $1"\n"$2"\n"$3"\n"$4 > "downsamp-file.fastq"}'
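
The second awk stage is a reservoir sampler with k=8000. For anyone who finds the one-liner hard to read, here is the same technique as a minimal Python sketch. The function name `reservoir_sample` is made up for this example, and `records` can be any iterable, such as linearized FASTQ lines.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for seen, rec in enumerate(records, start=1):
        if len(reservoir) < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / seen by letting it
            # replace a uniformly chosen slot.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = rec
    return reservoir
```

Only the k retained records are held in memory, which is why the memory use depends on the output size and not on the input size. The output order is not the input order.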
0
10.3 years ago
SES 8.6k

If you specify the fraction of reads to sample with seqtk instead of the number, then only one read will be kept in memory at a time. For example,

seqtk sample -s100 read1.fq 0.1 > sub1.fq

That is not clear from the documentation, but the author does indicate this in another biostar thread ( A: Selecting Random Pairs From Fastq? ).

0

I am downsampling to the lowest read count among my paired FASTQ files. I don't think I can use a fraction in this case.

0

I'm not sure I understand what you mean. In your question you stated that you have 60 million paired sequences in fastq, and if you have a method for determining the lowest number you need, then just convert that to a fraction.

2

Just added a 2-pass mode to seqtk to trade speed for smaller peak memory.

1

Yes, for 60 million, fraction is preferred. 60 million 100bp reads would require at least 60M*100*2=12GB memory, plus the memory taken by the read names. There are ways to significantly reduce the memory with two-pass file reading.

0
10.3 years ago

My 5 cents.

First, FASTQ is usually assumed to be unsorted, so you can just take the first X reads.

The general solution (in Java) is to create an array of the read indices 1 to 60,000,000, shuffle it with Collections.shuffle, keep the first X numbers, and then, for each 4-line FASTQ record, check whether its index is among those X. Here is an example in Groovy. It uses no more than 4 GB of RAM and takes several seconds on a laptop to generate the index (lineNumbersFilter). The main limiting factor is I/O speed, as the hash lookup is quite fast.

// nLines is the number of reads (4-line FASTQ records), not text lines
int nLines = 60000000, sampleSize = 10000000

def fastq1 = "R1.fastq", fastq2 = "R2.fastq", out1 = "R1_ds.fastq", out2 = "R2_ds.fastq"

// Shuffle all read indices and keep the first sampleSize of them
def lineNumbers = new ArrayList<Integer>(0..<nLines)
Collections.shuffle(lineNumbers)
def lineNumbersFilter = new HashSet<Integer>(lineNumbers[0..<sampleSize])

int n = 0

new File(out1).withPrintWriter { pw1 ->
    new File(out2).withPrintWriter { pw2 ->
        new File(fastq1).withReader { reader1 ->
            new File(fastq2).withReader { reader2 ->
                def read1, read2
                // Each iteration consumes one 4-line record from each file
                while (((read1 = reader1.readLine()) != null) &&
                        ((read2 = reader2.readLine()) != null)) {
                    read1 += "\n" + reader1.readLine() +
                            "\n" + reader1.readLine() +
                            "\n" + reader1.readLine()
                    read2 += "\n" + reader2.readLine() +
                            "\n" + reader2.readLine() +
                            "\n" + reader2.readLine()
                    if (lineNumbersFilter.contains(n)) {
                        pw1.println(read1)
                        pw2.println(read2)
                    }
                    n++   // advance the read index once per record
                }
            }
        }
    }
}
0

I *think* reads in a fastq from Illumina sequencers are in the order as they appear on the lane of the flow cell. So they are randomized in terms of genomic position but technically they might be biased, which is the reason why the first reads are usually bad...

0

Sure, it won't protect against flow cell artifacts. However, it usually works fine, and only the first several reads are of very bad quality.

0
2.7 years ago
chunshi • 0

description="\
Example:
    $0 [number of reads | percentage of reads]
e.g.
    $0 1,000,000   # sampling 1 million reads
    $0 0.9         # sampling 90% of reads
"

if [ "$#" -eq 0 ]; then
    clear
    cat <<< "$description"
    exit 0
fi

seqtk=PATH/TO/seqtk

N=$1
N=$(sed 's/,//g' <<< $N)
out_dir="sampling_$N"

mkdir -p $out_dir

for fq in *.gz
do
    echo "processing $fq"
    q "$seqtk sample -s100 $fq $N | gzip> $out_dir/$fq"
    sleep 1
done

echo "The output sampling fastq files: ./$out_dir"
