Hello,
Which tool can perform downsampling of an RNA-seq dataset with more than 60 million paired reads in fastq? I have used seqtk, but it has issues with large memory consumption.
Thanks
seqtk can do the downsampling. To sample a fixed fraction of the reads:
seqtk sample read1.fa.gz 0.2 > sub1.fa
seqtk sample read2.fa.gz 0.2 > sub2.fa
It takes little memory. With 60M read pairs, a fraction of 0.2 keeps about 12 million pairs from each file; use the same random seed (-s) on both files so that mates stay paired. To sample an exact number of reads (here 20,000) instead:
seqtk sample read1.fa 20000 > sub1.fa
seqtk sample read2.fa 20000 > sub2.fa
It uses reservoir sampling on a stream. The peak RAM scales with the size of the output and is independent of the input size.
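For intuition, here is a minimal Python sketch of reservoir sampling over four-line FASTQ records (an illustration of the idea only, not seqtk's actual code; the file name and count arguments are hypothetical):

import random
import sys

def reservoir_sample_fastq(path, k):
    # Keep k random 4-line records from a FASTQ stream; memory is O(k).
    reservoir = []
    with open(path) as fh:
        n = 0
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if n < k:
                reservoir.append(record)
            else:
                j = random.randint(0, n)  # keep the new record with probability k/(n+1)
                if j < k:
                    reservoir[j] = record
            n += 1
    return reservoir

if __name__ == "__main__":
    # e.g. python reservoir_sample.py reads.fastq 20000 > sub.fastq
    for rec in reservoir_sample_fastq(sys.argv[1], int(sys.argv[2])):
        sys.stdout.write("".join(rec))

The whole reservoir lives in memory, which is why the peak RAM tracks the output size; seqtk's 2-pass mode below avoids even that cost: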
seqtk sample -2 read1.fa.gz 20000 > sub1.fa
seqtk sample -2 read2.fa.gz 20000 > sub2.fa
It reads the input twice (so it is about twice as slow). In the first pass, it chooses the sampled read indices. In the second pass, it outputs the reads at the stored indices. The peak RAM is about the number of sampled reads multiplied by 24 bytes, again independent of the input. You need the latest seqtk for this 2-pass mode.
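For scale: sampling 20 million reads in 2-pass mode would peak at roughly 20,000,000 × 24 bytes ≈ 480 MB, no matter how large the input file is.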
reformat.sh from the BBMap package can also do exact subsampling using a small, fixed amount of memory. The normal mode reads the file once and samples at a fixed rate (e.g. samplerate=0.1 will output about 10% of the reads). Exact mode can be used with "samplereadstarget" or "samplebasestarget". Each will read the file twice and output exactly that number of reads or bases, randomly sampled from throughout the file (obviously, the number of bases has the granularity of the read length; it's not going to output half a read). It does not require reads to be of uniform length, and it can handle gzipped input, interleaved or paired fastqs, multiline fasta, etc.
To sample 10% of the reads:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1
or more concisely:
reformat.sh in=reads#.fq out=sampled#.fq samplerate=0.1
and for exact sampling:
reformat.sh in=reads#.fq out=sampled#.fq samplereadstarget=100k
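As a note on that granularity: with uniform 100 bp reads, samplebasestarget=1000000 works out to exactly 1,000,000 / 100 = 10,000 whole reads; with mixed read lengths the total simply lands as close to the target as whole reads allow.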
Put this in a file called sample_N_fastq.py
# Usage: python sample_N_fastq.py forward.fastq reverse.fastq 20000
import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """Get N random records from a pair of fastq files without reading
    the whole thing into memory."""
    records = sum(1 for _ in open(fqa)) // 4
    rand_records = sorted(random.randint(0, records - 1) for _ in range(N))
    fha, fhb = open(fqa), open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = -1  # index of the last record consumed from the inputs
    for rr in rand_records:
        while rec_no < rr - 1:  # skip ahead until record rr is next
            rec_no += 1
            for i in range(4): fha.readline()
            for i in range(4): fhb.readline()
        for i in range(4):
            suba.write(fha.readline())
            subb.write(fhb.readline())
        rec_no += 1  # the write above consumed record rr (thanks @anderwo)
    print("wrote to %s, %s" % (suba.name, subb.name), file=sys.stderr)

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)
And use it on unzipped fastq files like this to get 1M reads:
[your prompt]$ python sample_N_fastq.py forward.fastq reverse.fastq 1000000
This will produce a sample with replacement, meaning that the same record may be selected more than once. There are two simpler solutions.
One is to decide on a probability of keeping each record and iterate with that. This will not give an exact number of records, but that may be just fine for downsampling, and it produces the same type of result as the code above. The code would be simpler and more efficient, as it would not need to build and sort the array.
If an exact number of records is desired, a better solution is to shuffle the record numbers, slice the top N, sort those, and proceed with the rest of the code above.
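In Python, that shuffle-and-slice is exactly what random.sample does, so the fix is a one-line change to the script above (variable names as in that script):

import random
# Draw N distinct record indices (sampling without replacement),
# then sort them so the files can still be read in one forward pass.
rand_records = sorted(random.sample(range(records), N))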
See also the post: Selecting Random Pairs From Fastq?
One little problem with this code: you never close the files you open. Either use with open(...) as ...: blocks, or call close() on each handle at the end of write_random_records.
Also this implementation allows for picking single record multiple times. That means that if you'll try to sub-sample to 50% size, it is almost certain that some fastq reads will be duplicated.
I wrote a tool in C called sample that was designed to get past memory limitations in GNU shuf, which can choke when shuffling inputs of the scale dealt with in genomic studies. I haven't compared it with seqtk, but perhaps sample might be of use to you for FASTQ and other large-scale genomic datasets.
The sample program stores an eight-byte integer for the location of every per-line or per-multiple-line element in a text-formatted input file. These locations are shuffled, and a uniformly random sample of the desired size is taken.
On, say, a 4 GiB computer that can allocate 3.5 GiB of memory to sample, up to 470M four-line FASTQ records can be sampled without replacement (235M, with replacement):
$ sample --lines-per-offset=4 --sample-size=${N} reads.fastq > sample.fastq
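Those capacity figures follow from the offset table: 3.5 GiB ÷ 8 bytes per stored offset ≈ 470 million offsets, matching the quoted 470M records without replacement; the with-replacement capacity is half that (235M).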
If the FASTQ records are paired, paired samples can be derived by making a FASTQ file where one record is immediately followed by its pair, and then using sample with --lines-per-offset=8.
The increased lines-per-offset parameter allows sampling (without replacement) a paired input file containing up to 940M paired records (470M, with replacement) on a 4 GiB system allocating 3.5 GiB to sample:
$ sample --lines-per-offset=8 --sample-size=${N} paired_reads.fastq > paired_sample.fastq
The file paired_reads.fastq can be created by linearizing the two FASTQ files, using paste to interleave them with a newline delimiter, and then delinearizing the resulting sample.
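If you prefer not to wire up the paste round trip, here is a minimal Python sketch of the same interleave/de-interleave step (hypothetical helper functions, not part of sample; it assumes four-line records and the same number of reads in both files):

def interleave(r1_path, r2_path, out_path):
    # Write each R1 record immediately followed by its R2 mate (8 lines per pair).
    with open(r1_path) as r1, open(r2_path) as r2, open(out_path, "w") as out:
        while True:
            rec1 = [r1.readline() for _ in range(4)]
            rec2 = [r2.readline() for _ in range(4)]
            if not rec1[0] or not rec2[0]:  # end of either file
                break
            out.writelines(rec1 + rec2)

def deinterleave(paired_path, out1_path, out2_path):
    # Split an 8-lines-per-pair file back into two FASTQ files.
    with open(paired_path) as fh, open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        while True:
            block = [fh.readline() for _ in range(8)]
            if not block[0]:
                break
            o1.writelines(block[:4])
            o2.writelines(block[4:])

interleave("reads1.fastq", "reads2.fastq", "paired_reads.fastq")
# ... run sample --lines-per-offset=8 on paired_reads.fastq ...
deinterleave("paired_sample.fastq", "sample_R1.fastq", "sample_R2.fastq")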
Maybe something like this could help:
#!/usr/bin/perl
use strict;
use warnings;
#
# Usage ./sampleFastq.pl <fastq r1> <fastq r2> <outFastq r1> <outFastq r2> <prob of keeping reads>
#
open(FASTQF, $ARGV[0]);
open(FASTQR, $ARGV[1]);
open(FASTQOUTF, ">".$ARGV[2]);
open(FASTQOUTR, ">".$ARGV[3]);
my $proba = $ARGV[4];
my $line1;
my $line2;
my $nbLines = 1;
my $random;
my $fqRecord1;
my $fqRecord2;
while ($line1 = <FASTQF>) {
    $line2 = <FASTQR>;
    $fqRecord1 .= $line1;
    $fqRecord2 .= $line2;
    if ($nbLines % 4 == 0) {
        $random = rand(1);
        if ($random <= $proba) {
            print FASTQOUTF $fqRecord1;
            print FASTQOUTR $fqRecord2;
        }
        $fqRecord1 = "";
        $fqRecord2 = "";
    }
    $nbLines++;
}
close(FASTQOUTR);
close(FASTQOUTF);
close(FASTQR);
close(FASTQF);
It takes the two FastQ files and the names of the output files as input, plus a probability of keeping each read. The probability can be computed as: number of desired reads / total number of reads.
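For example, to keep roughly 10 million out of 60 million read pairs, the probability would be 10000000 / 60000000 ≈ 0.17.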
If you want to apply this script on gzipped files:
./sampleFastq.pl <(gunzip -c f.fastq.gz) <(gunzip -c r.fastq.gz) >(gzip -c - > f_sample.fastq.gz) >(gzip -c - > r_sample.fastq.gz) 0.5
If you want a uniform sample over the entire file, and if all the reads are the same length in both files, you can use a tool I've developed that is about as efficient as possible: mdshw5/strandex. It takes the file size, determines offsets within the file to start from, and matches FASTQ entries occurring after those offsets. This way you read only slightly more data than you sample, and an exact number of reads can be specified. Memory usage is <1 MB, and sampling time scales linearly with the number of reads sampled.
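To give a feel for the approach, here is a rough Python sketch of offset-based sampling (an illustration of the idea only, not strandex's actual code; the boundary check is a simple heuristic that assumes uniform four-line records, and duplicates are possible):

import os
import random
import sys

def sample_at_offsets(path, n_reads):
    # Seek to random byte offsets and emit the first complete record
    # found after each one.
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        for _ in range(n_reads):
            fh.seek(random.randrange(size))
            fh.readline()  # discard the (probably partial) current line
            lines = [fh.readline() for _ in range(8)]
            for i in range(4):
                # A record starts at an '@' header whose third line is the '+' separator.
                if lines[i].startswith(b"@") and lines[i + 2].startswith(b"+"):
                    sys.stdout.buffer.write(b"".join(lines[i:i + 4]))
                    break

sample_at_offsets("reads.fastq", 20000)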
You may try something in awk; this linearizes each four-line record onto one tab-separated line, reservoir-samples k=8000 of them, and then expands the sample back into four-line records:
cat file.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) {printf("\n");} else { printf("\t");} }' |
awk -v k=8000 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' |
awk -F"\t" '{print $1"\n"$2"\n"$3"\n"$4 > "downsamp-file.fastq"}'
If you specify the fraction of reads to sample with seqtk instead of the number, then only one read will be kept in memory at a time. For example,
seqtk sample -s100 read1.fq 0.1 > sub1.fq
That is not clear from the documentation, but the author does indicate this in another biostar thread ( A: Selecting Random Pairs From Fastq? ).
My 5 cents.
First, FASTQ files are usually assumed to be unsorted, so you can often just take the first X reads.
The more general solution (Java) is to create an array of all 60,000,000 record indices, shuffle it with Collections.shuffle, take the first X numbers, and then, while reading each FASTQ record (4 lines at a time), check whether its index is among those first X shuffled numbers. Here is an example using Groovy; it uses no more than 4 GB of RAM and takes several seconds on a laptop to generate the index (lineNumbersFilter). The main limiting factor is I/O speed, as the hash lookup is quite fast.
int nLines = 60000000, sampleSize = 10000000
def fastq1 = "R1.fastq", fastq2 = "R2.fastq", out1 = "R1_ds.fastq", out2 = "R2_ds.fastq"

def lineNumbers = new ArrayList<Integer>(0..<nLines)
Collections.shuffle(lineNumbers)
def lineNumbersFilter = new HashSet<Integer>(lineNumbers[0..<sampleSize])

int n = 0
new File(out1).withPrintWriter { pw1 ->
    new File(out2).withPrintWriter { pw2 ->
        new File(fastq1).withReader { reader1 ->
            new File(fastq2).withReader { reader2 ->
                def read1, read2
                while (((read1 = reader1.readLine()) != null) &&
                       ((read2 = reader2.readLine()) != null)) {
                    read1 += "\n" + reader1.readLine() +
                             "\n" + reader1.readLine() +
                             "\n" + reader1.readLine()
                    read2 += "\n" + reader2.readLine() +
                             "\n" + reader2.readLine() +
                             "\n" + reader2.readLine()
                    if (lineNumbersFilter.contains(n)) {
                        pw1.println(read1)
                        pw2.println(read2)
                    }
                    n++ // count records; this must sit inside the loop
                }
            }
        }
    }
}
Here is a small wrapper script around seqtk that samples every gzipped fastq in the current directory with the same seed, so that paired files stay in sync:

description="\
Example: $0 [number of reads | percentage of reads]
e.g.
$0 1,000,000   # sampling 1 million reads
$0 0.9         # sampling 90% of reads
"

if [ "$#" -eq 0 ]; then
    clear
    cat <<< "$description"
    exit 0
fi

seqtk=PATH/TO/seqtk

N=$1
N=$(sed 's/,//g' <<< $N)
out_dir="sampling_$N"

mkdir -p $out_dir

for fq in *.gz
do
    echo "processing $fq"
    $seqtk sample -s100 $fq $N | gzip > $out_dir/$fq
    sleep 1
done

echo "The sampled fastq files are in: ./$out_dir"
Picard DownsampleSam?
Thanks, but I want to downsample the data at the fastq level.
My apologies, I haven't read the question carefully enough.