Is there a way to sample sequences from the fastq.gz
file? I have files that contain hundreds of millions of reads, I want to sample about 100 million reads from each file. Is there a way to do this?
Is there a way to sample sequences from the fastq.gz
file? I have files that contain hundreds of millions of reads, I want to sample about 100 million reads from each file. Is there a way to do this?
BBTools's reformat tool does this, keeping pairs together if they are interleaved:
reformat.sh in=reads.fq.gz out=sampled.fq.gz samplereadstarget=100000000
or for reads in 2 files:
reformat.sh in1=reads1.fq.gz in2=reads2.fq.gz out=sampled1.fq.gz out2=sample2.fq.gz samplereadstarget=100000000
You can also just sample a fraction of the reads without respect to the final number, which is faster since it only needs to read the file once:
reformat.sh in=reads.fq.gz out=sampled.fq.gz samplerate=0.1
or just sample the first X reads, which is faster still:
reformat.sh in=reads.fq.gz out=sampled.fq.gz reads=100000000
Reformat is multithreaded and extremely fast.
This gives the following error
sh ~/downloads/bbmap/reformat.sh in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000
java -ea -Xmx200m -cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000
Exception in thread "main" java.lang.ClassFormatError: jgi.ReformatReads (unrecognized class file version)
at java.lang.VMClassLoader.defineClass(libgcj.so.10)
at java.lang.ClassLoader.defineClass(libgcj.so.10)
at java.security.SecureClassLoader.defineClass(libgcj.so.10)
at java.net.URLClassLoader.findClass(libgcj.so.10)
at java.lang.ClassLoader.loadClass(libgcj.so.10)
at java.lang.ClassLoader.loadClass(libgcj.so.10)
at gnu.java.lang.MainThread.run(libgcj.so.10)
Is there a way around this?
My java version is
java -version
java version "1.7.0_03"
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
That's very strange; it should work with any Java 7 runtime. It seems like a different java is is being called when running the program than when you run java -version
, or that Java is configured incorrectly so that you have the executable for Java 7 but you are not using the standard libraries, since it is loading libgcj (from gnu) rather than the Oracle libraries.
You can try the latest Java 6 build, here, which should work, or else you can install Oracle's Java 7 JRE.
It's also possible that manually pathing to the location of Oracle's java executable would work:
/path/to/Java7/jre/bin/java \
-ea \
-Xmx200m \
-cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads \
in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz \
out=sampled_S11-12810Ca_V2.fastq.gz \
reads=72000000
Edit: Update - BBTools works best with Java 8+ now. It is fully compatible with Java 7 and mostly compatible with Java 6. But if your default Java version is 6 (or 7), I suggest you encourage your sysadmins to upgrade to a modern version of Java.
For reference - Java is fully forward and backward compatible, which is one of the things I love about it. However, you cannot use Java libraries written after your last download of Java. So, if I write a program that uses parallel sort (like Clumpify), which is supported in Java 8 but not Java 7... if you run Java 8 it will be faster, and in Java 7 it will be slower, since I wrote custom code to handle that scenario. But for BBNorm, which uses a Java library class specific to Java 7+, you can't even run it in Java 6 since it uses a class that did not exist then.
The time depends on the speed of the disk subsystem and whether you have pigz installed, which accelerates compression and decompression. But for randomly sampling an exact number of reads, Reformat takes the amount of time needed to read the file twice; for sampling X fraction of the reads, it reads the file once; and for just sampling the first X reads, the amount of time depends on X.
If you have a 10 GB raw file, it could be processed in about 20 seconds (if your disk system is fast enough). A 10 GB raw file that has been gzipped would take more like 125 seconds, assuming decompression at around 80MB/s.
If you want a Galaxy solution, http://toolshed.g2.bx.psu.edu/view/peterjc/sample_seqs / https://github.com/peterjc/pico_galaxy/tree/master/tools/sample_seqs
This offers non-random sampling (for reproducibility), e.g. take 12% of the sequences, or every 100th sequence.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Duplicate of
Selecting Random Pairs From Fastq?
How To Randomly Select 20M Reads From A 200M Fastq File
No, I dont want to extract the files. I saw the question, it assumes that the files are "extracted", no longer a *.gz.