Hi
I have FASTQ files of different sizes, e.g. one is 3 GB and another is 18 GB. For the analysis I want to randomly select reads from the larger files to bring them down to about 5 GB. Is there a program or script that can do this?
This is shotgun metagenome data, and my goal is taxonomic classification: align the reads with Diamond, then import the output into MEGAN6 Community Edition to classify them. But the larger FASTQ files take a very long time to produce Diamond blast output, and that output is itself several GB in size, which is difficult for MEGAN to analyse and hangs my system.
By size, probably not, but there are many options to select by read number; testing a few sizes will help you home in on the number that comes closest to the size you want (Google either of the tools seqtk or seqkit for more details).
seqtk sample
prints:
Usage: seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>
Options: -s INT       RNG seed [11]
         -2           2-pass mode: twice as slow but with much reduced memory
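For example, to take the same random subset of reads from both files of a pair, something along these lines should work (file names and the read count are placeholders; using the same seed for R1 and R2 keeps the mates in sync):
# 10M reads per file is a placeholder; adjust until the output is close to the size you want
seqtk sample -s11 reads_R1.fastq.gz 10000000 > sub_R1.fastq
seqtk sample -s11 reads_R2.fastq.gz 10000000 > sub_R2.fastq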
or using seqkit:
seqkit sample -h
prints:
sample sequences by number or proportion.
Usage:
seqkit sample [flags]
Flags:
-h, --help help for sample
-n, --number int sample by number (result may not exactly match)
-p, --proportion float sample by proportion
-s, --rand-seed int rand seed (default 11)
-2, --two-pass 2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
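For example, to keep roughly a third of the reads by proportion (file names and the fraction are placeholders; again, the same seed on both files of a pair keeps them paired):
# -p is the fraction of reads to keep; -s sets the seed so the run is reproducible
seqkit sample -p 0.3 -s 11 reads_R1.fastq.gz -o sub_R1.fastq.gz
seqkit sample -p 0.3 -s 11 reads_R2.fastq.gz -o sub_R2.fastq.gz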
Hi Istvan Albert
Thanks for your response. I have one more question, which is a bit off topic: do you have any idea what E-value I should use when running Diamond blastx against the nr database? The default is 1e-3. I want to find hits for all of my metagenome reads. How would it affect my results if I set the E-value to 1e-20 to get results faster?
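For context, my command is roughly of this form (paths are placeholders; --outfmt 100 writes the DAA file that I later load into MEGAN, and --evalue is the setting I am asking about):
# nr.dmnd is a placeholder for a database built beforehand with diamond makedb
diamond blastx --db nr.dmnd --query reads.fastq.gz --out reads_vs_nr.daa --outfmt 100 --evalue 1e-3 --threads 16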
You may be asking about one of two different operations: down-sampling and normalization are not the same thing.
You can use reformat.sh from the BBMap suite to simply down-sample the larger file. The following options are relevant (an example command follows the list).
reads=-1 Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1 Skip (discard) this many INPUT reads before processing the rest.
samplerate=1 Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1 Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0 (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0 (sbt) Exact number of OUTPUT bases desired.
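As a concrete example, something like this should produce a down-sampled pair of roughly the desired size (file names, base target, and seed are placeholders; samplebasestarget counts bases rather than file bytes, since each FASTQ record also carries a header and a quality string, so some trial and error will be needed to land near the file size you want):
# base target of 2 billion is a placeholder; adjust until the output file size is close to ~5 GB
reformat.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=sub_R1.fq.gz out2=sub_R2.fq.gz samplebasestarget=2000000000 sampleseed=42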
Depending on the aim of the analysis, it may be more appropriate to normalize the data using bbnorm.sh. A guide is available.
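If normalization turns out to be the better fit, a minimal bbnorm.sh run looks roughly like this (file names are placeholders; target and min here are the documented defaults for target coverage and minimum depth):
# normalizes to ~100x coverage and discards reads with apparent depth below 5
bbnorm.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=norm_R1.fq.gz out2=norm_R2.fq.gz target=100 min=5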
What type of data is this, and what is the ultimate goal of the analysis?
This is shotgun metagenome data, and the goal is taxonomic classification: align the reads with Diamond, then import the output into MEGAN6 Community Edition. The larger FASTQ files take a very long time in Diamond, and they produce output several GB in size that is difficult for MEGAN to analyse; it hangs my system.
thanks for responding :-)