Hi
I have FASTQ files of different sizes, e.g. one is 3 GB and another is 18 GB. For the analysis I want to randomly select reads from the larger files to bring them down to about 5 GB. Is there a program or script that can do this?
This is shotgun metagenome data, and my goal is taxonomic classification: align the reads with Diamond, then import the output into MEGAN6 Community Edition to classify them. But the larger FASTQ files take a very long time to produce Diamond blast output, and that output is itself several GB in size, which is difficult for MEGAN to analyse and hangs my system.
By size, probably not, but there are many options to select by read number; testing a few sizes will help you home in on the number that comes closest to the size you want (Google either of the tools seqtk or seqkit for more details).
seqtk sample
prints:
Usage: seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>
Options: -s INT       RNG seed [11]
         -2           2-pass mode: twice as slow but with much reduced memory
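For example, to take the same random subset of reads from both files of a pair, something along these lines should work (file names and the read count are placeholders; using the same seed for R1 and R2 keeps the mates in sync):
# 10M reads per file is a placeholder; adjust until the output is close to the size you want
seqtk sample -s11 reads_R1.fastq.gz 10000000 > sub_R1.fastq
seqtk sample -s11 reads_R2.fastq.gz 10000000 > sub_R2.fastq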
or using seqkit:
seqkit sample -h
prints:
sample sequences by number or proportion.
Usage:
seqkit sample [flags]
Flags:
-h, --help help for sample
-n, --number int sample by number (result may not exactly match)
-p, --proportion float sample by proportion
-s, --rand-seed int rand seed (default 11)
-2, --two-pass 2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
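For example, to keep roughly a third of the reads by proportion (file names and the fraction are placeholders; again, the same seed on both files of a pair keeps them paired):
# -p is the fraction of reads to keep; -s sets the seed so the run is reproducible
seqkit sample -p 0.3 -s 11 reads_R1.fastq.gz -o sub_R1.fastq.gz
seqkit sample -p 0.3 -s 11 reads_R2.fastq.gz -o sub_R2.fastq.gz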
Hi Istvan Albert
Thanks for your response. I have one more question, which is a bit off topic: do you have any idea what E-value I should use when running Diamond blastx against the nr database? The default is 1e-3. I want to find hits for all of my metagenome reads. How would it affect my results if I set the E-value to 1e-20 to get results faster?
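For context, my command is roughly of this form (paths are placeholders; --outfmt 100 writes the DAA file that I later load into MEGAN, and --evalue is the setting I am asking about):
# nr.dmnd is a placeholder for a database built beforehand with diamond makedb
diamond blastx --db nr.dmnd --query reads.fastq.gz --out reads_vs_nr.daa --outfmt 100 --evalue 1e-3 --threads 16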
You may be asking about one of two different operations: down-sampling and normalization are not the same thing.
You can use reformat.sh from the BBMap suite to simply down-sample the larger file. The following options are relevant (an example command follows the list).
reads=-1 Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1 Skip (discard) this many INPUT reads before processing the rest.
samplerate=1 Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1 Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0 (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0 (sbt) Exact number of OUTPUT bases desired.
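As a concrete example, something like this should produce a down-sampled pair of roughly the desired size (file names, base target, and seed are placeholders; samplebasestarget counts bases rather than file bytes, since each FASTQ record also carries a header and a quality string, so some trial and error will be needed to land near the file size you want):
# base target of 2 billion is a placeholder; adjust until the output file size is close to ~5 GB
reformat.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=sub_R1.fq.gz out2=sub_R2.fq.gz samplebasestarget=2000000000 sampleseed=42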
Depending on the aim of the analysis, it may be more appropriate to normalize the data using bbnorm.sh. A guide is available.
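If normalization turns out to be the better fit, a minimal bbnorm.sh run looks roughly like this (file names are placeholders; target and min here are the documented defaults for target coverage and minimum depth):
# normalizes to ~100x coverage and discards reads with apparent depth below 5
bbnorm.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=norm_R1.fq.gz out2=norm_R2.fq.gz target=100 min=5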
What type of data is this, and what is the ultimate goal of the analysis?
This is shotgun metagenome data, and the goal is taxonomic classification: align the reads with Diamond, then import the output into MEGAN6 Community Edition. The larger FASTQ files take a very long time in Diamond, and they produce output several GB in size that is difficult for MEGAN to analyse; it hangs my system.
thanks for responding :-)