Question

Extracting randomly subset of fastq reads from a huge file

1

Entering edit mode

7.6 years ago

KVC_bioinfo ▴ 600

Hello,

I am trying to extract a random subset of paired end fastq reads from a huge 5Gb file. I tried using seqtk and zcat. However, it is taking long time.

The command I used:

zcat huge_file.fastq.gz | sort -R | head -1000 > reads.sample_random.fq

Could someone please help me? Thank you in advance.

random fastq extract • 14k views

ADD COMMENT • link updated 7.6 years ago by Bioinformatics_NewComer ▴ 330 • written 7.6 years ago by KVC_bioinfo ▴ 600

1

Entering edit mode

Ummm, won't sort -R completely screw everything up (and be slow and use a LOT of memory)? Define "a long time", you're mostly limited by IO and gzip.

ADD REPLY • link 7.6 years ago by Devon Ryan 105k

0

Entering edit mode

long time: more than 2 hours

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 600

0

Entering edit mode

I found these commands. However, what is -s (seed)? And is it sampling randomly?

Thank you in advance.

Apply a seed to extract the same reads from two, paired end files:

seqtk sample -s 10 /ebs/ecoli/SRR001666_1.fastq.gz 1000 > SRR001666_1_1000.fastq

seqtk sample -s 10 /ebs/ecoli/SRR001666_2.fastq.gz 1000 > SRR001666_2_1000.fastq

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 600

0

Entering edit mode

In this particular case the same seed is needed to keep the pairing of reads (i.e. extract the same read pair from two files).

ADD REPLY • link 7.6 years ago by GenoMax 150k

0

Entering edit mode

Thank you. But what does the number after s represent?

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 600

1

Entering edit mode

It's a random seed. It doesn't matter what number you use, you just have to use the same one (I usually use 1234 for things like this).

ADD REPLY • link 7.6 years ago by Devon Ryan 105k

0

Entering edit mode

Got it. thank you very much.

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 600

0

Entering edit mode

Worked perfectly for me. Thank you, everyone, for responses.

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 600

score 5 · Answer 1 · 2017-08-25

5

Entering edit mode

7.6 years ago

GenoMax 150k

reformat.sh from BBMap suite.

reformat.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=Sampled_R1.fq.gz out2=Sampled_R2.fq.gz parameters_below

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.

ADD COMMENT • link 7.6 years ago by GenoMax 150k

1

Entering edit mode

Ah, I got ninja'd :)

ADD REPLY • link 7.6 years ago by Brian Bushnell 20k

0

Entering edit mode

So shall I use reads=100 (no of reads I want to extract?) parameter?

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 600

2

Entering edit mode

"reads=100" means it will sample from 100 reads. So if you set"reads=100 samplerate=0.5" you'll get approximately 50 reads, sampled from the first 100 reads in the file. Whereas if you set "samplereadstarget=100" you will get exactly 100 reads, sampled from the full file.

Note that if your reads are paired, these will be the number of PAIRS you get, so the number of reads would be twice that.

ADD REPLY • link 7.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you for the explanation.

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 600

score 1 · Answer 2 · 2017-08-25

I assume huge_file.fastq.gz is interleaved?

Using BBMap, you can do this:

reformat.sh in=huge_file.fastq.gz out=subsampled.fastq.gz samplerate=0.1 interleaved

With 2 files you can use in1= and in2= instead. There's also "samplereadstarget=X" if you want a specific number of reads (for paired data, it will give you that number of pairs).

score 1 · Answer 3 · 2017-08-27

1

Entering edit mode

7.6 years ago

Bioinformatics_NewComer ▴ 330

See if you'd like to try this with VSEARCH.

ADD COMMENT • link 7.6 years ago by Bioinformatics_NewComer ▴ 330

0

Entering edit mode

I think VSEARCH subsampling doesn't work with fastq files, only with fasta.

ADD REPLY • link 7.6 years ago by h.mon 35k

0

Entering edit mode

It allows:

    Subsampling
  --fastx_subsample FILENAME  subsample sequences from given FASTA/FASTQ file

I'm using: vsearch v2.0.3

ADD REPLY • link 7.6 years ago by Bioinformatics_NewComer ▴ 330