Question

Randomly Split A Fastq File

0

11.2 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

We have one fastq file that we would like to split into three smaller fastq files. This could probably be done with the split command (using a line count that is a multiple of 4).

But what we would actually like to do is create ten sets of triplicates from this one fastq file. So I would like to know whether there is a way of splitting a fastq file randomly while still keeping the four-line structure of each record.

Another way to do it would be to use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

Thanks in advance for any idea.

Assa

fastq split • 5.9k views

updated 3.0 years ago by Ram 44k • written 11.2 years ago by Assa Yeroslaviz ★ 1.9k

Ram · Answer 1 · 2013-09-19

3

11.2 years ago

brentp 24k

Here is one solution:

updated 3.0 years ago by Ram 44k • written 11.2 years ago by brentp 24k

0

Thanks for the script. It seems to work, though I am getting an error after a few minutes.

As the fastq file is gzipped, this is the command I'm using:

python  SplitReads.py. fastq.gz 10 3

After a few minutes I get a chunk size message:

chunk_size: 3436054

But then the script stops without an error message, showing only this traceback:

 Traceback (most recent call last):
   File "SplitFastqFile.py", line 61, in <module>
        fqsplit(fq, nchunks, nreps)
   File "SplitFastqFile.py", line 49, in fqsplit
        for i, fqr in zip(ints, fqiter(fq)):
   File "SplitFastqFile.py", line 24, in fqiter
        with xopen(fq) as fh:

Is it a memory problem? I hope you can help

Thanks, Assa

written 11.1 years ago by Assa Yeroslaviz ★ 1.9k

0

I updated the script just now (to use izip in place of zip). Give it another try.

written 11.1 years ago by brentp 24k

1

No, it is still not working. I can run it with the unzipped files, but not with the gzipped ones. I can't understand why.

written 11.1 years ago by Assa Yeroslaviz ★ 1.9k
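
For anyone hitting the same gzipped-versus-plain problem: the sketch below is not the script from this answer, just one possible way to read four-line records from either kind of file in Python 3. The helper names open_fastq and fastq_records are made up for illustration, and the code assumes every record is exactly four lines with no blank lines at the end of the file.

```python
import gzip
import itertools


def open_fastq(path):
    """Open a FASTQ file in text mode, whether it is gzipped or plain.

    The decision is based on the two gzip magic bytes at the start of
    the file rather than on the file extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def fastq_records(path):
    """Yield each read as a tuple of its four lines (newlines kept)."""
    with open_fastq(path) as handle:
        while True:
            record = tuple(itertools.islice(handle, 4))
            if not record:
                break  # clean end of file
            if len(record) != 4:
                raise ValueError("truncated FASTQ record in " + path)
            yield record
```

Opening the gzipped file with mode "rt" matters: without the "t", gzip.open returns bytes rather than text.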

score 1 · Answer 2 · 2013-09-19

1

11.2 years ago

cts ★ 1.7k

You could select random samples of the reads using seqtk

written 11.2 years ago by cts ★ 1.7k

1

Yes, but I don't want to just extract a specific number of reads from a file. I would like to split the file into three parts, so that the same read never appears in two different samples of one triplicate. With seqtk I can extract a subsample, but if I do it twice the two files may share reads.

written 11.2 years ago by Assa Yeroslaviz ★ 1.9k
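
To illustrate the requirement described in this comment (every read ends up in exactly one of the three parts), here is a minimal Python sketch. It is not taken from any answer in this thread. The function name split_fastq and the output naming scheme are hypothetical, and it assumes an uncompressed input file with four-line records.

```python
import itertools
import random


def split_fastq(in_path, out_prefix, n_parts=3, seed=None):
    """Stream a FASTQ file and send each read to one randomly chosen part.

    Each read is written exactly once, so the parts are disjoint.
    Part sizes are only approximately equal.
    """
    rng = random.Random(seed)
    # Hypothetical naming scheme: <prefix>.part1.fastq, <prefix>.part2.fastq, ...
    out_paths = [f"{out_prefix}.part{k + 1}.fastq" for k in range(n_parts)]
    outputs = [open(p, "w") for p in out_paths]
    try:
        with open(in_path) as handle:
            while True:
                record = list(itertools.islice(handle, 4))
                if not record:
                    break
                # Pick one output file at random for this whole record.
                rng.choice(outputs).writelines(record)
    finally:
        for out in outputs:
            out.close()
    return out_paths
```

Calling split_fastq("reads.fastq", f"rep{r}", seed=r) for r from 1 to 10 would give ten different triplicates (reads.fastq is a placeholder name). Only one record is held in memory at a time.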

0

This answer is wrong and should be given -1.

written 7.2 years ago by scchess ▴ 640

score 1 · Answer 3 · 2013-09-19

Another way to do it is to just use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

To retrieve any read in constant time, you could pull the file into memory as an array and store, for each read, the byte offset of the newline character that precedes its start.

While reading the FASTQ file into memory, you can strip the newlines between reads, since the offsets are kept in an index-to-offset hash table.

Then, generally:

1. Count the number of lines in the file (4n) and divide by four to get the number of reads (n).
2. Build a list of the indices 1..n.
3. Permute that list.
4. To extract reads, iterate through the permuted list and, for each index i, extract the four lines between the byte offset stored for index i and the byte offset stored for index i+1.

A lot of scripting languages have efficient permutation libraries (example).
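
The steps above could be sketched in Python roughly as follows. This is a variant rather than an exact implementation of the description: it keeps only the byte offsets in memory and seeks into the file on disk instead of holding the whole file in memory. The function names index_reads and shuffled_chunks and the output file names are hypothetical, and the input is assumed to be an uncompressed FASTQ file with four-line records.

```python
import random


def index_reads(path):
    """Return the byte offset at which each four-line read starts."""
    offsets = []
    with open(path, "rb") as handle:
        line_number = 0
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line_number % 4 == 0:  # first line of a record
                offsets.append(position)
            line_number += 1
    return offsets


def shuffled_chunks(path, out_prefix, n_chunks=3, seed=None):
    """Permute the read offsets, then deal the reads out into n_chunks files."""
    offsets = index_reads(path)
    random.Random(seed).shuffle(offsets)  # the permutation step
    out_paths = []
    with open(path, "rb") as source:
        for k in range(n_chunks):
            # Hypothetical naming scheme: <prefix>.chunk1.fastq, ...
            out_path = f"{out_prefix}.chunk{k + 1}.fastq"
            with open(out_path, "wb") as out:
                # Every n_chunks-th offset of the permuted list goes to chunk k.
                for offset in offsets[k::n_chunks]:
                    source.seek(offset)
                    for _ in range(4):
                        line = source.readline()
                        # Guard against a final line without a newline.
                        out.write(line if line.endswith(b"\n") else line + b"\n")
            out_paths.append(out_path)
    return out_paths
```

Because the permuted list is dealt out in strides, the chunk sizes differ by at most one read and no read appears in two chunks. Running it with ten different seeds would give ten different sets of chunks.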