Randomly Split A Fastq File

0

11.8 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

We have one fastq file, which we would like to split into three smaller fastq files. This could probably be done with the split command (with a line count that is a multiple of 4).

But what we would like to do is create 10 sets of triplicates from this one fastq file. So I would like to know if there is a way of splitting a fastq file randomly while still keeping the four-line structure of each record.

Another way to do it is to just use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

Thanks in advance for any ideas.

Assa

fastq split • 6.4k views

ADD COMMENT • link updated 3.7 years ago by Ram 45k • written 11.8 years ago by Assa Yeroslaviz ★ 1.9k

3

11.8 years ago

brentp 24k

Here is one solution:

	"""
	split a single fastq file into random, non-overlapping subsets
	arguments:
	+ fastq file
	+ number of splits
	+ number of reps

	e.g.:

	python fq.split.py input.fastq 3 4

	will create 12 new files in 4 sets of 3. Each
	set of 3 will contain all of the original records.
	"""

	import gzip
	import random
	import sys
	from itertools import islice


	def xopen(fq):
	    # open a plain or gzipped fastq file in text mode
	    return gzip.open(fq, "rt") if fq.endswith(".gz") else open(fq)


	def fqiter(fq, n=4):
	    # yield one record (a list of n lines) at a time, skipping blank lines
	    with xopen(fq) as fh:
	        fqclean = (x.strip("\r\n") for x in fh if x.strip())
	        while True:
	            rec = list(islice(fqclean, n))
	            if not rec:
	                return
	            assert all(rec) and len(rec) == n
	            yield rec


	def fqsplit(fq, nchunks, nreps, prefix=None):
	    if prefix is None:
	        prefix = fq + ".split"
	    prefix += "chunk-%i.rep-%i.fq"

	    # count non-blank lines so the total matches what fqiter yields
	    with xopen(fq) as fh:
	        fq_size = sum(1 for x in fh if x.strip())
	    assert fq_size % 4 == 0
	    fq_size //= 4  # number of records

	    chunk_size = 1 + fq_size // nchunks
	    print("chunk_size:", chunk_size, file=sys.stderr)

	    for rep in range(1, nreps + 1):
	        files = [open(prefix % (c, rep), "w") for c in range(1, nchunks + 1)]
	        # a fresh random permutation of record indices for every rep
	        ints = list(range(fq_size))
	        random.shuffle(ints)

	        # zip() is lazy in Python 3, like itertools.izip in Python 2
	        for i, fqr in zip(ints, fqiter(fq)):
	            chunk = i // chunk_size
	            print("\n".join(fqr), file=files[chunk])
	        for f in files:
	            f.close()


	if __name__ == "__main__":
	    fq = sys.argv[1]
	    nchunks = int(sys.argv[2])
	    nreps = int(sys.argv[3])
	    fqsplit(fq, nchunks, nreps)

ADD COMMENT • link updated 3.7 years ago by Ram 45k • written 11.8 years ago by brentp 24k

0

Thanks for the script. It seems to work, though I am getting an error after a few minutes.

As the fastq file is gzipped, this is the command I'm using:

python  SplitReads.py. fastq.gz 10 3

After a few minutes I get a chunk size message:

chunk_size: 3436054

But then the script stops without an error message, showing only this traceback:

 Traceback (most recent call last):
   File "SplitFastqFile.py", line 61, in <module>
        fqsplit(fq, nchunks, nreps)
   File "SplitFastqFile.py", line 49, in fqsplit
        for i, fqr in zip(ints, fqiter(fq)):
   File "SplitFastqFile.py", line 24, in fqiter
        with xopen(fq) as fh:

Is it a memory problem? I hope you can help

Thanks, Assa

ADD REPLY • link 11.7 years ago by Assa Yeroslaviz ★ 1.9k

0

I updated the script just now (to use izip in place of zip). Give it another try.

ADD REPLY • link 11.7 years ago by brentp 24k

1

No, it is still not working. I can run it with the unzipped files, but not with the gzipped ones. I can't understand why.
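
One possible explanation, offered only as an assumption because the traceback above is cut off before the exception message: the last frame is `with xopen(fq) as fh:`, and `gzip.GzipFile` only gained support for the `with` statement in Python 2.7. On an older interpreter, plain files would open fine and gzipped ones would fail at exactly that line. A minimal Python 3 sketch of an opener that does not depend on that support, by wrapping the handle in `contextlib.closing`; the helper names `open_fastq` and `count_records` are hypothetical:

```python
import gzip
from contextlib import closing


def open_fastq(path):
    """Return a context manager yielding a text-mode handle (hypothetical helper)."""
    if path.endswith(".gz"):
        handle = gzip.open(path, "rt")  # "rt": decompress and decode to text
    else:
        handle = open(path, "r")
    # closing() only needs handle.close(), so it works even when the
    # handle itself is not a context manager
    return closing(handle)


def count_records(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with open_fastq(path) as fh:
        n_lines = sum(1 for line in fh if line.strip())
    if n_lines % 4:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4
```

If both the plain and the gzipped copy give the same count with this opener, the file itself is probably fine and the problem lies in how it was opened.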

ADD REPLY • link 11.7 years ago by Assa Yeroslaviz ★ 1.9k

1

11.8 years ago

cts ★ 1.7k

You could select random samples of the reads using seqtk.
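
To illustrate what drawing a random sample of reads involves, here is a minimal Python sketch using reservoir sampling. It is not seqtk's code or interface; the function names `read_fastq` and `sample_reads` are hypothetical, and the input is assumed to be a plain four-line-per-record FASTQ with no blank lines.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield records as tuples of four lines (line endings stripped)."""
    while True:
        rec = tuple(line.rstrip("\r\n") for line in islice(handle, 4))
        if not rec:
            return
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def sample_reads(handle, k, seed=None):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(read_fastq(handle)):
        if i < k:
            reservoir.append(rec)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec  # replace with probability k / (i + 1)
    return reservoir
```

Only `k` records are held in memory at a time, so this works on files much larger than RAM. Two calls with different seeds are independent draws, so the same read can appear in both samples.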

ADD COMMENT • link 11.8 years ago by cts ★ 1.7k

1

Yes, but I don't want to just extract a specific number of reads from a file. I would like to split the file into three parts, so that the same read never appears in two different samples of one triplicate. With seqtk I can extract a subsample, but if I do it twice there might be reads that occur in both files.
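
A non-overlapping split of this kind can be sketched in Python by sending each read to exactly one output file, chosen at random. This is only a sketch under assumptions: the function name `partition_fastq` is hypothetical, the input is an uncompressed four-line-per-record FASTQ with no blank lines, and the parts come out approximately, not exactly, equal in size.

```python
import random
from itertools import islice


def partition_fastq(in_path, out_paths, seed=None):
    """Send every record of in_path to exactly one file in out_paths."""
    rng = random.Random(seed)
    outs = [open(p, "w") for p in out_paths]
    counts = [0] * len(outs)
    try:
        with open(in_path) as fh:
            while True:
                rec = list(islice(fh, 4))  # next four lines = one record
                if not rec:
                    break
                if len(rec) != 4:
                    raise ValueError("truncated FASTQ record")
                k = rng.randrange(len(outs))  # pick one output at random
                outs[k].writelines(rec)
                counts[k] += 1
    finally:
        for f in outs:
            f.close()
    return counts
```

Calling it ten times with ten different seeds and ten sets of output names would give ten triplicates, each of which contains every original read exactly once.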

ADD REPLY • link 11.8 years ago by Assa Yeroslaviz ★ 1.9k

0

This answer is wrong and should be given -1.

ADD REPLY • link 7.8 years ago by scchess ▴ 640

1

11.8 years ago

Alex Reynolds 36k

Another way to do it is to just use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

To retrieve any read in constant time, you could pull the file into memory as an array and store, for each read, the byte offset of the newline character that precedes it.

While reading the FASTQ file into memory, you can strip the newlines between reads, since the offsets are kept in an index-to-offset hash table.

Then, generally:

1. Count the number of lines in the file (4n) and divide by four to get the number of reads (n).
2. Build a list of indices {1..n}.
3. Permute that list.
4. To extract reads, iterate through the permuted list and, for each index i, extract the four lines that lie between the byte offset stored for index i and the byte offset stored for index i+1.

A lot of scripting languages have efficient permutation libraries (example).
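
The steps above can be sketched in Python roughly as follows. This is a variation rather than a literal implementation: it keeps only the byte offsets in memory and seeks in the file on disk instead of loading all reads into an array. The function names are hypothetical, and the input is assumed to be an uncompressed four-line-per-record FASTQ with no blank lines.

```python
import random


def index_fastq(path):
    """Return the byte offset at which each four-line record starts."""
    offsets = []
    with open(path, "rb") as fh:
        pos = 0
        for line_no, line in enumerate(fh):
            if line_no % 4 == 0:
                offsets.append(pos)
            pos += len(line)
    return offsets


def read_record(fh, offset):
    """Seek to offset and return the record's four lines, newline-terminated."""
    fh.seek(offset)
    lines = [fh.readline() for _ in range(4)]
    return [ln if ln.endswith(b"\n") else ln + b"\n" for ln in lines]


def shuffled_split(path, out_paths, seed=None):
    """Write the records in random order, dealt evenly across out_paths."""
    offsets = index_fastq(path)
    order = list(range(len(offsets)))
    random.Random(seed).shuffle(order)  # the permutation step
    outs = [open(p, "wb") for p in out_paths]
    try:
        with open(path, "rb") as fh:
            for rank, idx in enumerate(order):
                rec = read_record(fh, offsets[idx])
                outs[rank % len(outs)].writelines(rec)
    finally:
        for f in outs:
            f.close()
    return len(offsets)
```

Dealing the permuted reads round-robin means the output files differ in size by at most one read. The random seeks are slow on spinning disks, so for files that fit in RAM the fully in-memory version may be faster.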

ADD COMMENT • link 11.8 years ago by Alex Reynolds 36k
