Split Super Large Files
13.1 years ago
Bioscientist ★ 1.7k

I think we all come across large files sooner or later, say, fastq files. Right now I have some really large fastq files, around 10 GB each, which I need to split into smaller files first. I've written a Python script to do this, but its algorithm reads the whole file into memory (similar to readlines(), or capturing the output of zcat), which exhausts the memory on the cluster node (60 GB, and still not enough).

Just wondering: is there an approach that doesn't load the whole file, but instead reads it line by line? Would anyone like to share a script for splitting?

BTW, I'm not doing BWA paired-end mapping, but I'm curious: is it possible to run BWA on an input file around 10 GB in size? Thanks.

Actually I'm using Python to split. One of the key commands is:

input = commands.getoutput('zcat ' + fastqfile).splitlines(True)

This seems a bit faster than readlines(), but the idea is still to build a list (what Perl calls an array) of all the lines, so that I can access a specific line by index, say list[1000].

I've successfully run BWA with files over 20 GB (compressed) in size.

Also, what are you trying to do? 60 GB is more than enough for most alignment needs unless you have a really large genome, whereas it may fall short for assembly, and splitting your reads would not help much there.

I've successfully run BWA on compressed fastq files over 20GB in size (around 40 GB uncompressed) on a machine with 18GB of RAM.

Python has a gzip module, so there's no need for the zcat call.
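
A minimal sketch of that idea (Python 3 syntax; the filename is a placeholder): gzip.open returns a file object you can iterate over one line at a time, so only the current line is held in memory.

import gzip

# Stream a gzipped file line by line: no zcat subprocess,
# and nothing beyond the current line is kept in memory.
count = 0
with gzip.open('reads.fastq.gz', 'rt') as handle:
    for line in handle:
        count += 1
print(count, 'lines')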

13.1 years ago
brentp 24k

I agree with Wen.Huang that it's not that large, but splitting will help if you want to parallelize.

You definitely do not need to read the entire file into memory. There's a Unix command, split, that does exactly what you want. Here's an example.

# make a fake example file.
for i in `seq 10000`; do echo $i >> t.fake; done

# should be a multiple of 4 for fastq
LINES_PER_FILE=1000

split -l $LINES_PER_FILE t.fake output_prefix
ls output_prefix*

This will create:

output_prefixaa  output_prefixac  output_prefixae  output_prefixag  output_prefixai
output_prefixab  output_prefixad  output_prefixaf  output_prefixah  output_prefixaj

There is also csplit, which splits files based on context, i.e. line patterns.
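
For example, a sketch assuming GNU csplit (filenames are placeholders), splitting a fasta file at every header line:

# write one chunk_NN file per record, splitting before each line starting with ">"
csplit -z -f chunk_ big.fa '/^>/' '{*}'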

Hmm, didn't know about csplit. Thanks.

Thanks. But my files are compressed, so I can't use the split command on them directly.

@bioscientist

gunzip -c input.fastq.gz | split -l 2000 - output_prefix
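
If you want the pieces compressed again, GNU split (coreutils 8.13 and later) can pipe each chunk through a filter; a sketch, reusing the placeholder names above:

gunzip -c input.fastq.gz | split -l 2000 --filter='gzip > $FILE.gz' - output_prefix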

Great, thanks! Works perfectly!

13.1 years ago
Wen.Huang ★ 1.2k

1) 10 GB is not super large.
2) Almost all programming languages can read line by line. For example, you could simply create a few filehandles with Perl, read one line at a time, and write through them iteratively (see the sketch below).
3) Yes, BWA can handle a 10 GB input.
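
A minimal Python sketch of point 2 (Python 3 syntax; the filenames and line count are placeholders): read the gzipped input one line at a time and rotate to a new output file every LINES_PER_FILE lines, so memory use stays constant regardless of file size.

import gzip

LINES_PER_FILE = 4000000  # keep this a multiple of 4 for fastq

out = None
with gzip.open('reads.fastq.gz', 'rt') as handle:
    for i, line in enumerate(handle):
        if i % LINES_PER_FILE == 0:  # time to start a new chunk
            if out is not None:
                out.close()
            out = open('chunk_%04d.fastq' % (i // LINES_PER_FILE), 'w')
        out.write(line)
if out is not None:
    out.close()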

13.1 years ago
Gjain 5.8k

A small Perl example of what brentp mentioned (shelling out to split):

## Split the file into 10 kB pieces for sorting
system("split -b 10k $file_path$output_file $output_dir$split_file_name");

## Split the file into 100000-line pieces for sorting
system("split -l 100000 $file_path$output_file $output_dir$split_file_name");

Hope this helps.

This is not really a Perl example; it's a shell command being called from Perl ...

Thank you for clearing that up. I know it's a shell example invoked from Perl; I'll mention that clearly in future.
