Split Super Large Files
13.1 years ago
Bioscientist ★ 1.7k

I think we all come across large files sooner or later, say, fastq files. Right now I have some really large fastq files, around 10 GB each, which I need to split into smaller files first. I've written a Python script to do this, but its algorithm reads the whole file into memory (similar to readlines(), or capturing the output of zcat), which exhausts the memory on the cluster node (60 GB, and still not enough).

Just wondering: is there an approach that doesn't load the whole file, but instead reads it line by line? Would anyone like to share a script for splitting?

BTW, I'm not doing BWA paired-end mapping, but I'm curious: is it possible to run BWA on an input file around 10 GB in size? Thanks.

Actually I'm using Python to split. One of the key commands is:

input = commands.getoutput('zcat ' + fastqfile).splitlines(True)

This seems a bit faster than readlines(), but the idea is still to build a list (what Perl calls an array) of all the lines, so that I can access a specific line by index, say list[1000].

I've successfully run BWA with files over 20 GB (compressed) in size.

Also, what are you trying to do? 60 GB is more than enough for most alignment needs unless you have a really large genome, whereas it may fall short for assembly, and splitting your reads would not help much there.

I've successfully run BWA on compressed fastq files over 20GB in size (around 40 GB uncompressed) on a machine with 18GB of RAM.

Python has a gzip module, so there's no need for the zcat call.
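
A minimal sketch of that idea (Python 3 syntax; the filename is a placeholder): gzip.open returns a file object you can iterate over one line at a time, so only the current line is held in memory.

import gzip

# Stream a gzipped file line by line: no zcat subprocess,
# and nothing beyond the current line is kept in memory.
count = 0
with gzip.open('reads.fastq.gz', 'rt') as handle:
    for line in handle:
        count += 1
print(count, 'lines')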

13.1 years ago
brentp 24k

I agree with Wen.Huang that it's not that large, but splitting will help if you want to parallelize.

You definitely do not need to read the entire file into memory. There's a Unix command, split, that does exactly what you want. Here's an example.

# make a fake example file.
for i in `seq 10000`; do echo $i >> t.fake; done

# should be a multiple of 4 for fastq
LINES_PER_FILE=1000

split -l $LINES_PER_FILE t.fake output_prefix
ls output_prefix*

This will create:

output_prefixaa  output_prefixac  output_prefixae  output_prefixag  output_prefixai
output_prefixab  output_prefixad  output_prefixaf  output_prefixah  output_prefixaj

There is also csplit, which splits files based on context, i.e. line patterns.
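
For example, a sketch assuming GNU csplit (filenames are placeholders), splitting a fasta file at every header line:

# write one chunk_NN file per record, splitting before each line starting with ">"
csplit -z -f chunk_ big.fa '/^>/' '{*}'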

Hmm, didn't know about csplit. Thanks.

Thanks. But my files are compressed, so I can't use the split command on them directly.

@bioscientist

gunzip -c input.fastq.gz | split -l 2000 - output_prefix
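
If you want the pieces compressed again, GNU split (coreutils 8.13 and later) can pipe each chunk through a filter; a sketch, reusing the placeholder names above:

gunzip -c input.fastq.gz | split -l 2000 --filter='gzip > $FILE.gz' - output_prefix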

Great, thanks! Works perfectly!

13.1 years ago
Wen.Huang ★ 1.2k

1) 10 GB is not super large.
2) Almost all programming languages can read line by line. For example, you could simply create a few filehandles with Perl, read one line at a time, and write through them iteratively (see the sketch below).
3) Yes, BWA can handle a 10 GB input.
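
A minimal Python sketch of point 2 (Python 3 syntax; the filenames and line count are placeholders): read the gzipped input one line at a time and rotate to a new output file every LINES_PER_FILE lines, so memory use stays constant regardless of file size.

import gzip

LINES_PER_FILE = 4000000  # keep this a multiple of 4 for fastq

out = None
with gzip.open('reads.fastq.gz', 'rt') as handle:
    for i, line in enumerate(handle):
        if i % LINES_PER_FILE == 0:  # time to start a new chunk
            if out is not None:
                out.close()
            out = open('chunk_%04d.fastq' % (i // LINES_PER_FILE), 'w')
        out.write(line)
if out is not None:
    out.close()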

13.1 years ago
Gjain 5.8k

A small Perl example of what brentp mentioned (shelling out to split):

## Split the file into 10 kB pieces for sorting
system("split -b 10k $file_path$output_file $output_dir$split_file_name");

## Split the file into 100000-line pieces for sorting
system("split -l 100000 $file_path$output_file $output_dir$split_file_name");

Hope this helps.

This is not really a Perl example; it's a shell command being called from Perl ...

Thank you for clearing that up. I know it's a shell example invoked from Perl; I'll mention that clearly in future.
