Hi, I am trying to split a fastq stream into files with a defined number of reads using a Python script, and to start a mapping job on each newly created file immediately. All of this runs on an SGE cluster, which is why I want to map the reads in small chunks in parallel. The problem I encounter is that one of the files I generate always starts somewhere inside a read rather than at its beginning. The break can fall at different positions within the read and can happen with a different file number each time, and because of that the mapping obviously fails. This is my script:
import argparse
import os
from os import path
from subprocess import call
import sys
parser = argparse.ArgumentParser()
parser.add_argument("outfolder", help="folder to store new files")
parser.add_argument("prefix", help="prefix for the new filename")
parser.add_argument("N_reads", help="number of reads per file",
default=1000000, type=int)
args = parser.parse_args()
def read_write():
    i = 0
    n = 0
    filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
    f = open(filename, 'w')
    print('Writing to ' + filename)
    try:
        for line in sys.stdin:
            f.write(line)
            i += 1
            # one fastq record is 4 lines, so a full chunk is N_reads*4 lines
            if i == args.N_reads*4:
                f.close()
                print('Mapping', filename)
                # submit the mapping job for the finished chunk via the head node
                call('ssh headnode1.ecdf.ed.ac.uk "cd /exports/igmm/datastore/wendy-lab/ilia/scripts; qsub launcher.sh ' + filename + '"', shell=True)
                n += 1
                filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
                i = 0
                f = open(filename, 'w')
                print('Writing to ' + filename)
    except EOFError:
        f.close()
        if i == 0:
            os.remove(filename)
            print('Removed empty file')
        print('Done')

read_write()
The only assumption I make is that every read occupies exactly four lines, with no blank lines or other oddities. The files were generated by Illumina software; I concatenated the gzipped files and now decompress them with pigz to stdout, and my script reads the stream from stdin. So I think this assumption should hold. Is there anything else I am missing here?
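For reference, I start the whole thing roughly like this (the script and file names here are just placeholders):

# decompress the concatenated fastq files and pipe the stream into the splitting script
pigz -dc lane1.fastq.gz lane2.fastq.gz | python split_and_map.py /path/to/outfolder sample1 1000000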
Use split: http://linux.die.net/man/1/split
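For example, since each read is four lines, something along these lines should give chunks of 1,000,000 reads (file and prefix names are just illustrative):

# read the decompressed stream from stdin and write numbered 4,000,000-line chunks
pigz -dc all_reads.fastq.gz | split -l 4000000 -d - chunk_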
Thanks, I know about this, but I want to start mapping while I'm still splitting the file. Maybe that's too much to ask, of course...
Do you think the time lost splitting then mapping is going to be smaller or greater than the time spent figuring out how to split-and-map on the fly? :)
Also, plan for what happens when a mapper or node goes down in the middle of processing: if you split and map on the fly, you'll have to start over from the beginning.
My guess is that it will be smaller, and you are probably right that I should just forget about this. But quite apart from the benefit of splitting on the fly, I am really surprised this is not working as it should, and I would like to understand why...
How about splitting the file with xargs?
Can you please provide an example? I am not very familiar with bash unfortunately.
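Sure, something like this (an untested sketch; it assumes GNU coreutils and that launcher.sh takes the chunk file as its only argument):

# split the stream into 1,000,000-read (4,000,000-line) numbered chunks...
pigz -dc all_reads.fastq.gz | split -l 4000000 -d - chunk_
# ...then submit one mapping job per chunk
ls chunk_* | xargs -I {} qsub launcher.sh {}

With GNU split you can even submit each chunk as soon as it is written by using --filter, e.g. --filter='cat > $FILE && qsub launcher.sh $FILE', so you don't have to wait for the whole stream to be split.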
Thanks! That's interesting!
Don't know. My guess is that it is a network filesystem problem. However, these days the better way to map reads is to process them all in one go with multiple threads. As for reading compressed files, many mappers handle that natively or can read the data from a stream.
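For example, with bwa mem (just one of the mappers that read gzipped fastq directly; this assumes the reference has already been indexed with bwa index):

# multi-threaded mapping straight from the gzipped fastq, no splitting needed
bwa mem -t 16 reference.fa all_reads.fastq.gz > all_reads.sam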