I would like to count the number of reads within a genomic region.
This is the command I would like to use:

samtools view -f 0x02 -q 40 in.bam 1:10000000-10001000 | sort | uniq | wc -l

I add the sort and uniq because in some instances I might be counting the same read twice.
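If the double counting I am worried about comes from the two mates of a properly paired read (same read name on two different lines), then deduplicating on the read name, i.e. the first SAM column, rather than on the whole line might be closer to what I want. A sketch of that variant, assuming that is the cause:

samtools view -f 0x02 -q 40 in.bam 1:10000000-10001000 | cut -f1 | sort -u | wc -l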
My problem is that I need to count the reads at this position and 500k more positions, across over 200 samples.
As it stands, my code is written like this:
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(16);   # fork up to 16 processes, one per sample

foreach my $bam_file (@bam_files) {
    $pm->start and next;                   # parent moves on; child runs the loop below
    open my $out, '>', "${bam_file}_out.txt" or die "Cannot open output: $!";
    foreach my $pos (@positions) {
        my $command = "samtools view -f 0x02 -q 40 $bam_file $pos | sort | uniq | wc -l";
        chomp(my $count = `$command`);     # run the pipeline and capture the count
        print $out $pos, "\t", $count, "\n";
    }
    close $out;
    $pm->finish;                           # end the child process
}
$pm->wait_all_children;
I am parallelizing the process in sets of 16 samples and writing out to separate files. For some reason my code runs fast at first and then slows down.
I tried working with MPI but to no avail (samtools is not set up to run under MPI on my cluster). I know some C and Python. Any advice?
I want my code to run as fast as possible, please help!
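One option I am considering, assuming the duplicate counting turns out not to matter for a given region, is letting samtools count the records itself with -c, which avoids spawning sort, uniq, and wc for every one of the 500k regions:

samtools view -c -f 0x02 -q 40 in.bam 1:10000000-10001000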
I don't think my cluster is SGE/OGE? I'm not sure; they say to use MPI for parallel processing across nodes.
What grid or job scheduler software do you use? This could be Sun Grid Engine, Oracle Grid Engine, PBS, etc.
SLURM on the fast cluster and PBS on the slightly slower one; 24 CPUs per node vs. 16.
What you could do is review the SGE script and replace the qsub statements with equivalent commands to start jobs via SLURM. http://www.sdsc.edu/~hocks/FG/PBS.slurm.html
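For example, a minimal SLURM array script along those lines might look like the sketch below (untested; bam_files.txt and count_reads.pl are placeholder names for a list of BAM paths and a single-sample version of the Perl loop above):

#!/bin/bash
#SBATCH --job-name=count_reads
#SBATCH --array=0-199              # one array task per sample/BAM (assuming 200 samples)
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00            # adjust to your queue limits

# bam_files.txt: plain-text list of BAM paths, one per line (placeholder name)
BAM=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" bam_files.txt)

# count_reads.pl: single-sample version of the Perl loop, taking the BAM as its argument
./count_reads.pl "$BAM"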