I tried running BWA in parallel by splitting the input FASTQ file into 'n' splits and ran single ended alignment (aln and samse step) in parallel using 'n' bwa processes. I finally merged the SAM file output from all the processes.
When I compare the merged SAM file output with the SAM file generated by running BWA on the entire FASTQ, I find that considerable number of reads differ in their alignment ( in all the splits except the first one). I was assuming that the alignment is independent per read and splitting the FASTQ file and running BWA in parallel should work.
Could someone advise me if its correct to run BWA in the above fashion ?
That behavior is 100% expected. Did you try blasting that first sequence? Did you notice how it aligns equally well to many sites? The "XT:A:R" tells you that the sequence matches many sites qually well. bwa just picks one to assign it to randomly. So run it twice, different reads will be randomly assigned to different places. I bet all of the reads that exhibit this behavior have the XT:A:R tag.
Thank you for the reply.. Yes, you are right.. All the reads have XT:A:R tag.
If I run the alignment multiple times I get the same output, i.e. the reads are assigned to the same position again and again. It is not giving random positions each time.
This alignment difference occurs only between the merged (from the split fastq files) scenario and the complete output ( from the entire fastq file). Even if bwa randomly assigns positions to the reads, it does that consistently which I verified by running both the scenarios multiple times.
You know, for generating random numbers, you need a seed. BWA always uses the same seed and thus the sequence of random numbers is always the same, but not when you run on two separate files.
Now I understand.. Thanks for the explanation. This is my final question..
Can I safely neglect the difference in alignment output as it is not a correctness issue ? As pointed out earlier by swbarnes2, it is just that the reads align equally well to different locations and one of them is chosen in random.
So, is it valid and acceptable to divide the input fastq into multiple splits and run different BWA processes in parallel and finally merge the output ?
Yes, that is acceptable. However, since there is no way to tell where the reads that map to multiple locations actually came from, there will be ambiguity in placing those reads at the position BWA has mapped to in that particular run. This is also why most of the analysts prefer to work on unique reads.
I am not sure if I understand what you refer as "repetitive reads". Am attempting a single ended alignment and I get different alignment output as below:
But in the first split, everything matches. If randomness was involved and the first split contains a few 'multimappers', that is, reads with MAPQ=0, there should be some differences. Can you elaborate on that a bit?
Hi, Why dont you try pBWA which runs on MPI (message parsing interface) by self. You dont need to split fastq files for that.
So, for a reference my qsub script for running the pBWA on 24 nodes looks like this :
# pBWA script for aligning the raw fastq short read file to the reference genome of mouse (mm9)
# Author : Sukhdeep Singh
# Usage : qsub ~/src/useFul/pBWA/pBWA_raw2sam.sh $1 $2 - not being used YET
# $1 = fastqfile, $2= number of cpu
######################################################################
#!/bin/bash
# qsub parameters beginning with #$
# Use bash as the interpreter
#$ -S /bin/bash
#before executing the command thange to the working directory
#$ -cwd
#submit the complete environment to the execution host
#$ -V
# merge stdout and stderr
#$ -j y
file=$1
# set your output file
#$ -o out.txt
# now initialize the parallel environment with 24 cpus
#$ -pe orte 24
#now we can run our command here.
/opt/openmpi/bin/mpirun -np 24 --mca btl_tcp_if_include eth0 pBWA aln -f $file.sai /biodata/biodb/ABG/genomes/mouse/mm9/mm9 $1
/opt/openmpi/bin/mpirun -np 24 --mca btl_tcp_if_include eth0 pBWA samse -f $file.sam /biodata/biodb/ABG/genomes/mouse/mm9/mm9 $file.sai $1
Then, I get 24 sai indexes and 24 sam files for each auto splitted fastq files, which I concatenate and sort later on.
I'm trying this on my cluster which has 2 compute nodes with 8 cpus each. The ideal setup will be to run bwa aln process over 16 cpus over the 2 compute nodes.
I've SGE and MPI, etc..
While running the script and submit with qsub it just splits the process in multiple parts but it's just using one of my compute nodes.
So for me -pe orte 16 and -np 16.
It's spawning 16 process over 1 node, i need the 16 processes to split up (8 procs on node 0 and 8 procs on node 1).
I suppose this is the sai files you were talking about:
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00000.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00001.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00002.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00003.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00004.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00005.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00006.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00007.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00008.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00009.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00010.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00011.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00012.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00013.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00014.sai
-rw-r--r-- 1 gmarco users 0 Nov 13 13:30 aln_left-1-00015.sai
Ok I already tried with -np 16 yesterday. It took like 4 hours to align 1 lane. Using all my 2 nodes (16cpus). While with normal BWA i could split left and right lane align with 8 threads per node and get the alignment part done in 4 ~ 5 hours for both.
Today I test pBWA with 8 threads option -t 8 and no mpi, orte or whatever and it took only 45 min to get align of one lane.
So what's the point of using 16 cpu's and split results if it's much slower.
I think the problem is somewhere in splitting the jobs over two nodes, as you said it takes 45 mins to using -t 8, so it efficiently uses single node but 4 hours using both nodes, doesn't makes sense as I think its not running in parallel, just one after the other. Do you have some sys admin, there to help you out or just try the pBWA people!!
It should be a RAM issue. If you've enough RAM per process i think splitting the whole task into lot of small chunks will be the best. I've powerfull cpus but i lack lot of RAM.
If anyone with a big cluster 100GB + RAM could test this. It would be awesome.
That behavior is 100% expected. Did you try blasting that first sequence? Did you notice how it aligns equally well to many sites? The "XT:A:R" tells you that the sequence matches many sites qually well. bwa just picks one to assign it to randomly. So run it twice, different reads will be randomly assigned to different places. I bet all of the reads that exhibit this behavior have the XT:A:R tag.
Thank you for the reply.. Yes, you are right.. All the reads have XT:A:R tag.
If I run the alignment multiple times I get the same output, i.e. the reads are assigned to the same position again and again. It is not giving random positions each time.
This alignment difference occurs only between the merged (from the split fastq files) scenario and the complete output ( from the entire fastq file). Even if bwa randomly assigns positions to the reads, it does that consistently which I verified by running both the scenarios multiple times.
You know, for generating random numbers, you need a seed. BWA always uses the same seed and thus the sequence of random numbers is always the same, but not when you run on two separate files.
Now I understand.. Thanks for the explanation. This is my final question..
Can I safely neglect the difference in alignment output as it is not a correctness issue ? As pointed out earlier by swbarnes2, it is just that the reads align equally well to different locations and one of them is chosen in random.
So, is it valid and acceptable to divide the input fastq into multiple splits and run different BWA processes in parallel and finally merge the output ?
Thanks again for all your help..
Yes, that is acceptable. However, since there is no way to tell where the reads that map to multiple locations actually came from, there will be ambiguity in placing those reads at the position BWA has mapped to in that particular run. This is also why most of the analysts prefer to work on unique reads.
Thanks Arun for the explanation..
I am not sure if I understand what you refer as "repetitive reads". Am attempting a single ended alignment and I get different alignment output as below:
Sample Read1:
SRR062634.1918792 0 1 184972930 0 100M * 0 0 CCATTACATAATGGTAAAGGGATCAATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATACAGGAGCACCCAGATTCATAAAGCAAGTC GGEGGGGGGFGEGGGGGGFA?EEEEGGDDDGGFGFGGGGBFGGGGGGGEGGGFDEGGGGEGF=EG?EGGEGBGEEEEEEEEEEBE@AC@?C=CBACCA;@ XT:A:R NM:i:0 X0:i:1056 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
SRR062634.1918792 0 8 47856174 0 100M * 0 0 CCATTACATAATGGTAAAGGGATCAATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATACAGGAGCACCCAGATTCATAAAGCAAGTC GGEGGGGGGFGEGGGGGGFA?EEEEGGDDDGGFGFGGGGBFGGGGGGGEGGGFDEGGGGEGF=EG?EGGEGBGEEEEEEEEEEBE@AC@?C=CBACCA;@ XT:A:R NM:i:0 X0:i:1056 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
Sample Read 2
SRR062634.1918838 0 GL000226.1 14792 0 100M * 0 0 CAGAAACTTCTTTGTGATGTGTGCATTCAAGTCACAGAGTTGAACATTCCCTTTCGTACAGCAGTTTTTAAACACTCTTTCTGTAGTATCTGGAAGTGAA DDDDDDDDDDDDDDCDDDDDDDDDDDDDDDDCDDDCDBDBDCBBB@CBBBBBBBCBBBBBBBABB@BBBBBB=BB=BBBBABBB?BAABABB??AB?BA? XT:A:R NM:i:1 X0:i:8 X1:i:3XM:i:1 XO:i:0 XG:i:0 MD:Z:68G31
SRR062634.1918838 0 GL000226.1 2516 0 100M * 0 0 CAGAAACTTCTTTGTGATGTGTGCATTCAAGTCACAGAGTTGAACATTCCCTTTCGTACAGCAGTTTTTAAACACTCTTTCTGTAGTATCTGGAAGTGAA DDDDDDDDDDDDDDCDDDDDDDDDDDDDDDDCDDDCDBDBDCBBB@CBBBBBBBCBBBBBBBABB@BBBBBB=BB=BBBBABBB?BAABABB??AB?BA? XT:A:R NM:i:1 X0:i:8 X1:i:3XM:i:1 XO:i:0 XG:i:0 MD:Z:68G31
Around 4000 reads differ out of 3,00,000 reads.
But in the first split, everything matches. If randomness was involved and the first split contains a few 'multimappers', that is, reads with MAPQ=0, there should be some differences. Can you elaborate on that a bit?