Hi,
I tried running BWA in parallel by splitting the input FASTQ file into 'n' splits and ran single ended alignment (aln and samse step) in parallel using 'n' bwa processes. I finally merged the SAM file output from all the processes.
When I compare the merged SAM file output with the SAM file generated by running BWA on the entire FASTQ, I find that considerable number of reads differ in their alignment ( in all the splits except the first one). I was assuming that the alignment is independent per read and splitting the FASTQ file and running BWA in parallel should work.
Could someone advise me if its correct to run BWA in the above fashion ?
Thanks, Nabeel
That behavior is 100% expected. Did you try blasting that first sequence? Did you notice how it aligns equally well to many sites? The "XT:A:R" tells you that the sequence matches many sites qually well. bwa just picks one to assign it to randomly. So run it twice, different reads will be randomly assigned to different places. I bet all of the reads that exhibit this behavior have the XT:A:R tag.
Thank you for the reply.. Yes, you are right.. All the reads have XT:A:R tag.
If I run the alignment multiple times I get the same output, i.e. the reads are assigned to the same position again and again. It is not giving random positions each time.
This alignment difference occurs only between the merged (from the split fastq files) scenario and the complete output ( from the entire fastq file). Even if bwa randomly assigns positions to the reads, it does that consistently which I verified by running both the scenarios multiple times.
You know, for generating random numbers, you need a seed. BWA always uses the same seed and thus the sequence of random numbers is always the same, but not when you run on two separate files.
Now I understand.. Thanks for the explanation. This is my final question..
Can I safely neglect the difference in alignment output as it is not a correctness issue ? As pointed out earlier by swbarnes2, it is just that the reads align equally well to different locations and one of them is chosen in random.
So, is it valid and acceptable to divide the input fastq into multiple splits and run different BWA processes in parallel and finally merge the output ?
Thanks again for all your help..
Yes, that is acceptable. However, since there is no way to tell where the reads that map to multiple locations actually came from, there will be ambiguity in placing those reads at the position BWA has mapped to in that particular run. This is also why most of the analysts prefer to work on unique reads.
Thanks Arun for the explanation..
I am not sure if I understand what you refer as "repetitive reads". Am attempting a single ended alignment and I get different alignment output as below:
Sample Read1:
SRR062634.1918792 0 1 184972930 0 100M * 0 0 CCATTACATAATGGTAAAGGGATCAATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATACAGGAGCACCCAGATTCATAAAGCAAGTC GGEGGGGGGFGEGGGGGGFA?EEEEGGDDDGGFGFGGGGBFGGGGGGGEGGGFDEGGGGEGF=EG?EGGEGBGEEEEEEEEEEBE@AC@?C=CBACCA;@ XT:A:R NM:i:0 X0:i:1056 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
SRR062634.1918792 0 8 47856174 0 100M * 0 0 CCATTACATAATGGTAAAGGGATCAATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATACAGGAGCACCCAGATTCATAAAGCAAGTC GGEGGGGGGFGEGGGGGGFA?EEEEGGDDDGGFGFGGGGBFGGGGGGGEGGGFDEGGGGEGF=EG?EGGEGBGEEEEEEEEEEBE@AC@?C=CBACCA;@ XT:A:R NM:i:0 X0:i:1056 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
Sample Read 2
SRR062634.1918838 0 GL000226.1 14792 0 100M * 0 0 CAGAAACTTCTTTGTGATGTGTGCATTCAAGTCACAGAGTTGAACATTCCCTTTCGTACAGCAGTTTTTAAACACTCTTTCTGTAGTATCTGGAAGTGAA DDDDDDDDDDDDDDCDDDDDDDDDDDDDDDDCDDDCDBDBDCBBB@CBBBBBBBCBBBBBBBABB@BBBBBB=BB=BBBBABBB?BAABABB??AB?BA? XT:A:R NM:i:1 X0:i:8 X1:i:3XM:i:1 XO:i:0 XG:i:0 MD:Z:68G31
SRR062634.1918838 0 GL000226.1 2516 0 100M * 0 0 CAGAAACTTCTTTGTGATGTGTGCATTCAAGTCACAGAGTTGAACATTCCCTTTCGTACAGCAGTTTTTAAACACTCTTTCTGTAGTATCTGGAAGTGAA DDDDDDDDDDDDDDCDDDDDDDDDDDDDDDDCDDDCDBDBDCBBB@CBBBBBBBCBBBBBBBABB@BBBBBB=BB=BBBBABBB?BAABABB??AB?BA? XT:A:R NM:i:1 X0:i:8 X1:i:3XM:i:1 XO:i:0 XG:i:0 MD:Z:68G31
Around 4000 reads differ out of 3,00,000 reads.
But in the first split, everything matches. If randomness was involved and the first split contains a few 'multimappers', that is, reads with MAPQ=0, there should be some differences. Can you elaborate on that a bit?