Hello everyone,
As I was playing with BWA, I came across something strange.
I made a ref.fasta file with two copies of the same genome sequence (I changed the second copy's name).
I was expecting to see the reads mapped on one of the two references and the alignment on the other one displayed in the XA tag.
What's strange is that, for some reads, the first in pair was mapped on one reference and its mate on the other, breaking down the pair.
Example:
ERR656485.14 81 gi|9629357|ref|NC_001802.2| 31 0 151 S129M1D20M gi|9629357|ref|NC_001802.1| 26 0 TTGTATGTATT ACATGATGACGGTGATCGTGTTTGACAGTGTCTTCATCATTGTTTGTTTTTTTTTTTTTTTCAAGCAGAAGACGG CATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAGCCCGGGAGCTCTC TGGCTATCTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCCGTCT GTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCN ),. )))))))))))-)((((((((.-,(((.)))).))((:.)).)444)(.(,,.(,-(-4FFB>FFFFFFFFFFFF FFFFFFF7FDDFGCC4*FGGGGGGGGGFECGGGGECGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCB8! NM:i:7 AS:i:116 XS:i:116 ERR656485.14 161 gi|9629357|ref|NC_001802.1| 26 0 134 M1D20M146S gi|9629357|ref|NC_001802.2| 31 0 NGCCCGGGAGC TCTCTGGCTATCTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCC GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCAAGATCGG AAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAAACACTAACA CAGCACCCAGCAGAGAGTAACTACATGGCACAGACTAATACAGTAGCAGTCACACACAACACAT !8A CCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGCFGGGFDGGGGGGFGGGGGGGGGGG GGGGGGGGGGGGGGGGGGDDGGFGGFGGGFFFGFFGFFFFDFFDFFFFFFFF5@ACAEFFFFFFFFB>@(,(,(, ,((),(().4(((((-((((((((())).)-)))),,()(((()))))))))))--.)(-)-((-((4((,) NM:i:8 AS:i:117 XS:i:117
Pierre Lindenbaum and I did some tests and found that if we shuffled the fastqs, some paired reads were back to what was expected, except for the flags (0x2 unset) and mapq (0).
Example:
ERR656485.14 81 gi|9629357|ref|NC_001802.2| 31 0 151 S129M1D20M = 26 -155 TTGTATGTATTACATGATGACGGTGATCGTGTTTG ACAGTGTCTTCATCATTGTTTGTTTTTTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATC GGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAGCCCGGGAGCTCTCTGGCTATCTAGGGAACCCACTGCT TAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAG ATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCN ),.)))))))))))-)((((((((.-, (((.)))).))((:.)).)444)(.(,,.(,-(-4FFB>FFFFFFFFFFFFFFFFFFF7FDDFGCC4*FGGGGGG GGGFECGGGGECGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCB8! NM:i:7 MD:Z:16A48C 55C0T3A2^C19A0 AS:i:116 XS:i:116 XA:Z:gi|9629357|ref|NC_0018 02.1|,-31,151S129M1D20M,7; ERR656485.14 161 gi|9629357|ref|NC_001802.2| 26 0 134 M1D20M146S = 31 155 NGCCCGGGAGCTCTCTGGCTATCTAGGGAACCCAC TGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACT AGAGATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGT GTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAAACACTAACACAGCACCCAGCAGAGAGTAACTAC ATGGCACAGACTAATACAGTAGCAGTCACACACAACACAT !8ACCGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGFGGGCFGGGFDGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDDGGFG GFGGGFFFGFFGFFFFDFFDFFFFFFFF5@ACAEFFFFFFFFB>@(,(,(,,((),(().4(((((-(((((((( ())).)-)))),,()(((()))))))))))--.)(-)-((-((4((,) NM:i:8 MD:Z:0A3T16 A48C55C0T3A2^C20 AS:i:117 XS:i:117 XA:Z:gi|9629357|ref |NC_001802.1|,+26,134M1D20M146S,8;
Does someone have an explanation for that?
Did I do something wrong?
Aurélien
Note: Aurelien is my internship student. He is working on a blast2sam software ( https://github.com/guyduche/Blast2Bam ). When he compared the output of Blast , he was surprised to see some "properly paired-reads" with blast but not with bwa ( read & mate mapped on different contigs) . The reference used was the sequence of HIV duplicated with two different names.
How much is some? Is it 50%:50% for paired:broken? Or something like 95%:5%?
I did a samtools flagstat on the tests results.
1 ref:
2 refs:
2 refs with fastq shuffled: