Hello,
I want to use bwa mem or aln, to align reads to 3-5 strains of a virus. However, if a read can map with equal quality/score to more than 1 strain, i want the read to be reported as unaligned. Basically, only reads that map to one strain only with a best score should be considered.
The genomes of the 5 strains are in a fasta file, each genome is present in one DNA molecule. The tricky part, is that vaccinia viruses have IR at the 3' and 5' ends of their genomes, so this is a repetitive regions within one strain.
What approach would you guys recommend? For sampe and samse I see the -n option can be set to 1 to accomplish what I need, but that might also mean that IR regions will not have mappings.
Adrian
Sounds like a good approach. So with bwa mem it's just the -a parameter. Then I would have to work with the sam file produced. Any hint as how this sorting might happen? I am not really good with the sam format, still trying to learn.
You can sort by read name using samtools sort with the -n flag. Take a look at this answer: Can A Bam File Be Sorted By The Read Names
Okay, I have some progress. I managed to do it and now I see what is going on. A lot of reads are mapping to more than one reference with a perfect match, CIGAR string is only M. But now I have a few problems:
1) There is no score reported in the sam file. When I say score, i mean how well the read mapped. So if it matches perfectly to 3 references and the 4th has only one mismatch, I need to look at the CIGAR string. Is there any way to sort how well the read mapped to the different references? Or is that done already?
2) Because there is no score, there is no way for me to remove reads that match with identical scores to more than one reference.
Am I missing something here?