bwa mapping to several similar strains, restrict maximum hits?
2
1
Entering edit mode
10.1 years ago
Adrian Pelin ★ 2.6k

Hello,

I want to use bwa mem or aln, to align reads to 3-5 strains of a virus. However, if a read can map with equal quality/score to more than 1 strain, i want the read to be reported as unaligned. Basically, only reads that map to one strain only with a best score should be considered.

The genomes of the 5 strains are in a fasta file, each genome is present in one DNA molecule. The tricky part, is that vaccinia viruses have IR at the 3' and 5' ends of their genomes, so this is a repetitive regions within one strain.

What approach would you guys recommend? For sampe and samse I see the -n option can be set to 1 to accomplish what I need, but that might also mean that IR regions will not have mappings.

Adrian

bwa sam bwa-mem bwa-aln fasta • 3.1k views
ADD COMMENT
2
Entering edit mode
10.0 years ago

I think the easiest way to do this would be with BBSplit, which is specifically designed for handling ambiguity between and within organisms differently. It maps to all references at once and outputs reads to one file per reference, only the one they map to best. The "ambig" flag allows you to specify the behavior for reads that map ambiguously (discard, retain, randomly place, etc), while the "ambig2" flag allows you to specify the behavior for reads that specifically map ambiguously between references - in other words, it will not affect the behavior of a read that maps to only one organism, but matches two different locations within that one organism.

ADD COMMENT
1
Entering edit mode
10.1 years ago

Allow secondary alignments, then sort your alignments by read name and toss any reads that map to multiple strains. That way you've directly answered your biological question of interest.

ADD COMMENT
0
Entering edit mode

Sounds like a good approach. So with bwa mem it's just the -a parameter. Then I would have to work with the sam file produced. Any hint as how this sorting might happen? I am not really good with the sam format, still trying to learn.

ADD REPLY
0
Entering edit mode

You can sort by read name using samtools sort with the -n flag. Take a look at this answer: Can A Bam File Be Sorted By The Read Names

ADD REPLY
0
Entering edit mode

Okay, I have some progress. I managed to do it and now I see what is going on. A lot of reads are mapping to more than one reference with a perfect match, CIGAR string is only M. But now I have a few problems:

1) There is no score reported in the sam file. When I say score, i mean how well the read mapped. So if it matches perfectly to 3 references and the 4th has only one mismatch, I need to look at the CIGAR string. Is there any way to sort how well the read mapped to the different references? Or is that done already?

2) Because there is no score, there is no way for me to remove reads that match with identical scores to more than one reference.

Am I missing something here?

ADD REPLY

Login before adding your answer.

Traffic: 1904 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6