Hi all. I am using BWA mem to map reads from a ChiP experiment which comes from an F1. To avoid mapping bias we are mapping them to a diploid genome consisting of both parent genomes. To get unique reads we filter them by q20 but we also want to recover all the multi-map reads which map twice, once to the father and once to the mother genome. We do not want to lose all those positions which are TF binding site, there is just no variation at the site. Since we are doing this for many F1s losing all those non-variant sites would give us very biased results.
I know MAPQ=0 is ambiguous reads but is there a way to distinguish reads that map twice and reads that map 3 or more times with equal score?
Your help would be appreciated.
Thinking out aloud rather than offering a concrete solution.
I am not sure which aligner you are using and how it handles multi-mappers but this is a difficult task for any aligner (having a diploid genome present in reference). Have you looked at the results to see what is happening? Which aligner are you using? It may be better to align to just one haploid parent version and then look for number of alternate alleles present at sites you know.
Hi. We are using BWA-mem (now BWA-mem2). Using a diploid genome was the only way to truly eliminate mapping bias. In plants, we have a lot of structure variation so personal genome approaches left to many artefacts.
But are you able to see
in the alignments? I assume your reference genome contains something line
chr1_parent1
andchr1_parent2
type names. If you were to extracts a set of aligned reads do you see them aligning to both genomes (if you ignore thetwice
requirement).Yes, chromosomes have the names of their parents to distinguish. We use a chain file from WGA to connect which parts of each chromosome belong to each other. Currently, with BWA mem, reads are assigned by random to each position if the mapping score is equal which in this case would be 0. At the moment, we keep reads with a Mapping quality of >20 (>99% unique) and 0 to keep ambiguous reads. So we merge the bam from q<20 and q=0. But instead of keeping all multimap reads, we would like to keep the once that map equally to parent 1 and parent 2 chromosomes, but not reads that map more than twice with an equal score.
That is exactly what I was referring to in my initial comment.
That is almost certainly going to require some kind of post processing. I would suggest that you try
bbmap.sh
withambig=all
option which would align the reads to all possible locations. I don't thinkbwa
has that kind of an option. Then you will need to process the bam file to find reads that fit your requirement, which will need custom code.