Hi all,
This may be a question already and repeatedly asked before, but I could not find an answer relevant to my issue...
I would like to map human/mouse genome sequence data of paired end reads (150 bp x 2 etc.) allowing one end being "uniquely" mapped and the other end "multiply" mapped in order to increase mappability to repetitive regions.
Is this (one end uniquely and the other end multiply mapped) possible for common mapping softwares such as bowtie2, bwa etc. by properly setting some option parameters, or do I need to manually filter (by awk etc.) the output, for which I allow for multimapping reads of both ends?
Thank you for your kind help and advice in advance. schu is offline Reply With Quote
It is unclear what you are actually trying to achieve.
As Istvan Albert already stated, this is already done by most aligners. Specifically, most aligners will preferentially place a multi-mapping read in the location that is consistent with location of the uniquely mapping mate based on the library fragment size distribution. Most popular aligners will even report a non-zero mapq for the multi-mapping mate (since the read pair, as a whole, is not multimapping).
Aligners will still preferentially align the both reads in the read pair in a consistent location. If there's no one best location then they'll output a random location with mapq=0. The exception to this is if the sequence is highly abundant in the reference (e.g. alpha satellite repeats) as, for performance reasons, the aligners will have a limit on the number of candidate locations an alignment 'seed' can have. For bwa this is the
-c
parameter and defaults to 500.That's not how alignment works. If your reference genome has 10 identical copies of a particular region (that is large relative to the fragment size/read length), there's nothing any aligner can do to increase the mappability because will be 5 regions that all have identically good alignments. By default, aligners will randomly choose one of the 5. You could ask the aligner to output all the candidate locations (either as secondary alignments or, for bwa, the XA tag using
-h
), but that doesn't actually change the mappability of any of the regions - it's still an ambiguously aligned read and there's nothing the aligner can do about it^.Is this data (such as mouse xenograft data) where your input reads are a mixture of mouse and human cell that you're trying to align to a joint mouse+human reference? Due to high level of homology across the genomes, you'll get a high multi-mapping rate if you just aligned to a joint reference. This sort of data requires more sophisticated and there are specialised pipeline designed to deal with this.
As a general rule, try to look for and use existing software whenever possible. High quality software development time consuming and writing your own frequently leads to an inferior solution to a general problem. Considering writing your own when there are no options out there, or you are extremely familiar with the state-of-the-art and have concrete solutions to the limitations of the current software.
^ When the homologs aren't identical in the sample being sequenced, there are some sophisticated techniques that can be used for partially disambiguating them, but that's way more advanced than the question you are asking and is likely to require literally years to perfect.
This is some novel knowledge to me. Got two questions about this:
Thanks!
Incorrect distance or orientation will not get a scoring bonus.
This is in the bwa documentation (https://bio-bwa.sourceforge.net/bwa.shtml):
Dear d-cameron and cmdcolin,
Thank you so much for your kind comments and advice.
I confirmed this by looking at my alignments of WGS data in IGV focusing on junctions of repetitive elements and non-repetitive regions. Actually, "unique + multi" mate pairs at such junctions were labeled as "unique", while "multi + multi" mate pairs inside repetitive elements were mapped as "multi".
No, I am analyzing human WGS vs mouse WGS but not a mixture of human and mouse samples (my colleague is using "Xenome" for this purpose). I am sorry for my ambiguous question.