paired end mapping with one end being unique and the other end multiple
1
0
Entering edit mode
2.7 years ago
schu • 0

Hi all,

This may be a question already and repeatedly asked before, but I could not find an answer relevant to my issue...

I would like to map human/mouse genome sequence data of paired end reads (150 bp x 2 etc.) allowing one end being "uniquely" mapped and the other end "multiply" mapped in order to increase mappability to repetitive regions.

Is this (one end uniquely and the other end multiply mapped) possible for common mapping softwares such as bowtie2, bwa etc. by properly setting some option parameters, or do I need to manually filter (by awk etc.) the output, for which I allow for multimapping reads of both ends?

Thank you for your kind help and advice in advance. schu is offline Reply With Quote

paired-end genome ngs • 2.0k views
ADD COMMENT
3
Entering edit mode

It is unclear what you are actually trying to achieve.

allowing one end being "uniquely" mapped and the other end "multiply" mapped in order to increase mappability to repetitive regions

As Istvan Albert already stated, this is already done by most aligners. Specifically, most aligners will preferentially place a multi-mapping read in the location that is consistent with location of the uniquely mapping mate based on the library fragment size distribution. Most popular aligners will even report a non-zero mapq for the multi-mapping mate (since the read pair, as a whole, is not multimapping).

I allow for multimapping reads of both ends

Aligners will still preferentially align the both reads in the read pair in a consistent location. If there's no one best location then they'll output a random location with mapq=0. The exception to this is if the sequence is highly abundant in the reference (e.g. alpha satellite repeats) as, for performance reasons, the aligners will have a limit on the number of candidate locations an alignment 'seed' can have. For bwa this is the -c parameter and defaults to 500.

in order to increase mappability to repetitive regions

That's not how alignment works. If your reference genome has 10 identical copies of a particular region (that is large relative to the fragment size/read length), there's nothing any aligner can do to increase the mappability because will be 5 regions that all have identically good alignments. By default, aligners will randomly choose one of the 5. You could ask the aligner to output all the candidate locations (either as secondary alignments or, for bwa, the XA tag using -h), but that doesn't actually change the mappability of any of the regions - it's still an ambiguously aligned read and there's nothing the aligner can do about it^.

to map human/mouse genome sequence data

Is this data (such as mouse xenograft data) where your input reads are a mixture of mouse and human cell that you're trying to align to a joint mouse+human reference? Due to high level of homology across the genomes, you'll get a high multi-mapping rate if you just aligned to a joint reference. This sort of data requires more sophisticated and there are specialised pipeline designed to deal with this.

As a general rule, try to look for and use existing software whenever possible. High quality software development time consuming and writing your own frequently leads to an inferior solution to a general problem. Considering writing your own when there are no options out there, or you are extremely familiar with the state-of-the-art and have concrete solutions to the limitations of the current software.

^ When the homologs aren't identical in the sample being sequenced, there are some sophisticated techniques that can be used for partially disambiguating them, but that's way more advanced than the question you are asking and is likely to require literally years to perfect.

ADD REPLY
1
Entering edit mode

most aligners will preferentially place a multi-mapping read in the location that is consistent with location of the uniquely mapping mate based on the library fragment size distribution.

This is some novel knowledge to me. Got two questions about this:

  1. Will the multi-mapping read be strictly located in the way that 1) it falls within the fragment size range to the uniquely mapping mate; and 2) it face towards the uniquely mapping mate on the opposite strand?
  2. how would those aligners find out about the library fragment size ( = insert length + adapter length x 2) distribution if not manually input to them? I also failed to find such an input option in either bwa or bowtie2

Thanks!

ADD REPLY
0
Entering edit mode
  1. Yes, the placement has to be 'concordant' with library fragment size distribution.

Incorrect distance or orientation will not get a scoring bonus.

  1. how would those aligners find out about the library fragment size

This is in the bwa documentation (https://bio-bwa.sourceforge.net/bwa.shtml):

Estimating Insert Size Distribution

BWA estimates the insert size distribution per 2561024 read pairs. It first collects pairs of reads with both ends mapped with a single-end quality 20 or higher and then calculates median (Q2), lower and higher quartile (Q1 and Q3). It estimates the mean and the variance of the insert size distribution from pairs whose insert sizes are within interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair considered to be properly paired (SAM flag 0x2) is calculated by solving equation Phi((x-mu)/sigma)=x/Lp0, where mu is the mean, sigma is the standard error of the insert size distribution, L is the length of the genome, p0 is prior of anomalous pair and Phi() is the standard cumulative distribution function. For mapping Illumina short-insert reads to the human genome, x is about 6-7 sigma away from the mean. Quartiles, mean, variance and x will be printed to the standard error output.

ADD REPLY
0
Entering edit mode

Dear d-cameron and cmdcolin,

Thank you so much for your kind comments and advice.

Most popular aligners will even report a non-zero mapq for the multi-mapping mate (since the read pair, as a whole, is not multimapping).

I confirmed this by looking at my alignments of WGS data in IGV focusing on junctions of repetitive elements and non-repetitive regions. Actually, "unique + multi" mate pairs at such junctions were labeled as "unique", while "multi + multi" mate pairs inside repetitive elements were mapped as "multi".

Is this data (such as mouse xenograft data) where your input reads are a mixture of mouse and human ...

No, I am analyzing human WGS vs mouse WGS but not a mixture of human and mouse samples (my colleague is using "Xenome" for this purpose). I am sorry for my ambiguous question.

ADD REPLY
5
Entering edit mode
2.7 years ago

Multiple mapping should not even require anything on your behalf, it should work like that by default. But collecting all the positions may be a bit more complicated and may require parsing/processing the BAM file with a custom script.

The alternative locations will be specified in the XA tags (if I recall that correctly) and you won't get a separate alignment for each position.

Here is where the choice of aligner may matter more, bowtie2 usually offers more formatting options, perhaps includes all alternative locations, bwa almost certainly provides only the XA tags.

ADD COMMENT
2
Entering edit mode

allowing one end being "uniquely" mapped and the other end "multiply" mapped

This is the bit of requirement mentioned in original post that is unlikely to be supported by any aligner. OP will need to filter the results after the alignments to satisfy this requirement.

ADD REPLY
0
Entering edit mode

Dear Istvan Albert and GenoMax,

Thank you so much for your kind replies and comments. All right, I will now try to parse SAM/BAM of bowtie2 by using flags and tags etc. Thanks again.

ADD REPLY
2
Entering edit mode

bwa almost certainly provides only the XA tags

Be aware that the XA tag (and indeed all X, Y, Z* tags) has an implementation-defined meaning. It's used as a light-weight secondary alignment field by bwa but there is no requirement that any other caller either uses the field in this way, or even at all.

ADD REPLY

Login before adding your answer.

Traffic: 1645 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6