How Are Human Sex Chromosomes Handled When Mapping Reads?
11.5 years ago
robert ▴ 30

When a mapper (like BWA) is used for human whole-genome reads, how are the X and Y chromosomes typically treated? In particular, are homologous regions between the two masked out in the Y chromosome to prevent ambiguous mapping in males and nonsensical mapping in females? Or are other techniques used to resolve these issues?

If masking is used, is there a published definition of the regions available?

Can you provide a reference to any published articles on this subject?

mapping human • 5.7k views

Check out the README for the 1000g reference genome. Read the bottom section. It answers most of your questions.


Perfect! Thanks for the reference - that's exactly what I was looking for.


@lh3 also made this helpful post somewhat recently (thanks!): http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

11.5 years ago
Gabriel R. ★ 2.9k

how are the X and Y chromosomes typically treated?

Like any other chromosome.


In particular, are homologous regions between the two masked out in the Y chromosome to prevent ambiguous mapping in males and nonsensical mapping in females?

No, but duplicated regions will give you low mapping quality. Reads with multiple equally good matches will be placed at random by BWA.
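To see how much of this ambiguity ends up on each chromosome, one could tally low-MAPQ alignments from the SAM output. This is only a minimal sketch: the function name `count_low_mapq`, the threshold, and the example records are made up for illustration. It relies on the standard SAM column order, where RNAME is column 3 and MAPQ is column 5.

```python
from collections import Counter

def count_low_mapq(sam_lines, mapq_threshold=1):
    """Count alignments per reference whose MAPQ is below mapq_threshold.

    Assumes plain-text SAM records (tab-separated, 11 mandatory fields).
    """
    low = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        rname, mapq = fields[2], int(fields[4])
        if rname != "*" and mapq < mapq_threshold:
            low[rname] += 1
    return low

# Hypothetical records: two reads placed with MAPQ 0, one placed uniquely.
example_sam = [
    "@SQ\tSN:chrX\tLN:1000",
    "r1\t0\tchrX\t100\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchrY\t100\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchrX\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
low_counts = count_low_mapq(example_sam)
```

With the default threshold, only MAPQ 0 alignments are counted, which is the value typically assigned to reads with several equally good placements.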


Or are there other techniques used to resolve these issues?

What issues? I think you need to add some details about what you mean by issues.


If masking is used, is there a published definition of the regions available?

What do you seek to mask? I think you need to provide a bit of background on your project.


Can you provide a reference to any published articles on this subject?

No, no time, sorry.


I am looking at possible modifications to mapping software that could make variant calling in the X and Y chromosome regions more accurate. Perhaps an example would shed more light on the issue I'm trying to get at.

The pseudoautosomal regions in Y (PAR1 & PAR2) exchange DNA with homologous regions in X, thus acting diploid. It would seem ideal for variant calling to be likewise diploid here, making homozygous or heterozygous calls based on ALL the reads aligning well to these regions, as is done for the autosomal chromosomes, with the MAPQ of these reads being high unless they align nearly as well to other unrelated regions. But if full X and Y reference sequences are used, then supporting reads will map arbitrarily to the homologous X and Y regions, and even if variant calling tools merge them back into a single pileup, their maximum-likelihood analysis would be crippled by near-zero MAPQ scores.

It seems plausible to me that if these regions were masked to N’s in the Y chromosome, for example, associated reads would get mapped into only the X regions, with meaningful MAPQ, and accurate diploid variant calling would thereby be facilitated. I know that Complete Genomics, for example, does make bi-allelic variant calls in the pseudoautosomal regions of the X and Y chromosomes, with positions always reported in the X chromosome.

I am trying to discover whether something like the Y masking I imagine, or some other method, is used in current practice to achieve accurate variant calling in pseudoautosomal or other homologous regions of the X and Y chromosomes. Rather than manipulating the reference genome to mask out certain regions of Y, it might make sense to build that intelligence into the mapper via a list of regions to mask in Y.
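To make the masking idea concrete, here is a rough sketch of replacing a list of regions with N's in a sequence held in memory. The function name `mask_regions` and the toy sequence and coordinates are hypothetical; real PAR coordinates depend on the reference build and are not given here. Intervals are assumed to be 1-based and inclusive.

```python
def mask_regions(sequence, regions):
    """Return a copy of sequence with each (start, end) interval set to N.

    Intervals are 1-based and inclusive (an assumption of this sketch).
    """
    masked = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(sequence) or start > end:
            raise ValueError(f"invalid interval: {start}-{end}")
        # Convert to a 0-based, half-open slice and overwrite with N's.
        masked[start - 1:end] = "N" * (end - start + 1)
    return "".join(masked)

# Toy "chromosome" with two made-up regions standing in for PAR1 and PAR2.
toy_y = "ACGTACGTAC"
toy_par = [(2, 4), (9, 10)]
masked_y = mask_regions(toy_y, toy_par)
```

Because the masked sequence keeps its original length, coordinates outside the masked regions stay unchanged.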


OK, I had a hunch that was the problem you were referring to. I guess the question then is: why is this a problem? Could you elaborate on your project? Are you interested in the pseudoautosomal regions?


I am developing a new mapping/aligning algorithm, and this inquiry is just an attempt to understand an aspect of how the mapper will be used, i.e. if users would sometimes give special treatment to the sex chromosomes in the reference sequence. I am a computer scientist and in the process of coming up to speed on the genomics applications side of things.

I am starting to infer from your responses, which I sincerely appreciate, that special treatment is uncommon: normally a user would include both X and Y chromosomes in the reference without masking or other modification, and the resulting low MAPQ and spotty coverage in X/Y homologous regions are expected and acceptable. If so, that is good to know. I don't want to try to solve a non-problem.

But if you know of any published work or discussions on this issue, I would love to be able to dig into it further offline.

Thanks for your insights.


Ah, OK. I am a computer scientist too :-)

Honestly, I think that problem should not be corrected during mapping. Mapping should be impervious to the presumed genomic structure. This problem should be handled downstream during genotyping.

Besides, I think in general people would be more interested in correctly calling regions on the Y rather than improving SNP calling on regions that will not be trusted.
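As an illustration of what "handled downstream during genotyping" could mean, a genotyper might pick the expected ploidy per position from the sample's sex and a list of pseudoautosomal intervals. This is just a sketch under stated assumptions, not how any particular caller works: the function name `expected_ploidy`, the chromosome names, and the toy PAR coordinates are hypothetical, and it assumes PAR calls are reported on X only, as described earlier in the thread.

```python
def expected_ploidy(chrom, pos, sex, par_regions):
    """Return the ploidy a genotyper might assume at chrom:pos.

    par_regions is a list of (chrom, start, end) tuples, 1-based inclusive.
    Assumes PAR variants are reported on chrX only (ploidy 0 on chrY PAR).
    """
    if sex not in ("male", "female"):
        raise ValueError(f"unknown sex: {sex}")
    if chrom not in ("chrX", "chrY"):
        return 2                              # autosomes are diploid
    if sex == "female":
        return 2 if chrom == "chrX" else 0    # no Y expected in females
    in_par = any(c == chrom and start <= pos <= end
                 for c, start, end in par_regions)
    if in_par:
        return 2 if chrom == "chrX" else 0    # diploid PAR, called on X
    return 1                                  # male X and Y outside PAR

# Made-up PAR intervals for a toy reference.
toy_par_regions = [("chrX", 1, 100), ("chrY", 1, 100)]
male_x_par = expected_ploidy("chrX", 50, "male", toy_par_regions)
male_x_other = expected_ploidy("chrX", 500, "male", toy_par_regions)
```

The mapper stays unaware of any of this; only the genotyping step consults the table.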
