There certainly are tools available that correctly handle sex chromosomes. With the RTG suite of tools, your human reference can contain a configuration file that specifies the autosomes, sex chromosomes, PAR regions etc, and this file is consulted during both mapping and variant calling. During processing for a sample, you specify the sex (if known), and it Just Works.
For example, during mapping (rtg map), when processing a sample specified as female, the aligner will not attempt to map reads to the Y chromosome. Similarly, for males it will not map reads to the PAR regions on the Y chromosome (although most human references already have these regions masked out anyway). When performing variant calling (rtg snp / rtg family / rtg population / rtg somatic) on a sample specified as male, the caller automatically uses haploid calling for the X and Y chromosomes (except in the PAR regions, where diploid calling is carried out). When performing joint calling of multiple samples that form a pedigree, either a single family (rtg family) or a larger multi-generation pedigree (rtg population), the variant calling utilizes Mendelian inheritance in the Bayesian models to further inform the calls (including appropriate behaviour on the sex chromosomes). Of course, you can manually override any of this if you want.
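The sex-aware ploidy behaviour described above can be sketched roughly as follows. This is a hypothetical illustration, not RTG's actual reference configuration format; the PAR coordinates are the GRCh37 ones:

```python
# Hypothetical sketch of sex-aware ploidy selection during calling.
# PAR coordinates are GRCh37; the function is illustrative only.

PAR_X = [(60_001, 2_699_520), (154_931_044, 155_260_560)]  # PAR1, PAR2 on X
PAR_Y = [(10_001, 2_649_520), (59_034_050, 59_363_566)]    # PAR1, PAR2 on Y

def in_regions(pos, regions):
    return any(start <= pos <= end for start, end in regions)

def ploidy(chrom, pos, sex):
    """Ploidy a sex-aware caller would use at (chrom, pos)."""
    if chrom == "chrY":
        if sex == "female":
            return 0                              # no Y in an XX sample
        return 2 if in_regions(pos, PAR_Y) else 1
    if chrom == "chrX":
        if sex == "male":
            return 2 if in_regions(pos, PAR_X) else 1
        return 2
    return 2                                      # autosomes: diploid
```

So for a male sample, a position at chrX:5,000,000 (outside both PARs) gets haploid calling, while a position inside PAR1 is treated as diploid.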
For all of these modes, there is a validation tool that runs automatically at the end of a mapping job; it can also be invoked explicitly (rtg chrstats) if you have split your mapping into multiple smaller jobs and want it to use aggregate statistics. It analyses the coverage of the autosomes and sex chromosomes to indicate whether the sample may have been mapped with a sex that does not correspond to what the sample actually is (for example, due to sample mislabelling, incorrect pedigree information, or chromosomal abnormalities such as XXY, trisomy, etc.).
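As a rough illustration of the kind of inference such a coverage check can make, length-normalized depths of X and Y relative to the autosomal average map fairly directly onto a karyotype guess. The thresholds and functions below are invented for this sketch; real tools are more careful:

```python
# Invented-threshold sketch of a coverage-based karyotype check, mapping
# length-normalized X and Y depths (relative to the autosomal average)
# to a sex-chromosome complement.

def karyotype_guess(auto_depth, x_depth, y_depth):
    """Guess the sex-chromosome complement from normalized mean depths."""
    x = x_depth / auto_depth   # ~1.0 with two X copies, ~0.5 with one
    y = y_depth / auto_depth   # ~0.5 per Y copy
    if y < 0.1:                # effectively no Y coverage
        return "XX" if x > 0.75 else "X0"
    if x > 0.75:
        return "XXY"
    return "XYY" if y > 0.75 else "XY"

def sex_matches(declared, auto_depth, x_depth, y_depth):
    """Flag samples whose coverage disagrees with the declared sex."""
    expected = {"female": "XX", "male": "XY"}[declared]
    return karyotype_guess(auto_depth, x_depth, y_depth) == expected
```

For example, a sample declared female but with X at half the autosomal depth and substantial Y coverage would be flagged as XY-like, which is exactly the mislabelling case described above.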
Give it a go: RTG Core (which contains all that goodness) is free for non-commercial use.
That is a good point about the PAR region! Your answer is much more complete than mine. Do you know of any tools that would look at both coverage and SNPs to make an inference on XX/XY/XYY/XXY? Also, since we are on this topic, would it be worth worrying about contamination from sample prep (say a male was handling a female sample; could that make it seem XXY)?
To tell the sex of a sample, we don't need to do het calling. Just compare the average read depth in non-PAR, non-repetitive regions on X to that on an autosome, as you said. This can be done in about a minute with random sampling on a mapped BAM. Of course, combining depth with het calling is better, but it is more complex and takes more time.
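That random-sampling estimate can be sketched as below. The depth_at function is a stand-in for a pileup lookup on a mapped BAM (in practice something like pysam would provide it); it is passed in here so the sketch stays self-contained:

```python
# Sketch of sex inference by randomly sampling read depth. Regions should
# be non-PAR, non-repetitive; depth_at(chrom, pos) is a stand-in for a
# real pileup query (e.g. via pysam).

import random

def mean_depth(regions, depth_at, n_samples=2000, seed=0):
    """Estimate mean depth over (chrom, start, end) regions by sampling."""
    rng = random.Random(seed)
    total = sum(end - start for _, start, end in regions)
    acc = 0
    for _ in range(n_samples):
        offset = rng.randrange(total)
        for chrom, start, end in regions:   # locate the sampled position
            if offset < end - start:
                acc += depth_at(chrom, start + offset)
                break
            offset -= end - start
    return acc / n_samples

def depth_ratio(x_regions, auto_regions, depth_at):
    """X-to-autosome depth ratio: ~1.0 suggests XX, ~0.5 suggests XY."""
    return mean_depth(x_regions, depth_at) / mean_depth(auto_regions, depth_at)
```

A few thousand sampled positions per region set is plenty to separate a ratio near 1.0 from one near 0.5, which is why this finishes in about a minute.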
Thank you! Is there a resource that provides the coordinates for both the PAR and non-PAR non-repetitive regions on X?
As long as the contamination is not terrible (say, under 2%), it shouldn't really affect an analysis with reasonable cutoffs. Variant callers should not call a variant heterozygous with an allelic ratio under 1:20 (since that has at best a 1:1,000,000 chance of being correct, ignoring reference bias), and ±2% will not affect your ability to quantify the coverage ratio of the unique portions of the X and Y chromosomes to determine XX/XY/XXY/XYY. That said, in a human test I would consider 2% contamination from another person to be completely unacceptable; if that's detected, the test should be redone starting from the blood draw.
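The allelic-ratio argument can be sanity-checked with a binomial tail: under a true het (alt fraction around 0.5), the chance of observing an allelic ratio of 1:20 or lower shrinks extremely fast with read depth. A minimal stdlib sketch, with illustrative depths:

```python
# Binomial tail check of the allelic-ratio argument: how likely is an
# alt fraction of 1/20 or lower at a given depth if the site is truly het?

from math import comb

def p_alt_fraction_at_most(depth, max_frac=0.05, p_het=0.5):
    """P(alt_reads / depth <= max_frac) under a true-het binomial model."""
    max_alt = int(depth * max_frac)
    return sum(comb(depth, k) * p_het**k * (1 - p_het)**(depth - k)
               for k in range(max_alt + 1))
```

At 20x this probability is already on the order of 1e-5, and at 40x it is below 1e-9, so a caller that refuses to call such sites het loses essentially nothing.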
I do not know of any tools for determining sex chromosome numbers in non-XX/XY individuals, but they probably exist. BBMap used to have a mode that declared reads mapping to the PAR as uniquely mapped to X, rather than ambiguously mapped to X and Y, inspired by Complete Genomics' "PAR-called-in-X" classification of variants found in the PAR. That is a reasonable classification, since those genes have two copies and should thus be treated more like autosomes. But I don't study humans anymore, so that mode went away :)
Also, on the subject of contamination -
Contamination from handlers is not a big issue unless the library is highly amplified. JGI sequences only non-human organisms, and I wrote the human-contaminant removal pipeline; in the vast majority of libraries, with millions of reads, only tens to hundreds map to human. So, more than a few ppm of contamination from the handler is a rare event, and such a library should be reprocessed (when dealing with human data, where the contaminant reads cannot be easily removed).
Cross-contamination from libraries on the same plate, or multiplexed together, is a MUCH bigger issue. When the robotic systems have problems (samples physically sloshing into each other), this can lead to >10% contamination; but even barcode misidentification and barcode reagent contamination can lead to >1000ppm contamination. MiSeq has too small a volume to use multiplexing on human exome-capture, so it's fairly safe, but HiSeq is particularly prone to this kind of issue. Therefore, I highly recommend using dual barcodes (at a minimum) and being conservative (favoring het calls) when dealing with multiplexed data.
Speaking of contamination, you haven't addressed my concern in my comment on "hg19 vs hg38 in two pictures". If you still use your old pipeline, JGI data might be contaminated with human sequences that have no matches in other primates. This might be only a very minor concern, though, as the human reference genome is incomplete anyway. If you still think you are right, it would be good to give concrete examples of contamination and talk to the GRC. The GRC is highly experienced and will take the issue very seriously. That would be for the better of JGI, the GRC, and everyone who uses the human reference genome. Thanks.
Ah, well, I'll get around to it soon. It's just that it will take me a day or so to isolate the exact regions again.