Q1. Running
zcat ALL.chrX.BI_Beagle.20100804.genotypes.vcf.gz | grep -v ^## | cut -f 345 | cut -d ':' -f 1 | grep -v '\./\.' | grep -v '|' | head
yields
NA18981
0/0
0/0
1/1
0/0
0/0
0/0
0/0
0/0
0/0
while running
zcat ALL.chrX.BI_Beagle.20100804.genotypes.vcf.gz | grep -v ^## | cut -f 345 | cut -d ':' -f 1 | grep -v '\./\.' | head
yields
NA18981
0|0
0|0
0|0
0|0
0|0
0|0
0|0
1|0
0|0
Is NA18981 phased or not? Is it partially phased? If so, what rule or convention explains the partial phasing? (I know that microsatellite calls are left unphased in otherwise phased genomes, but I don't believe I've seen any in this file.)
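To see how the calls for one sample actually break down, it may help to tally genotype styles rather than eyeball `head` output. This is a minimal sketch, not part of any 1000 Genomes or Beagle tooling: `gt_tally` is a hypothetical helper name, and it assumes tab-delimited VCF text on stdin with GT as the first colon-separated subfield of the chosen column.

```shell
# Hypothetical helper: count phased, unphased, missing and haploid calls
# in one column of a VCF read from stdin. $1 is the 1-based column number.
gt_tally() {
  awk -F'\t' -v col="$1" '
    /^#/ { next }                          # skip meta lines and the #CHROM header
    {
      split($col, f, ":")                  # assume GT is the first subfield
      gt = f[1]
      if (index(gt, ".") > 0)      kind = "missing"
      else if (index(gt, "|") > 0) kind = "phased"
      else if (index(gt, "/") > 0) kind = "unphased"
      else                         kind = "haploid"   # no separator at all
      n[kind]++
    }
    END { for (k in n) print k, n[k] }
  ' | sort
}

# Usage on the file from the question (column 345 is NA18981 there):
#   zcat ALL.chrX.BI_Beagle.20100804.genotypes.vcf.gz | gt_tally 345
```

If both `phased` and `unphased` come back with non-zero counts, the sample really is a mix, and the next step would be to look at where the unphased calls sit.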
Q2. For somatic chromosomes (I've only checked this on chrs 1 and 2, but I assume this pattern is characteristic for all autosomal chromosomes) all 629 samples appear to be phased - that is, their genotypes at all positions are either unknown (./.) or phased (e.g. 0|1). So are all 629 samples really phased on all somatic chromosomes?
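The "all samples are phased" observation can be checked across every sample column at once instead of one column at a time. The sketch below is only an illustration under stated assumptions: `unphased_samples` is a hypothetical name, sample columns are assumed to start at column 10 (after FORMAT), and GT is assumed to be the first colon-separated subfield.

```shell
# Hypothetical helper: list every sample that has at least one unphased,
# non-missing genotype, together with the number of such calls.
unphased_samples() {
  awk -F'\t' '
    /^##/     { next }                                   # skip meta lines
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
    {
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                                # GT assumed first
        if (index(f[1], "/") > 0 && index(f[1], ".") == 0) bad[i]++
      }
    }
    END { for (i in bad) print name[i], bad[i] }
  ' | sort
}

# Usage (file name is a placeholder for any of the per-chromosome files):
#   zcat some_chromosome.vcf.gz | unphased_samples
```

Empty output would mean that every non-missing call in the file uses the `|` separator, which is what the question reports for chrs 1 and 2.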
Somewhat related: http://biostar.stackexchange.com/questions/5315/phased-and-unphased-genotypes-in-vcf-files-does-the-order-of-alleles-matter
You should check the documentation of the programs used to phase the data. They certainly contain a section about phasing chr X and haploids. Chr X is hemizygous in certain positions. Phasing can be problematic.
They used Beagle (judging from the chrX filename), and I'll have to read its manual sooner or later. Jarretinha, did you mean to say that the 1000 Genomes Project treats the pseudo-autosomal segments of Y as diploid segments at the corresponding chrX coordinates? That would make sense (as there is no Y chromosome anywhere in the data), but then why are there no non-diploid genotypes on chrX? (They should have no slash or pipe in them, just a number or a dot.)
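Whether the unphased or haploid calls are confined to particular stretches of chrX can be probed by summarising the position range of each genotype style for one sample. This is a rough sketch only: `gt_style_by_pos` is a hypothetical name, POS is assumed to be column 2, GT is assumed to be the first colon-separated subfield, and no pseudo-autosomal boundaries are built in, so the ranges have to be compared against the coordinates for the reference build by hand.

```shell
# Hypothetical helper: for one sample column ($1), print each genotype style
# with its call count and the smallest and largest position where it occurs.
gt_style_by_pos() {
  awk -F'\t' -v col="$1" '
    /^#/ { next }                               # skip meta lines and header
    {
      split($col, f, ":"); gt = f[1]            # GT assumed first subfield
      if (index(gt, ".") > 0) next              # ignore missing calls
      if (index(gt, "|") > 0)      style = "phased"
      else if (index(gt, "/") > 0) style = "unphased"
      else                         style = "haploid"
      pos = $2 + 0
      n[style]++
      if (!(style in min) || pos < min[style]) min[style] = pos
      if (!(style in max) || pos > max[style]) max[style] = pos
    }
    END { for (s in n) print s, n[s], min[s], max[s] }
  ' | sort
}

# Usage on the file from the question:
#   zcat ALL.chrX.BI_Beagle.20100804.genotypes.vcf.gz | gt_style_by_pos 345
```

Overlapping ranges for `phased` and `unphased` would suggest the split is not simply positional, while a `haploid` row would answer the "no slash/pipe" point directly.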
Please be more descriptive!
+1 I actually think the question is very clear.
ok, the question is very clear now after the edit. I've removed the -1.