If you download the reference from 1000 Genomes (1000g):
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
Why is there a double "RR" in chromosome 3?
zcat human_g1k_v37.fasta.gz.1 | grep -n R
1:>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
4154180:>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
8207504:>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
9221351:CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA
Are there any more surprises in those files?
'R' is an IUPAC ambiguity code: it stands for a purine, i.e. A or G.
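To check for other "surprises", one option is to count every IUPAC ambiguity code per sequence instead of grepping for a single letter (grepping for R also matches the headers, because "GRCh37" contains an R). The following is only a rough sketch: the function names are made up for illustration, and it assumes a plain gzip-compressed FASTA file. N is deliberately left out, since long runs of N are expected in the reference.

```python
import gzip
from collections import Counter

# IUPAC ambiguity codes other than N (R = A or G, Y = C or T, and so on).
AMBIGUITY_CODES = set("RYSWKMBDHV")


def count_ambiguity_codes(lines):
    """Count ambiguity codes per sequence, skipping FASTA header lines."""
    counts = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            counts[name] = Counter()
        elif name is not None:
            counts[name].update(c for c in line.upper() if c in AMBIGUITY_CODES)
    return counts


def count_in_fasta(path):
    """Hypothetical helper: run the count on a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return count_ambiguity_codes(handle)
```

Calling count_in_fasta("human_g1k_v37.fasta.gz") would return a dict keyed by sequence name; any sequence with a non-empty Counter contains ambiguity codes.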
Is it possible that this is a diploid genome and ambiguity codes are being used to represent sites where the paternal and maternal chromosomes differ? I haven't looked too much at the 1000g data.