I am using the 1000 Genomes human reference assembly file:
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
It worked fine with bwa, but not with other programs that expect only the letters A, C, T, G and N.
It seems that chr3 has an M on one line and two Rs on another:
$ perl -lnE 'print if /[^ACTGN]/' human_g1k_v37.fasta
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
CGCTACATAGCTGMCTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG
CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA
>4 dna:chromosome chromosome:GRCh37:4:1:191154276:1
>5 dna:chromosome chromosome:GRCh37:5:1:180915260:1
>6 dna:chromosome chromosome:GRCh37:6:1:171115067:1
>7 dna:chromosome chromosome:GRCh37:7:1:159138663:1
>8 dna:chromosome chromosome:GRCh37:8:1:146364022:1
>9 dna:chromosome chromosome:GRCh37:9:1:141213431:1
[...]
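The one-liner also prints every header line, because headers contain characters other than ACTGN. To get the sequence name and coordinate of each unexpected base instead, something like the following Python sketch might help. The function names `find_non_acgtn` and `scan_fasta` are made up for illustration. The sketch upper-cases each line before checking it, which is an assumption that soft-masked (lowercase) bases should not be reported.

```python
import gzip
import re

# Anything that is not A, C, G, T or N after upper-casing.
NON_ACGTN = re.compile(r"[^ACGTN]")


def find_non_acgtn(lines):
    """Yield (sequence_name, 1-based position, character) for each
    base outside ACGTN, skipping FASTA header lines."""
    name = None
    offset = 0  # bases already seen in the current sequence
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            offset = 0
            continue
        for match in NON_ACGTN.finditer(line.upper()):
            yield name, offset + match.start() + 1, match.group()
        offset += len(line)


def scan_fasta(path):
    """Hypothetical helper: open a plain or gzipped FASTA and print hits."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for name, pos, base in find_non_acgtn(handle):
            print(name, pos, base, sep="\t")
```

Calling `scan_fasta("human_g1k_v37.fasta")` would print one tab-separated line per hit (sequence name, position, base), which is easier to pass on to the maintainers than the raw lines.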
Assuming these are IUPAC ambiguity codes, why are they there if this is the reference (isn't GRCh37 haploid)? And why only there? Why not use an N instead? Is this on purpose?
I have accepted your answer to reward the time you spent on it. I now have confirmation from 1000 Genomes researchers that this seems to be an error propagated from earlier genome reference sequences. They will fix the sequence on the FTP site, but it is tricky because the BAM file headers contain the MD5 checksums of the reference sequences.
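To illustrate why the checksum makes the fix tricky, here is a rough Python sketch. It assumes the headers use the M5 tag of the @SQ lines as the SAM specification describes it: the MD5 of the sequence with whitespace and other non-printable characters removed and all letters upper-cased. The function name `sequence_md5` is made up for illustration.

```python
import hashlib


def sequence_md5(sequence_lines):
    """MD5 of a sequence in the style of the SAM @SQ M5 tag (assumed):
    keep only characters in the ASCII range 33-126, upper-case them,
    and hash the result."""
    md5 = hashlib.md5()
    for line in sequence_lines:
        cleaned = "".join(ch for ch in line if 33 <= ord(ch) <= 126)
        md5.update(cleaned.upper().encode("ascii"))
    return md5.hexdigest()


# Replacing a single ambiguity code with N yields a different checksum.
before = sequence_md5(["CGCTACATAGCTGM\n"])
after = sequence_md5(["CGCTACATAGCTGN\n"])
print(before != after)
```

So changing even one base in chromosome 3 gives that sequence a new checksum, and the value stored in existing BAM headers would no longer match the corrected reference.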
Thanks Pierre, but the problem is that this sequence is supposed to be the HUMAN GENOME REFERENCE prepared and used for the 1kg. So why doesn't it have the original Ns?
It is clear that ambiguities remain, but (I assume) always as an N, since the reference is haploid.
Because there was a strong ambiguity when the reference was sequenced. Some ambiguities remain.