Question

1000 Genomes Human Reference Assembly File With Non-ACTGN Chars In Chr 3, Why?

3

14.2 years ago

Pablo Marin-Garcia ★ 2.0k

I am using the 1000 Genomes human reference assembly file:

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

It worked fine with bwa, but not with other programs that expect only the letters A, C, T, G, and N.

It seems that chr3 has an M on one line and two Rs on another line.

 $  perl -lnE 'print if /[^ACTGN]/' human_g1k_v37.fasta
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
CGCTACATAGCTGMCTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG
CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA
>4 dna:chromosome chromosome:GRCh37:4:1:191154276:1
>5 dna:chromosome chromosome:GRCh37:5:1:180915260:1
>6 dna:chromosome chromosome:GRCh37:6:1:171115067:1
>7 dna:chromosome chromosome:GRCh37:7:1:159138663:1
>8 dna:chromosome chromosome:GRCh37:8:1:146364022:1
>9 dna:chromosome chromosome:GRCh37:9:1:141213431:1
[...]

Assuming these are IUPAC ambiguity codes, why are they there if this is the reference (isn't GRCh37 haploid)? Why only there? Why not use an N instead? Is it on purpose?

genome human • 4.4k views

ADD COMMENT • link updated 6.9 years ago by Biostar 20 • written 14.2 years ago by Pablo Marin-Garcia ★ 2.0k
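The Perl one-liner in the question prints whole lines, which makes it hard to tell where on the chromosome the unexpected letters sit. A minimal Python sketch of the same scan, reporting the sequence name and 1-based position of each hit, might look like this. The function name `find_non_acgtn` is hypothetical, and the check is case-sensitive like the original regex, so lowercase (soft-masked) bases would also be reported.

```python
import re
from typing import Iterable, Iterator, Tuple

# Anything other than the five expected uppercase letters.
NON_ACGTN = re.compile(r"[^ACGTN]")


def find_non_acgtn(lines: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (sequence name, 1-based position, character) for each unexpected base."""
    name = ""
    offset = 0  # bases already seen in the current sequence
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Header line: keep the first word as the name and restart counting.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            offset = 0
            continue
        for match in NON_ACGTN.finditer(line):
            yield name, offset + match.start() + 1, match.group()
        offset += len(line)


# Tiny in-memory example; a real run would pass an open file handle instead.
demo = [">3 dna:chromosome", "ACGTN", "CCRRG", ">4 dna:chromosome", "ACGT"]
for name, pos, char in find_non_acgtn(demo):
    print(name, pos, char, sep="\t")
```

For the real file, something like `find_non_acgtn(open("human_g1k_v37.fasta"))` should work on the uncompressed FASTA, since the function only needs an iterable of lines.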

score 2 · Answer 1 · 2011-06-21

2

14.2 years ago

Pierre Lindenbaum 166k

The reference genome for hg19_dna range=chr3:60830521-60830580 5'pad=0 3'pad=0 is:

CGCTACATAGCTG*N*CTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG

your sequence:

CGCTACATAGCTG*M*CTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG

The 'M' (A or C: your sample was heterozygous) is an 'N' in the reference sequence.

Second, the reference sequence:

CCNNGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAA

your sequence:

CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA

Again, the reference sequence contains two 'N's at that position.

ADD COMMENT • link 14.2 years ago by Pierre Lindenbaum 166k
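To make the comparison in this answer reproducible, here is a small Python sketch that diffs the two 60-base windows and expands the ambiguity code it finds. The table holds the standard IUPAC nucleotide codes. The helper name `diff_sequences` is hypothetical, and the sequences are copied from the answer with the `*` emphasis markers removed, on the assumption that they are highlighting only.

```python
# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def diff_sequences(ref, query):
    """Return (0-based index, ref base, query base) for each mismatch.

    Only the overlapping prefix is compared if the lengths differ.
    """
    return [(i, r, q) for i, (r, q) in enumerate(zip(ref, query)) if r != q]


# chr3:60830521-60830580 as quoted above (emphasis markers stripped).
hg19 = "CGCTACATAGCTGNCTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG"
g1k = "CGCTACATAGCTGMCTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG"

for index, ref_base, query_base in diff_sequences(hg19, g1k):
    print(index, ref_base, query_base, "can be any of", IUPAC[query_base])
```

The same table gives R as A or G, which covers the two letters on the second line.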

2

I have accepted your answer to reward the time you spent on it. I now have confirmation from 1000 Genomes researchers that this seems to be an error propagated from earlier genome reference sequences. They will fix the sequence on the FTP site, but that is tricky because the BAM file headers contain the MD5 checksums of the reference sequences.

ADD REPLY • link 14.2 years ago by Pablo Marin-Garcia ★ 2.0k
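The checksum problem mentioned above is easy to see in a few lines of Python. This is only a sketch: it assumes the checksum is computed the way the SAM specification describes the `M5` tag of `@SQ` header lines, that is, MD5 over the uppercase sequence with whitespace removed. The function name `sequence_md5` and the short test sequences are illustrative, not taken from the 1000 Genomes pipeline.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """MD5 hex digest of a sequence with whitespace stripped and letters uppercased."""
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# The first 20 bases of the chr3 window discussed above, with M and with N.
with_m = "CGCTACATAGCTGMCTTATT"
with_n = "CGCTACATAGCTGNCTTATT"

print(sequence_md5(with_m))
print(sequence_md5(with_n))
print("same checksum:", sequence_md5(with_m) == sequence_md5(with_n))
```

Changing a single base changes the digest for the whole chromosome, so BAM files whose headers record the old checksum would no longer match a corrected FASTA.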

1

Thanks Pierre, but the problem is that this sequence is supposed to be the HUMAN GENOME REFERENCE prepared and used for the 1kg. So why doesn't it have the original Ns?

ADD REPLY • link 14.2 years ago by Pablo Marin-Garcia ★ 2.0k

1

It is clear that ambiguities remain, but I assume they should always appear as N in the reference, which is haploid.

ADD REPLY • link 14.2 years ago by Pablo Marin-Garcia ★ 2.0k

0

Because there was a strong ambiguity when the reference was sequenced. Some ambiguities remain.

ADD REPLY • link 14.2 years ago by Pierre Lindenbaum 166k