Question

Why does the chr1.fa FASTA file have a bunch of Ns, and why is some of the DNA in lower case while the rest is in upper case?

1

11.7 years ago

sameer ▴ 10

Hi,

I have a couple of questions about the chr1.fa FASTA file at the link below:

Q1) Why does the beginning of the file consist of a long run of N characters? The IUPAC code for DNA sequences says that N means any nucleotide base, so does this mean that the sequencer could not determine the bases at the beginning of Chromosome 1? Also, starting at line 3550 or line 76,907, there are about a hundred more lines of Ns.

Q2) Why are parts of the DNA in lower case, while other parts are in upper case?

Link to the Chromosome 1 file: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/?C=S;O=A

fasta • 6.4k views

ADD COMMENT • link updated 2.6 years ago by jena ▴ 320 • written 11.7 years ago by sameer ▴ 10

1

I detect missing effort in reading the documentation found on that very same site.

ADD REPLY • link 11.7 years ago by Martin A Hansen 3.0k

0

I have the same issue! How did you resolve it?

ADD REPLY • link 3.1 years ago by j_eag ▴ 10

score 9 · Answer 1 · 2013-03-17

The Ns at the ends of the chromosomes represent unsequenced regions such as telomeres and heterochromatin; the internal runs of Ns mark other gaps in the assembly.
On the page http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ you can read: "Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. RepeatMasker was run with the -s (sensitive) setting. Using: Jan 29 2009 (open-3-2-7) version of RepeatMasker and RELEASE 20090120 of library RepeatMaskerLib.embl". So the sequence has been "soft-masked", i.e. the repeats are shown in lower case. The other way of masking is "hard-masking", in which repeats are replaced by Ns.
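
To see the difference in practice, here is a minimal Python sketch that counts N, lower-case (soft-masked) and upper-case bases in a FASTA stream, and shows what hard-masking a soft-masked sequence would look like. The function names `summarize_masking` and `hard_mask` and the toy sequence are made up for illustration; they are not part of any UCSC or RepeatMasker tool.

```python
import io
from collections import Counter


def summarize_masking(handle):
    """Count N, soft-masked (lower-case) and unmasked (upper-case) bases."""
    counts = Counter()
    for line in handle:
        line = line.strip()
        # Skip blank lines and FASTA header lines such as ">chr1".
        if not line or line.startswith(">"):
            continue
        for base in line:
            if base in "Nn":
                counts["N"] += 1
            elif base.islower():
                counts["soft_masked"] += 1
            else:
                counts["unmasked"] += 1
    return counts


def hard_mask(sequence):
    """Replace lower-case (soft-masked) bases with N; keep everything else."""
    return "".join("N" if base.islower() else base for base in sequence)


# Toy data standing in for a real chromosome file (hypothetical sequence).
example = io.StringIO(">toy\nNNNNacgtACGT\nttGGNN\n")
summary = summarize_masking(example)
print(dict(summary))
print(hard_mask("NNNNacgtACGTttGG"))
```

For the real file you would pass an open file handle instead of the `io.StringIO` object, for example `open("chr1.fa")`, or `gzip.open("chr1.fa.gz", "rt")` if the download is still compressed. The per-character loop is written for clarity, not speed, so expect it to be slow on a whole chromosome.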