Question

Sea of Ns in downloaded mm10 from NCBI and GRCm38.p6 primary sequence from Ensembl


5.4 years ago

akh22 ▴ 120

I downloaded two mouse reference genomes, mm10 (GCF_000001635.26_GRCm38.p6_genomic.fna.gz) from NCBI and GRCm38.p6 (Mus_musculus.GRCm38.dna.primary_assembly.fa.gz) from Ensembl, and looked at the sequences inside. Chr. 1 through 19 were all Ns. Only MT had valid nucleotide sequence. I'd appreciate it if anybody could explain what is going on.

The following is the FASTA index of GRCm38.p6.

1       195471971       56      60      61
10      130694993       198729952       60      61
11      122082543       331603253       60      61
12      120129022       455720564       60      61
13      120421639       577851795       60      61
14      124902244       700280520       60      61
15      104043685       827264527       60      61
16      98207768        933042331       60      61
17      94987271        1032886953      60      61
18      90702639        1129457403      60      61
19      61431566        1221671810      60      61
2       182113224       1284127292      60      61
3       160039680       1469275793      60      61
4       156508116       1631982857      60      61
5       151834684       1791099498      60      61
6       149736546       1945464817      60      61
7       145441459       2097697029      60      61
8       129401213       2245562569      60      61
9       124595110       2377120525      60      61
MT      16299   2503792275      60      61
X       171031299       2503808902      60      61
Y       91744698        2677690778      60      61
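An index like this says nothing about N content, but it does make it cheap to look at bases deep inside a chromosome instead of paging through the start. Below is a minimal sketch, assuming the five columns follow the usual samtools faidx layout (name, length, byte offset of the first base, bases per line, bytes per line) and that the FASTA is uncompressed. The function names `parse_fai_line` and `fetch` are made up for illustration.

```python
def parse_fai_line(line):
    """Split one index line into (name, length, offset, linebases, linewidth)."""
    name, length, offset, linebases, linewidth = line.split()[:5]
    return name, int(length), int(offset), int(linebases), int(linewidth)


def fetch(fasta_path, fai_line, start, count):
    """Return up to `count` bases starting at 0-based position `start`.

    Assumes an uncompressed FASTA whose layout matches the index line.
    """
    _, length, offset, linebases, linewidth = parse_fai_line(fai_line)
    # Never read past the end of this sequence.
    count = max(0, min(count, length - start))
    # Whole lines before `start` cost `linewidth` bytes each (bases + newline).
    pos = offset + (start // linebases) * linewidth + start % linebases
    # Read enough bytes to cover the bases plus any line terminators between them.
    nbytes = count + (count // linebases + 2) * (linewidth - linebases)
    with open(fasta_path, "rb") as fh:
        fh.seek(pos)
        raw = fh.read(nbytes)
    seq = raw.replace(b"\n", b"").replace(b"\r", b"")
    return seq[:count].decode("ascii")
```

For example, `fetch("Mus_musculus.GRCm38.dna.primary_assembly.fa", "1\t195471971\t56\t60\t61", 3000000, 60)` would read 60 bases starting three million bases into chromosome 1 (the file name here is assumed to be the uncompressed Ensembl download).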

RNA-Seq rna-seq Reference Genome • 2.3k views

ADD COMMENT • link 5.4 years ago by akh22 ▴ 120


You are sure it's all Ns, and not just the first few million bases of each chromosome that are all Ns?

ADD REPLY • link 5.4 years ago by swbarnes2 14k


Chr. 1 through 19 had all Ns. Only Mt have valid nucleotide sequences.

If you are saying the sequences of chr1 to chr19 are entirely composed of Ns, then you are wrong. Here are the stats for this genome release:

Total bases: 2,818,974,548
Total non-N bases: 2,739,538,976

The beginning of each chromosome is a long run of Ns; one has to scroll or page down considerably before seeing any non-N bases.

edit

Number of Ns, total number of bases, and fraction of Ns per chromosome for the GRCm38 assembly (for example, 0.0182 means about 1.8% of chromosome 1 is N):

1       3562779 195471971       0.0182265
2       3786573 182113224       0.0207924
3       3640825 160039680       0.0227495
4       4452505 156508116       0.028449
5       3915010 151834684       0.0257847
6       3400003 149736546       0.0227066
7       3586052 145441459       0.0246563
8       3789781 129401213       0.0292871
9       3438092 124595110       0.0275941
10      3627331 130694993       0.0277542
11      3336598 122082543       0.0273307
12      3206602 120129022       0.026693
13      3300446 120421639       0.0274074
14      3460134 124902244       0.0277027
15      3390370 104043685       0.032586
16      3188010 98207768        0.0324619
17      3279809 94987271        0.0345289
18      3250005 90702639        0.0358314
19      3225710 61431566        0.052509
X       7543304 171031299       0.0441048
Y       3620000 91744698        0.0394573
MT      0       16299   0
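Counts like these can be reproduced in a few lines. The following is only a sketch of one way to do it, not necessarily how the table above was produced; `open_fasta` and `n_stats` are hypothetical helper names, and the code assumes a plain or gzip-compressed FASTA that fits the usual one-header-per-sequence layout.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, transparently handling a .gz suffix."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def n_stats(path):
    """Return {sequence name: (n_count, total_bases, n_fraction)}."""
    counts = {}
    name = None
    with open_fasta(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                counts[name] = [0, 0]
            elif name is not None:
                # Count both hard-masked (N) and lowercase (n) unknown bases.
                counts[name][0] += line.count("N") + line.count("n")
                counts[name][1] += len(line)
    return {
        key: (n, total, n / total if total else 0.0)
        for key, (n, total) in counts.items()
    }
```

Calling `n_stats` on the downloaded assembly and printing each entry would give one row per chromosome in the same column order as above.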

ADD REPLY • link 5.4 years ago by h.mon 35k

score 1 · Answer 1 · 2019-08-01

Chr. 1 through 19 had all Ns. Only Mt have valid nucleotide sequences.

This is wrong; you 'just' have some Ns for the unknown and the telomeric regions.

$ wget -q -O - "http://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/chr1.fa.gz" | gunzip -c | grep -v "^>" | cat -n | grep -v NNNNNNNNNNNNN -m3 -C 3
 59998  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 59999  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 60000  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 60001  ttctgtttctattttgtggttactttgaggagagttggaattaggtcttc
 60002  tttgaaggtctggtagaactctgcattaaacccatctggtcctgggcttt
 60003  tttttttttttttttttttttttgggtgggagactattgatgactgcctc

score 0 · Answer 2 · 2019-08-02


5.4 years ago

akh22 ▴ 120

My bad. There was nothing wrong with the sequence. It turned out that I was using an OSX wrapper for gzip, and for some unknown reason it sometimes corrupts .gz files on my Mac. I uncompressed the files manually and they are fine.

Thanks.
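For anyone hitting something similar, one way to rule out a damaged archive before inspecting the sequence is to stream through the whole .gz file and see whether decompression completes. This is a rough sketch using only Python's standard library; `gzip_is_intact` is a made-up name, and it only tells you whether the archive itself decompresses cleanly, not whether a particular unpacking tool handles it correctly.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so the trailing CRC and length get checked too.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Bad header or CRC (OSError), truncation (EOFError), bad stream (zlib.error).
        return False
    return True
```

If this returns True for the download but the unpacked file still looks wrong, the decompression tool is the more likely culprit.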

ADD COMMENT • link 5.4 years ago by akh22 ▴ 120