I was comparing the RefSeq (GCF_000001405.39_GRCh38.p13_genomic.fna) and Ensembl (Homo_sapiens.GRCh38.dna.toplevel.fa) human genomes for GRCh38.p13, and I do not understand what Ensembl has done to end up with such weirdness.
First of all, the sizes were strange: 3.1 GB (RefSeq) vs 60 GB (Ensembl). I was thinking Ensembl must have a lot of interesting information missing from RefSeq... but actually not: most of the data are Ns.
Total sequences: 639 (RefSeq) vs 639 (Ensembl)
Total letters: 3272089205 (RefSeq) vs 63147197748 (Ensembl)
Total Ns: 161368591 (RefSeq) vs 60044190151 (Ensembl)
So the Ensembl genome contains 60 billion Ns. Is there any reason for that?
Could someone count the number of Ns in the Ensembl genome to check whether I counted properly and the error is not on my side?
Why is the number of sequences the same, then?
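For anyone who wants to cross-check these counts, here is a minimal sketch of how they could be reproduced. It is not necessarily how the numbers above were obtained. The function name fasta_stats is hypothetical, and the sketch assumes a standard FASTA file (plain or gzipped) and counts both "N" and soft-masked "n" as Ns.

```python
import gzip


def fasta_stats(path):
    """Return (n_sequences, n_letters, n_Ns) for a FASTA file.

    Assumption: headers start with '>' and every other line is sequence.
    Both 'N' and soft-masked 'n' are counted as Ns.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    n_seqs = n_letters = n_ns = 0
    with opener(path, "rt") as handle:
        # Stream line by line so a 60 GB file never has to fit in memory.
        for line in handle:
            if line.startswith(">"):
                n_seqs += 1
            else:
                seq = line.strip()
                n_letters += len(seq)
                n_ns += seq.count("N") + seq.count("n")
    return n_seqs, n_letters, n_ns
```

Calling fasta_stats("Homo_sapiens.GRCh38.dna.toplevel.fa") should return the three totals for the Ensembl file. Expect it to take a while on a file of that size.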
It's really surprising that, even after adding "regions not assembled into chromosomes and N padded haplotype/patch regions", Ensembl has fewer nucleotides than RefSeq (excluding Ns).
The FASTA format used for "haplotype/patch regions" is really inefficient. Storing 60 billion Ns sounds crazy...
If you want an authoritative answer, consider sending in a help desk ticket. There must be a specific application or reason for producing the "toplevel" format, though it is confusing for many users.
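For reference, the "excluding Ns" comparison mentioned above follows directly from the totals reported in the question. A small sketch of the arithmetic, using only those reported numbers (the variable names are just illustrative):

```python
# Totals as reported in the question (letters and Ns per assembly file).
totals = {
    "RefSeq": {"letters": 3_272_089_205, "n": 161_368_591},
    "Ensembl": {"letters": 63_147_197_748, "n": 60_044_190_151},
}

# Non-N nucleotides = total letters minus Ns.
non_n = {name: t["letters"] - t["n"] for name, t in totals.items()}

for name, value in non_n.items():
    print(f"{name}: {value:,} non-N letters")
# RefSeq: 3,110,720,614 non-N letters
# Ensembl: 3,103,007,597 non-N letters

print(f"difference: {non_n['RefSeq'] - non_n['Ensembl']:,}")
# difference: 7,713,017
```

So, going by the reported counts, the Ensembl toplevel file has about 7.7 million fewer non-N letters than the RefSeq file, despite being roughly twenty times larger overall.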