WTF with the Ensembl human genome?
3.8 years ago
Juke34 8.9k

I was comparing the RefSeq (GCF_000001405.39_GRCh38.p13_genomic.fna) and Ensembl (Homo_sapiens.GRCh38.dna.toplevel.fa) human genomes for GRCh38.p13, and I do not understand what Ensembl has done to end up with such weirdness.

First of all, the file size was strange: 3.1 GB (RefSeq) vs 60 GB (Ensembl). I thought Ensembl must contain a lot of interesting information that is missing from RefSeq, but actually not: most of the data are Ns.

Total sequences: 639 (RefSeq) vs 639 (Ensembl)
Total letters: 3272089205 (RefSeq) vs 63147197748 (Ensembl)
Total Ns: 161368591 (RefSeq) vs 60044190151 (Ensembl)

So the Ensembl genome contains 60 billion Ns. Is there any reason for that?

Could someone count the number of Ns in the Ensembl genome to check that I counted properly and that the error is not on my side?
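For anyone who wants to repeat the count independently, here is a minimal sketch in plain Python. It streams the FASTA line by line, so the 60 GB file never has to fit in memory. The function names are my own, and it assumes that both upper-case `N` and soft-masked lower-case `n` should be counted, which may differ from how the numbers above were produced.

```python
import gzip


def fasta_stats(lines):
    """Return (n_sequences, n_letters, n_N) for an iterable of FASTA lines."""
    n_seqs = n_letters = n_n = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: one new sequence record.
            n_seqs += 1
        else:
            # Sequence line: count all letters, and N/n separately.
            n_letters += len(line)
            n_n += line.count("N") + line.count("n")
    return n_seqs, n_letters, n_n


def fasta_stats_from_path(path):
    """Same as fasta_stats, reading a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return fasta_stats(handle)


# Tiny in-memory example (toy data, not real genome sequence).
toy = [">seq1", "ACGTNN", "nnAC", ">seq2", "NNNN"]
print(fasta_stats(toy))
```

On the real data the call would be something like `fasta_stats_from_path("Homo_sapiens.GRCh38.dna.toplevel.fa")`; expect it to take a while on a file of that size.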

genome ensembl refseq • 1.6k views
3.8 years ago
GenoMax 147k

The Ensembl toplevel genome file also contains haplotype and patch regions, so it is huge compared to the "primary" assembly. The primary assembly file is the one you need to compare with RefSeq. Let me find the relevant README file.

Edit: Here is the README link.

---------
TOPLEVEL
---------
These files contains all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromsomes, regions not assembled into chromosomes and
N padded haplotype/patch regions.

and

-----------------
PRIMARY ASSEMBLY
-----------------
Primary assembly contains all toplevel sequence regions excluding haplotypes
and patches. This file is best used for performing sequence similarity searches
where patch and haplotype sequences would confuse analysis. If the primary
assembly file is not present, that indicates that there are no haplotype/patch
regions, and the 'toplevel' file is equivalent.
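To see which records are responsible for the padding, a per-record breakdown may help. The sketch below reports the length and N count of each FASTA record and flags those that are mostly N. The function name and the 0.9 threshold are arbitrary choices of mine, not anything Ensembl defines, and the record names in the example are toy data.

```python
def per_record_n_counts(lines):
    """Yield (name, length, n_count) for each record in an iterable of FASTA lines."""
    name, length, n_count = None, 0, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length, n_count
            # Record name = first whitespace-delimited token of the header.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = n_count = 0
        elif line and name is not None:
            length += len(line)
            n_count += line.upper().count("N")
    if name is not None:
        yield name, length, n_count


# Toy example: one ordinary record and one that is almost entirely padding.
toy = [">toy_chrom some description", "ACGTACGTNN",
       ">toy_patch", "NNNNNNNNNN", "NNNNNNNNNA"]
THRESHOLD = 0.9  # arbitrary cut-off for "mostly N"
for rec_name, rec_len, rec_n in per_record_n_counts(toy):
    flag = "mostly N" if rec_len and rec_n / rec_len > THRESHOLD else ""
    print(rec_name, rec_len, rec_n, flag)
```

Run over the toplevel file, this should show whether the Ns are concentrated in the haplotype/patch records, as the README text suggests.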

Why is the number of sequences the same, then?


It's really surprising that, even with the regions not assembled into chromosomes and the N-padded haplotype/patch regions included, Ensembl has fewer nucleotides than RefSeq (excluding Ns):

3 103 007 597 <= Ensembl
3 110 720 614 <= Refseq

The FASTA format used for the haplotype/patch regions is really inefficient. Storing 60 billion Ns sounds crazy...
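The non-N totals above follow directly from the counts in the original post (total letters minus Ns). A quick check, using only the figures already reported in this thread:

```python
# (total letters, total Ns) as reported in the original post.
counts = {
    "RefSeq": (3272089205, 161368591),
    "Ensembl": (63147197748, 60044190151),
}

# Non-N nucleotides = total letters - Ns.
non_n = {name: total - n for name, (total, n) in counts.items()}

for name, (total, n) in counts.items():
    print(f"{name}: non-N = {non_n[name]:,}  N fraction = {n / total:.1%}")

# How many more non-N nucleotides RefSeq has than Ensembl.
print("RefSeq - Ensembl:", non_n["RefSeq"] - non_n["Ensembl"])
```

So RefSeq carries about 7.7 million more non-N bases than the Ensembl toplevel file, while roughly 95% of the Ensembl file is N.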


If you want an authoritative answer, consider submitting a help desk ticket. There must be a specific application or reason for producing the toplevel format, even though it confuses many users.
