How Many Bytes Is A Complete Human Genome?
13.8 years ago
Brian ▴ 60

As a computationalist who is a newbie to this field of science, and not a bioinformatician, I have a question that relates to the configuration of IT infrastructure in a start-up genomics enterprise. If I need to receive an entire human genome (either in one blob or in many bloblets!) from someone, what are the best methods and formats for transmission, and how big would it be? It may be impractical to send them around attached to an email right now ;-} but what if I needed to design an IT infrastructure for the future where it was possible to archive and transmit whole human genomes?


As Jarretinha points out, the data from the sequencer could be much bigger than the assembled genome. But how much bigger for an NGS machine? Any experiences?

next-gen sequencing genome sequence format • 28k views

I think you need to think a little bit more about your question. Nowadays people at the dry bench receive numerous huge files with genomes in redundant pieces. Do you want a raw, high-quality genome sequence or a real example of an NGS machine's output?


Just to mention, NGS human data for personal genomics is only useful if you're able to reconstruct haplotypes for all chromosome pairs. Remember that we're diploid, and dominance and Mendelian inheritance count for a lot. 3e9 is the haploid size without heterochromatin! Our true genome is much bigger! Most sequence files in databases are just a consensus. So, you really must specify your needs and aims in order to get a useful answer.

ADD REPLY
13.8 years ago

The size of the human genome is about 3E9 bytes/bases.

The UCSC stores the human genome as a set of gzipped/zipped FASTA files. See:

For example, the size of the zipped genome is about 899M (chromFa.zip).

When four bases are packed into each byte (the .2bit format), the size is 770M (hg18.2bit), but you'll need an extra tool to decipher the data.
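To make the packing idea concrete, here is a minimal Python sketch that stores four bases per byte. The names `pack` and `unpack` and the base-to-code mapping are my own choices for illustration. This is not the real .2bit file layout, which also carries headers and extra records (for example for runs of N), so treat it only as a picture of where the roughly fourfold saving comes from.

```python
# Illustrative only: hypothetical helpers, not the actual .2bit file format.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}  # assumed mapping, 2 bits per base
BASES = "TCAG"                            # inverse of CODE, indexed by code


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte is read the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 3])
    return "".join(seq[:n_bases])


if __name__ == "__main__":
    packed = pack("ACGTAC")
    print(len(packed), unpack(packed, 6))
```

Note that the number of bases has to be stored separately, because the last byte may be only partly filled.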

some papers about DNA compression have also been published:

So, you can easily exchange a genome via an FTP site, on a DVD, on dropbox.com, etc.


I'm going to be a pedant and point out that it cannot be both 3e9 bytes and bases, because a byte is 8 bits but a base is two bits. So a genome is 6e9 bits, or 7.5e8 bytes. This, of course, ignores methylation though.
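The arithmetic can be checked in a couple of lines of Python. The 3e9 figure is the approximate haploid size quoted in this thread, and `size_in_bytes` is just a hypothetical helper name.

```python
GENOME_BASES = 3_000_000_000  # approximate haploid genome size from the thread


def size_in_bytes(n_bases: int, bits_per_base: int) -> int:
    """Storage needed for n_bases at a fixed number of bits per base."""
    return n_bases * bits_per_base // 8


if __name__ == "__main__":
    # One ASCII character (8 bits) per base versus a 2-bit code per base.
    print("ASCII text  :", size_in_bytes(GENOME_BASES, 8), "bytes")
    print("2-bit packed:", size_in_bytes(GENOME_BASES, 2), "bytes")
```

So 3e9 is the right number of bytes only if each base is stored as one text character.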


Excuse my genomic ignorance, but what do you mean by "ignores methylation"?


"a base is two bits"

A base may be two bits, though doesn't it need a position too, to be represented? I suppose you could use the byte offset in the binary files to represent position, though that seems potentially more restrictive than explicitly storing the position.


That's precisely how images are stored, because it would take many times the amount of information to explicitly store the position.

In a 1080p image, there are 2,073,600 pixels (assuming 1 channel). To store the position of each 8-bit pixel explicitly, you'd need about 21 extra bits per pixel for a linear index (or 22 bits as separate x and y coordinates, 11 bits each).

So, 21 bits per pixel, times 2,073,600 pixels = 43,545,600 bits, or about 5.4 MB just to store the positions, compared with about 2 MB for the pixel values themselves.

Now, this example is about images, and a genome would be a 1D array, not 2D, and the numbers depend on the length of the sequence, so there's a lot of variability there, but my general point stands.
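For anyone who wants to redo this estimate for other sizes, here is a small Python sketch. The helper names are hypothetical, and it assumes a single linear index per item, with positions running from 0 to n-1.

```python
def index_bits(n_positions: int) -> int:
    """Bits needed to label n_positions distinct positions (0 .. n-1)."""
    return max(1, (n_positions - 1).bit_length())


def position_overhead_bits(n_items: int) -> int:
    """Total bits spent if every item carries its own explicit index."""
    return n_items * index_bits(n_items)


if __name__ == "__main__":
    pixels = 1920 * 1080          # 1080p, one channel
    bases = 3_000_000_000         # approximate genome size from the thread
    for name, n in (("1080p image", pixels), ("genome", bases)):
        print(name, index_bits(n), "bits per position,",
              position_overhead_bits(n) // 8, "bytes of positions in total")
```

For the genome the index needs 32 bits per base, which is sixteen times the 2 bits of the base itself. That is why the offset in the file is normally used as the position.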

13.8 years ago
Casbon ★ 3.3k

See my comment on Pierre's answer that a genome is 3e9 bases which is 6e9 bits uncompressed. However, if you send it as ASCII text it would be 3e9 bytes.

Your question asks what the best method of sending whole human genomes is. This would be best done as a diff to the reference genome. About 0.1% of the genome varies between individuals, which is about 3e6 bases (this ignores the rare variants). These genotypes could be encoded in 6e6 bits (about 0.75 MB) if both parties know the variant locations. A useful encoding would be larger; for example, the Illumina SNP chip data is about 35MB uncompressed: https://github.com/msporny/dna/blob/master/ManuSporny-genome.txt
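As a rough sketch of the diff idea in Python: the functions below are hypothetical and only handle single-base substitutions between two sequences of equal length, so real variant formats (which must also cope with insertions, deletions and diploid genotypes) are a good deal richer. The size estimate at the end just repeats the arithmetic above.

```python
def snv_diff(reference: str, sample: str) -> list:
    """Return (position, sample_base) for every site where the two differ."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only, not indels")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diff: list) -> str:
    """Rebuild the sample sequence from the reference plus the diff."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


# Back-of-envelope size, using the figures quoted in this answer.
GENOME_BASES = 3_000_000_000
variant_sites = GENOME_BASES // 1000      # about 0.1% of sites vary
genotype_bytes = variant_sites * 2 // 8   # 2 bits per site, locations known

if __name__ == "__main__":
    ref, smp = "ACGTACGT", "ACGAACGT"
    d = snv_diff(ref, smp)
    print(d, apply_diff(ref, d) == smp)
    print(variant_sites, "sites,", genotype_bytes, "bytes")
```

If the locations are not known in advance, each entry also needs its position, which is part of why a practical encoding ends up larger.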

Copy number variants and tandem repeats have a natural compression (i.e. a run length encoding), but structural variants would need a diff style encoding.
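To illustrate what run-length encoding of a tandem repeat could look like, here is a minimal Python sketch. The function names are hypothetical, and it assumes the repeat unit is already known, which a real tool would have to discover for itself.

```python
def encode_repeats(seq: str, unit: str) -> list:
    """Collapse consecutive copies of `unit` into (unit, count) pairs.

    Anything that is not part of a run of two or more copies is kept
    as single-character (char, 1) literals.
    """
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    out = []
    i = 0
    while i < len(seq):
        count = 0
        while seq.startswith(unit, i + count * len(unit)):
            count += 1
        if count >= 2:
            out.append((unit, count))
            i += count * len(unit)
        else:
            out.append((seq[i], 1))
            i += 1
    return out


def decode_repeats(pairs: list) -> str:
    """Expand (text, count) pairs back into the original sequence."""
    return "".join(text * count for text, count in pairs)


if __name__ == "__main__":
    encoded = encode_repeats("ACAGCAGCAGT", "CAG")
    print(encoded, decode_repeats(encoded))
```

A copy number variant could be described the same way, as a region plus a count, whereas an inversion or translocation has no such compact count and needs the diff-style description.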

All in all, I would think an efficiently compressed human genome diff would be attachable by email, but I think it's too soon to guess the extra weight required for the haplotype information.


There was an interesting paper published recently about reference-based compression: http://genome.cshlp.org/content/early/2011/01/18/gr.114819.110.abstract
