Question

What Are Numbers Every Bioinformatician Should Know?

23

11.3 years ago

brentp 24k

Jeff Dean compiled a well-known set of numbers that every computer engineer should know. See http://www.quora.com/What-are-the-numbers-that-every-computer-engineer-should-know-according-to-Jeff-Dean

What are the sequencing related numbers that a bioinformatician should know? For example,

Reads from Illumina HiSeq 2000 lane: 350 million
Number of reads to cover 3.0 Gb of sequence at 30X: 900 million (for 2x100bp)
Cost per HiSeq 2000 lane: $2,500

If you put your answer in that format, I'll aggregate the best ones.

In addition to whatever you deem important, I'm interested in recommended read counts for RNA-Seq and ChIP-Seq, file sizes for BAM and fastq.gz files per number of reads, processing times for aligning to a 3.0 Gb reference with common aligners, etc.
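
The 900 million figure in the example above follows from a simple coverage calculation: reads = genome size x depth / read length. Here is a minimal Python sketch of that arithmetic. The function name `reads_needed` is made up for illustration, and each mate of a 2x100bp pair is counted as one read, which seems to be the convention used in the question.

```python
def reads_needed(genome_size_bp: int, depth: int, read_length_bp: int) -> int:
    """Number of reads needed to cover a genome at a given average depth."""
    total_bases = genome_size_bp * depth
    # Ceiling division, so a partial read still counts as a whole read.
    return -(-total_bases // read_length_bp)


human_30x = reads_needed(3_000_000_000, 30, 100)
print(f"{human_30x:,} reads")  # → 900,000,000 reads
```

At the 350 million reads per lane quoted above, that works out to roughly 2.6 lanes. This ignores duplicates, unmapped reads and uneven coverage, so treat it as a lower bound.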

knowledge career • 9.7k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 11.3 years ago by brentp 24k

9

"3": Never write code after 3 beers.

ADD REPLY • link 11.3 years ago by Aaronquinlan 12k

3

Correct. Two is optimal. Relevant xkcd: http://xkcd.com/323/

ADD REPLY • link 11.3 years ago by Daniel ★ 4.0k

1

At this point, there's always a relevant xkcd :)

ADD REPLY • link 11.3 years ago by Chris Miller 22k

2

Might be some relevant numbers here: Bioinformatics "Cheat Sheet"

ADD REPLY • link 11.3 years ago by Chris Miller 22k

1

Might want to use the "cost" price of sequencing technologies, because prices vary significantly according to country/continent/provider etc. (The cost price for a HiSeq lane is quite a bit lower than $2,500.)

ADD REPLY • link 11.3 years ago by gammyknee ▴ 210

1

We should ask this question every 1-2 years and see how these numbers change.

ADD REPLY • link 11.3 years ago by Giovanni M Dall'Olio 28k

1

Is 350 Mreads realistic for HiSeq? My lab usually counts on 200 Mreads, but we also do lots of different kinds of samples, so we can't count on every library behaving completely consistently, which maybe some groups can.

ADD REPLY • link 11.3 years ago by swbarnes2 14k

0

I wonder if it refers to single-end reads - I thought Illumina's reference value was a minimum of somewhere around 140M paired end reads?! (even though you sometimes get much more, of course)

ADD REPLY • link 11.3 years ago by Mikael Huss 4.8k

0

I'll look around and then update. I'm not confident on those numbers. I would call 140M read-pairs 280M reads.

ADD REPLY • link 11.3 years ago by brentp 24k

0

42: for geeky jokes at parties that nobody finds funny.

ADD REPLY • link 11.3 years ago by Michael 55k

score 11 · Answer 1 · 2013-08-20

11.3 years ago

Neilfws 49k

Percentage of everything that's crap: 90.

ADD COMMENT • link 11.3 years ago by Neilfws 49k

0

Funny. Sturgeon's Law is very quotable, was written 60+ years ago, and about Science Fiction, a genre I always liked, yet I never heard it before :)

ADD REPLY • link 11.3 years ago by Eric Normandeau 11k

score 7 · Answer 2 · 2013-08-27

There is a nice review, published just today, summarizing some numbers about genetic variability in the human genome:

Barbujani, Ghirotto, Tasso 2013: Nine things to remember about human genome diversity.

Since there has been a discussion on this matter in this thread, I'll summarize some of these numbers here. The paper contains many references, which I am too lazy to copy here... So, if you want to know more, just read the paper.

It is important to note that there is only slightly more diversity between individuals of two major continental groups than between individuals of the same population. Craig Venter and Jim Watson (both "caucasian") share fewer SNVs with each other than either of them shares with Seong-Jin Kim (a Korean scientist).

General Chimp/Human differences:

number of single-nucleotide differences between chimp and human: ~35 million (+5 million insertion/deletion events)
percentage of single-nucleotide changes between chimp and human: 1.23%
percentage of single-nucleotide changes that are fixed in chimp or in human: 1.06% out of the 1.23%

Variability between individuals:

number of SNVs in humans: ~65 million
on average, a pair of humans is expected to differ at 1 base in every 1,000

Fst, pre-1000 Genomes (Fst is a measure of genetic differentiation; 0 -> no differentiation between individuals; 1 -> highest genetic differentiation):

average Fst among major continental groups ranges from 0.05 to 0.13
genetic diversity in humans is far lower than in other primates. Genetic variance in humans is only 5-13% of the variance in other primates.

Fst, after 1000 Genomes:

Fst between African and Europeans: 0.071
Fst between African and Asians: 0.083
Fst between Asians and Europeans: 0.052
Fst in Gorilla populations: 0.38
Fst in Chimp: 0.32
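
For readers who want to see where such Fst values come from, here is a minimal sketch of Wright's definition, Fst = (H_T - H_S) / H_T, for a single biallelic site in two populations. It assumes equally sized populations in Hardy-Weinberg proportions. The function name and the example allele frequencies are invented for illustration and are not taken from the paper, which reports genome-wide estimates obtained with more careful estimators.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's Fst for one biallelic site in two equally sized populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    # H_S: expected heterozygosity within each population, averaged.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # H_T: expected heterozygosity if both populations were pooled.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # site is monomorphic everywhere: nothing to differentiate
    return (h_t - h_s) / h_t


print(fst_two_populations(0.5, 0.5))  # identical frequencies: no differentiation
print(fst_two_populations(0.0, 1.0))  # fixed for different alleles: maximal differentiation
print(round(fst_two_populations(0.4, 0.6), 3))  # a modest frequency difference
```

The modest frequency difference in the last call gives an Fst in the same low range as the human continental comparisons listed above.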

Allele sharing across continents:

81.2% of SNVs are present in all continental groups (12.4% if we consider haplotypes instead of SNVs)
less than 1% of SNVs are specific to a continent (11% if we consider haplotypes instead of SNVs)
only 0.06% of SNVs are specific to Eurasia

Haplotype sharing across continents:

2% of haplotype blocks are restricted to Asia
2% of haplotype blocks are restricted to Europe
25% of haplotype blocks are restricted to Africa

Major Genetic Groups:

According to Rosenberg et al (using the Structure software), all human individuals can be classified into 5 continental groups
Li et al confirmed the same on the HGDP panel
however, recent attempts failed to confirm this classification, claiming that it may be due to confounding effects.
races according to the US census system: 15 plus "other races". I recommend reading this book if you are interested in the use of race in science

score 6 · Answer 3 · 2013-08-19

The PHRED-offsets for all different FASTQ-standards:

Sanger has 0-93, using ASCII 33 to 126
Solexa/Illumina 1.0 has -5 to 62, using ASCII 59 to 126
Solexa/Illumina 1.3 has 0 to 62 using ASCII 64 to 126
Solexa/Illumina 1.8 has 0-93, using ASCII 33 to 126 (same as Sanger)

This leads to the question: What's a good/reliable quality score to use as a cut-off? I usually use either 30 or 40 in all formats, but I'm open to suggestions. Of course, how high your cutoff is depends on what you want to do with your data. SNP-scoring needs more reliable bases than genome assembly.
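
To make the offsets and cut-offs concrete, here is a small Python sketch that decodes a quality string with a given ASCII offset and converts a Phred score to an error probability using the standard definition Q = -10 * log10(P). The function names and the example quality strings are made up for illustration. The sketch handles Phred-scaled scores only; the old Solexa 1.0 scores (the -5 to 62 range above) are on a different, odds-based scale and are not converted here.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: float) -> float:
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


print(phred_scores("II5+", offset=33))  # Sanger / Illumina 1.8+ → [40, 40, 20, 10]
print(phred_scores("hhTJ", offset=64))  # Illumina 1.3          → [40, 40, 20, 10]

for cutoff in (20, 30, 40):
    print(f"Q{cutoff}: about 1 error in {round(1 / error_probability(cutoff)):,} bases")
```

So a cut-off of 30 means accepting roughly one wrong base call in a thousand, and 40 means one in ten thousand.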

Source

score 5 · Answer 4 · 2013-08-21

Fundamental statistics about the genome/transcriptome of the species being studied. For example, for human:

Contig length total 3.2 Gb.
Chromosome length total 3.1 Gb.
Base Pairs: 3,323,950,079
Coding genes: 20,774
Non coding genes: 22,493
Pseudogenes: 14,145
Gene transcripts: 194,846
Short Variants (SNPs, indels, somatic mutations): 54,965,377
Structural variants: 10,266,123

Ensembl makes these stats available in nice regularly updated reports for many species such as: human, mouse, rat, etc...

score 4 · Answer 5 · 2013-08-19

11.3 years ago

Zev.Kronenberg 12k

Just some ramblings:

log10(x) != log(x)

Know your model organism:

23 human chromosomes

20 mouse chromosomes

~3 million non-reference SNVs per human genome

transition to transversion ratio for human exomes ~ 2

SAM flags

4 - unaligned

12 - read and its mate both unaligned (4 + 8).
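
SAM flags are bit fields, so 12 is simply 4 (read unmapped) plus 8 (mate unmapped). A minimal Python sketch for decoding any flag is below. The bit values come from the SAM format specification; the short descriptions are paraphrased, and the names `SAM_FLAG_BITS` and `explain_flag` are made up for this example.

```python
# Bit values as defined in the SAM specification; descriptions are paraphrased.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> list[str]:
    """Return the description of every bit that is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(explain_flag(4))   # → ['read unmapped']
print(explain_flag(12))  # → ['read unmapped', 'mate unmapped']
print(explain_flag(99))
```

The last call decodes 99 (1 + 2 + 32 + 64), a flag you will see a lot in paired-end data: a properly paired first read whose mate is on the reverse strand.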

ADD COMMENT • link 11.3 years ago by Zev.Kronenberg 12k

1

I guess you're counting haploid number of autosomes + one of X/Y for chromosomes. Not to be confused with "all possible chromosomes": for humans 1-22, X, Y + M = 25.

ADD REPLY • link 11.3 years ago by Neilfws 49k

0

And you are talking about a diploid organism; for human: (1-22) * 2 + (XX or XY) + M = 47

ADD REPLY • link 11.3 years ago by JC 13k

1

Handy tool for explaining all SAM flags and their combinations: http://picard.sourceforge.net/explain-flags.html

ADD REPLY • link 11.3 years ago by Malachi Griffith 20k

0

Use it almost every day.

ADD REPLY • link 11.3 years ago by Zev.Kronenberg 12k

score 4 · Answer 6 · 2013-08-20

11.3 years ago

Woa ★ 2.9k

Check also the database of useful biological numbers: http://bionumbers.hms.harvard.edu/

ADD COMMENT • link 11.3 years ago by Woa ★ 2.9k

score 4 · Answer 7 · 2013-08-20

11.3 years ago

koukougogo ▴ 60

Taxonomy ID for human: 9606, for mouse: 10090

ADD COMMENT • link 11.3 years ago by koukougogo ▴ 60

0

E. coli K12: 511145

ADD REPLY • link 11.3 years ago by Asaf 10k

score 4 · Answer 8 · 2013-08-21

11.3 years ago

Malachi Griffith 20k

Reference genome build/version numbers according to UCSC and NCBI (or sequencing consortium for the species of interest).

For example, hg19 = GRCh37.

The UCSC releases FAQ has a lot of these.

ADD COMMENT • link 11.3 years ago by Malachi Griffith 20k

score 3 · Answer 9 · 2013-08-19

11.3 years ago

Istvan Albert 102k

Some ideas:

In the Sanger encoding, runs of quality characters that look like censored swearing in a comic (e.g. $#&%!) correspond to very low qualities. Runs composed of readable letters correspond to very high qualities.
One always has to be aware of the genome size of the organism that is being sequenced. That is usually the first question I ask when someone starts talking about a genome I haven't heard of before.
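
The "censored swearing" rule of thumb is easy to check by subtracting the Sanger offset of 33 from each character code. A quick sketch (the function name and the second example string are made up for illustration):

```python
def sanger_qualities(quality: str) -> list[int]:
    """Decode a Sanger-encoded (offset 33) quality string into Phred scores."""
    return [ord(ch) - 33 for ch in quality]


print(sanger_qualities("$#&%!"))  # → [3, 2, 5, 4, 0]
print(sanger_qualities("GHIIH"))  # → [38, 39, 40, 40, 39]
```

Punctuation sits at the bottom of the printable ASCII range, which is why it decodes to single-digit qualities, while uppercase letters land in the high 30s and above.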

ADD COMMENT • link 11.3 years ago by Istvan Albert 102k

1

Addendum to the genome size: Number of chromosomes is also important

ADD REPLY • link 11.3 years ago by Philipp Bayer 8.8k

score 3 · Answer 10 · 2013-08-21

11.3 years ago

Adam ★ 1.0k

I always found this book very useful for various 'should know' facts and figures: A Short Guide to the Human Genome

ADD COMMENT • link 11.3 years ago by Adam ★ 1.0k

score 2 · Answer 11 · 2021-01-02

3.9 years ago

ATpoint 85k

Funny to dig in old threads, when HiSeq2000 was state of the art :)

January 2021: NovaSeq 6000 S4 flow cell for €16,000, yielding 8 billion (yes, 8,000 million) 2x150bp spots.
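
For a rough sense of the change, here is a back-of-the-envelope comparison of cost per million reads, using only the figures quoted in this thread. The function name is made up, and the comparison is loose: the currencies differ (USD in 2013, EUR in 2021), the HiSeq figure is for one lane while the NovaSeq figure is for a whole flow cell, and the NovaSeq number counts spots (read pairs) whereas the HiSeq number was debated in the comments above.

```python
def cost_per_million_reads(run_cost: float, reads: float) -> float:
    """Cost of one million reads (or spots) for a run with the given total cost."""
    return run_cost / (reads / 1_000_000)


hiseq_2000_lane = cost_per_million_reads(2_500, 350e6)  # USD, 2013 figures from the question
novaseq_s4_cell = cost_per_million_reads(16_000, 8e9)   # EUR, 2021 figures, per spot

print(f"HiSeq 2000 lane:  {hiseq_2000_lane:.2f} per million reads")  # → 7.14
print(f"NovaSeq S4 cell:  {novaseq_s4_cell:.2f} per million spots")  # → 2.00
```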

ADD COMMENT • link 3.9 years ago by ATpoint 85k

score 1 · Answer 12 · 2013-08-20

11.3 years ago

Ryan Thompson ★ 3.6k

Roughly 15% of all human genetic variation is attributable to ~~"race"~~ population, and the other 85% exists between people of the same ~~race~~ population.

ADD COMMENT • link 11.3 years ago by Ryan Thompson ★ 3.6k

1

http://en.wikipedia.org/wiki/Human_Genetic_Diversity:_Lewontin's_Fallacy
http://westhunt.wordpress.com/2012/01/26/lewontins-argument/
See Sesardic's "Race: a social destruction of a biological concept". You'll find a PDF by googling.

ADD REPLY • link 11.3 years ago by Click downvote ▴ 720

0

The opposite is true: "race" is a purely social or political construction, coming from a time when nothing was known about population genetics. Race is a concept used to promote segregation, to exercise power to rule and exploit, and to justify oppression and genocide, based on mostly irrelevant phenotypes, coarse geographic borders, and other arbitrary differences. It is comparable to phrenology. Its deconstruction does nothing but good for the scientific community and society in general.

ADD REPLY • link 11.3 years ago by Michael 55k

0

Reading up on it, I might agree that there are good reasons not to use the term for humans, even though it seems hard from a biological perspective to contend that race is any less real in humans than in, e.g., dogs. gnxp has a good write-up that I agree with: http://blogs.discovermagazine.com/gnxp/2011/11/on-structure-variation-and-race/#.UhZEr2SshJU

Main points: 1) Human populations can be easily separated into plausible clusters using a random set of genetic markers

2) The differences between human populations are not trivial

ADD REPLY • link 11.3 years ago by Click downvote ▴ 720

0

Dog breeds are a product of artificial selection. They have been produced by dog breeders, by gradually selecting individuals over many generations. Since this is not the case for humans, I don't think that applying the concept of race to humans is the same as applying it to dogs.
Into how many clusters can human individuals be grouped? What is the logic for choosing the correct number of clusters?
I think that social and environmental differences are far more important than the genetics. In any case, it has been shown that the differences between individuals of the same population are only slightly higher in number than the differences between individuals of two continental groups. See the case of Jim Watson, Craig Venter and Seong-Jin Kim: the latter is more similar to each of the other two than they are to each other.

ADD REPLY • link 11.3 years ago by Giovanni M Dall'Olio 28k

0

I'm not going to continue this boring discussion, but some clever people have argued that many human groups are the products of artificial selection too: see the http://en.wikipedia.org/wiki/The_10,000_Year_Explosion (which got a favorable review in Scientific American)

And comparing in-group and between-group differences is like comparing apples and oranges and seems to be a deceitful way of using statistics: http://evoandproud.blogspot.no/2011/11/apples-oranges-and-genes.html

Henry Harpending's comment is worth noting too: "Notice also that there is a subtle sleight-of-mind in the formulations of Lewontin's 'finding' (It was published previously by Luca Cavalli). Given that 15% of the diversity is among groups and 85% within, we are diploids so of the 85% half is between individuals and half is between alleles within people."

ADD REPLY • link 11.3 years ago by Click downvote ▴ 720

0

This was an interesting discussion, thank you for participating in it.

ADD REPLY • link 11.3 years ago by Giovanni M Dall'Olio 28k

0

Where did you get these numbers? It is dangerous to play with genetics and races, without references.

ADD REPLY • link 11.3 years ago by Giovanni M Dall'Olio 28k

3

I think it's a misinterpretation of the wikipedia article on "Race" (http://en.wikipedia.org/wiki/Race_(human_classification)#Genetically_differentiated_populations):

Population geneticist Sewall Wright developed one way of measuring genetic differences between populations known as the Fixation index, which is often abbreviated to FST. This statistic is often used in taxonomy to compare differences between any two given populations by measuring the genetic differences among and between populations for individual genes, or for many genes simultaneously.[79] It is often stated that the fixation index for humans is about 0.15. This translates to an estimated 85% of the variation measured in the overall human population is found within individuals of the same population, and about 15% of the variation occurs between populations. These estimates imply that any two individuals from different populations are almost as likely to be more similar to each other than either is to a member of their own group.[6][62] Richard Lewontin, who affirmed these ratios, thus concluded neither "race" nor "subspecies" were appropriate or useful ways to describe human populations.[80] However, others have noticed that group variation was relatively similar to the variation observed in other mammalian species.[81][82][83]

Note how it says "population" not "race". Imho the notion "race" should be avoided in a scientific context because I agree with:

In an ongoing debate, some geneticists[who?] argue that race is neither a meaningful concept nor a useful heuristic device,[84] and even that genetic differences among groups are biologically meaningless,[85] because more genetic variation exists within such races than among them, and that racial traits overlap without discrete boundaries.[86]

ADD REPLY • link 11.3 years ago by Michael 55k

0

Thank you very much for your answer, Michael. If you are interested in a discussion of the concept of race in scientific terms, I can recommend the book "Fatal Invention" by Dorothy Roberts.

ADD REPLY • link 11.3 years ago by Giovanni M Dall'Olio 28k

0

"Fatal" does not sound like a scientific term to me. I read about the book on Amazon and it seems like Reductio ad Hitlerum for a few hundred pages to me. "Those who subscribe to the opinion that there are no human races are obviously ignorant of modern biology." Ernst Mayr.

ADD REPLY • link 11.3 years ago by Click downvote ▴ 720

0

I've read the book and it is not bad. It is true that it doesn't put scientists in a good position, but it does so in an objective way. We are talking about population genetics, a field which 100 years ago was called eugenics, and whose fathers studied how to improve the white race by means of applying laws. I liked the book because it exposes the history of population genetics in an objective (not scientific) way.

ADD REPLY • link 11.3 years ago by Giovanni M Dall'Olio 28k

0

Well, that's why I originally put "race" in quotes, but on further consideration, I realize that even that was not good enough. I've edited my post.

ADD REPLY • link 11.3 years ago by Ryan Thompson ★ 3.6k