What Are Numbers Every Bioinformatician Should Know?
12
23
11.3 years ago
brentp 24k

There is a well-known set of numbers by Jeff Dean that every computer scientist should know. See http://www.quora.com/What-are-the-numbers-that-every-computer-engineer-should-know-according-to-Jeff-Dean

What are the sequencing related numbers that a bioinformatician should know? For example,

  • Reads from Illumina Hi-Seq 2000 lane: 350 million
  • Number of reads to cover 3.0 Gb of sequence at 30X: 900 million (for 2x100bp, i.e. 450 million read pairs)
  • Cost per Hi-Seq 2000 lane: $2,500
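
The read-count figure in this list follows from simple arithmetic: reads needed = genome size x coverage / read length. Here is a minimal Python sketch using the numbers quoted in the question. The function names are made up for illustration, and real runs lose reads to duplicates and unmapped sequence, so treat the result as a lower bound.

```python
def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Minimum number of reads that gives the requested average coverage."""
    total_bases = genome_size_bp * coverage
    # Ceiling division on integers, so no floating-point rounding is involved.
    return -(-total_bases // read_length_bp)


def lanes_needed(n_reads, reads_per_lane):
    """Whole lanes required to obtain n_reads."""
    return -(-n_reads // reads_per_lane)


# Numbers taken from the question: 3.0 Gb genome, 30X, 100 bp reads,
# 350 million reads and $2,500 per HiSeq 2000 lane.
reads = reads_needed(3_000_000_000, 30, 100)
lanes = lanes_needed(reads, 350_000_000)
cost = lanes * 2500
print(reads, lanes, cost)
```

With the question's figures this comes to 900 million reads, which does not fit in two lanes, so three lanes would be needed.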

If you put your answer in that format, I'll aggregate the best ones.

In addition to whatever you deem important, I'm interested in recommended read counts for RNA-Seq and ChIP-Seq, file sizes for BAMs and fastq.gz's per number of reads, processing times for aligning to a 3.0 Gb reference with common aligners, etc.

knowledge career • 9.7k views
ADD COMMENT
9

"3": Never write code after 3 beers.

ADD REPLY
3

Correct. Two is optimal. Relevant xkcd: http://xkcd.com/323/

ADD REPLY
1

At this point, there's always a relevant xkcd :)

ADD REPLY
2

Might be some relevant numbers here: Bioinformatics "Cheat Sheet"

ADD REPLY
1

Might want to use the "cost" price of sequencing technologies, because prices vary significantly according to country/continent/provider etc (cost price for a HiSeq lane is quite a bit lower than $2,500)

ADD REPLY
1

We should ask this question every 1-2 years and see how these numbers change.

ADD REPLY
1

Is 350 Mreads realistic for HiSeq? My lab usually counts on 200 Mreads, but we also do lots of different kinds of samples, so we can't count on every library behaving completely consistently, which maybe some groups can.

ADD REPLY
0

I wonder if it refers to single-end reads - I thought Illumina's reference value was a minimum of somewhere around 140M paired end reads?! (even though you sometimes get much more, of course)

ADD REPLY
0

I'll look around and then update. I'm not confident on those numbers. I would call 140M read-pairs 280M reads.

ADD REPLY
0

42, for geeky jokes at parties that nobody finds funny.

ADD REPLY
11
11.3 years ago
Neilfws 49k

Percentage of everything that's crap: 90.

ADD COMMENT
0

Funny. Sturgeon's Law is very quotable, was written 60+ years ago, and about Science Fiction, a genre I always liked, yet I never heard it before :)

ADD REPLY
7
11.2 years ago

There is a nice review, published just today, summarizing some numbers about genetic variability in the human genome:

Since this matter has been discussed in this thread, I'll summarize some of these numbers here. The paper contains many references, which I am too lazy to copy here... So, if you want to know more, just read the paper.

It is important to note that there is only slightly more diversity between individuals of two major continental groups than between individuals of the same population. Craig Venter and Jim Watson (both "caucasian") share fewer SNVs with each other than either of them shares with Seong-Jin Kim (a Korean scientist).

General Chimp/Human differences:

  • number of single-nucleotide differences between chimp and human: ~35 million (+5 million insertion/deletion events)
  • percentage of single-nucleotide changes between chimp and human: 1.23%
  • percentage of single-nucleotide changes that are fixed in chimp or in human: 1.06% out of the 1.23%

Variability between individuals:

  • number of SNVs in humans: ~65 million
  • on average, a pair of humans is expected to differ at 1 base in every 1,000

Fst, pre-1000 Genomes (Fst is a measure of genetic differentiation; 0 -> no differentiation between individuals; 1 -> highest genetic differentiation):

  • average Fst among major continental groups ranges from 0.05 to 0.13
  • genetic diversity in humans is far lower than in other primates. Genetic variance in humans is only 5-13% of the variance in other primates.

Fst, after 1000 Genomes

  • Fst between African and Europeans: 0.071
  • Fst between African and Asians: 0.083
  • Fst between Asians and Europeans: 0.052
  • Fst in Gorilla populations: 0.38
  • Fst in Chimp: 0.32
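
For readers who have not met Fst before, one common textbook form is Fst = (H_T - H_S) / H_T, where H_S is the average expected heterozygosity within populations and H_T is the expected heterozygosity of the pooled population. The sketch below applies this to a single biallelic locus in two equally sized populations. The function name and the example allele frequencies are illustrative assumptions, not values from the review, and published estimates use more careful estimators across many loci.

```python
def fst_two_populations(p1, p2):
    """Fst at one biallelic locus, given the frequency of the same allele
    in two populations of equal size (a simplifying assumption)."""
    # Expected heterozygosity within each population: 2p(1-p).
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population.
    p_pooled = (p1 + p2) / 2
    h_total = 2 * p_pooled * (1 - p_pooled)
    if h_total == 0:
        return 0.0  # allele fixed or absent everywhere: no variation to partition
    return (h_total - h_within) / h_total


print(fst_two_populations(0.5, 0.5))  # identical frequencies
print(fst_two_populations(0.0, 1.0))  # alternate alleles fixed
print(fst_two_populations(0.3, 0.7))  # intermediate case
```

Identical allele frequencies give 0, populations fixed for alternate alleles give 1, and everything else falls in between.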

Allele sharing across continents

  • 81.2% of SNVs are present in all continental groups (12.4% if we consider haplotypes instead of SNVs)
  • less than 1% SNVs are specific to a continent (11% if we consider haplotypes instead of SNVs)
  • only 0.06% of SNVs are specific to Eurasia

Haplotype Sharing across continents

  • 2% haplotype blocks restricted to Asia
  • 2% haplotype blocks restricted to Europe
  • 25% haplotype blocks restricted to Africa

Major Genetic Groups

  • According to Rosenberg et al (the Structure software), all human individuals can be classified into 5 continental groups
  • Li et al confirmed the same, on the HGDP panel
  • however, recent attempts failed to confirm this classification, claiming that it may be due to confounding effects.
  • races according to the US census system: 15 plus "other races". I recommend reading this book if you are interested in the matter of race in scientific use
ADD COMMENT
0

Very nice, this could be inserted into the wikipedia article as a reference. Somebody willing to add it?

ADD REPLY
0

This is a good write-up, but if you think this means there cannot be substantial and systematic differences due to biology between different human groups you are guilty of Lewontin's fallacy. See http://infoproc.blogspot.no/2008/01/no-scientific-basis-for-race.html

Furthermore, it does not make much sense to compare the difference between groups with the difference within groups. See http://evoandproud.blogspot.no/2011/11/apples-oranges-and-genes.html

That "genetic diversity in humans is far lower than in other primates" might be true, but the story is different in canines. I quote "there is less mtDNA difference between dogs, wolves and coyotes than there is between the various ethnic groups of human beings, which are recognized as belonging to a single species" - Evolution of working dogs J. Serpell (Ed.), The Domestic Dog: Its Evolution, Behaviour and Interactions with People, Cambridge University Press

ADD REPLY
0

This may be useful for SNV vs SNP reference: SNP, DIP, SNV notation

ADD REPLY
6
11.3 years ago

The PHRED-offsets for all different FASTQ-standards:

  • Sanger has 0-93, using ASCII 33 to 126
  • Solexa/Illumina 1.0 has -5 to 62, using ASCII 59 to 126
  • Solexa/Illumina 1.3 has 0 to 62 using ASCII 64 to 126
  • Solexa/Illumina 1.8 has 0-93, using ASCII 33 to 126 (same as Sanger)
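
To make the offsets concrete, here is a minimal Python sketch that turns a FASTQ quality string into numeric scores. The dictionary keys and the function name are labels invented for this example. The offsets of 33 and 64 follow from the ASCII ranges listed above (for instance, ASCII 59 for a score of -5 implies an offset of 64).

```python
# ASCII offset for each FASTQ variant; the key names are illustrative labels.
QUALITY_OFFSETS = {
    "sanger": 33,
    "solexa": 64,        # scores can go down to -5
    "illumina1.3": 64,
    "illumina1.8": 33,   # same as Sanger
}


def decode_qualities(quality_string, encoding="sanger"):
    """Convert a FASTQ quality string to a list of integer scores."""
    offset = QUALITY_OFFSETS[encoding]
    return [ord(char) - offset for char in quality_string]


print(decode_qualities("!I~", "sanger"))   # lowest, a typical high, and highest Sanger scores
print(decode_qualities(";h", "solexa"))    # ';' is the lowest possible Solexa character
```

Decoding with the wrong offset shifts every score by 31, which is why it matters to know which variant produced a file.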

This leads to the question: What's a good/reliable quality score to use as a cut-off? I usually use either 30 or 40 in all formats, but I'm open to suggestions. Of course, how high your cutoff is depends on what you want to do with your data. SNP-scoring needs more reliable bases than genome assembly.
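
When choosing a cutoff it helps to remember what a Phred score means: Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. A short Python sketch of the conversion (the function name is made up for this example):

```python
def phred_to_error_probability(q):
    """Estimated probability that a base with Phred score q is miscalled."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} bases")
```

So a cutoff of 30 tolerates roughly one wrong call per 1,000 bases, and 40 roughly one per 10,000.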

Source

ADD COMMENT
5
11.3 years ago

Fundamental statistics about the genome/transcriptome of the species being studied. For example, for human:

  • Contig length total 3.2 Gb.
  • Chromosome length total 3.1 Gb.
  • Base Pairs: 3,323,950,079
  • Coding genes: 20,774
  • Non coding genes: 22,493
  • Pseudogenes: 14,145
  • Gene transcripts: 194,846
  • Short Variants (SNPs, indels, somatic mutations): 54,965,377
  • Structural variants: 10,266,123

Ensembl makes these stats available in nice regularly updated reports for many species such as: human, mouse, rat, etc...

ADD COMMENT
4
11.3 years ago

Just some ramblings:

log10(x) != log(x)
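
A quick illustration in Python, where (as in many languages) the plain `log` function is the natural logarithm. The variable names are just for this example.

```python
import math

x = 1000
natural = math.log(x)     # base e
base_ten = math.log10(x)  # base 10
base_two = math.log2(x)   # base 2, common for fold changes

print(natural, base_ten, base_two)
```

The three values differ by constant factors, so mixing them up silently rescales every fold change or p-value transform downstream.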

Know your model organism:

23 human chromosomes

20 mouse chromosomes

~3 million non-reference SNVs per human genome

transition to transversion ratio for human exomes ~ 2
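
For reference, transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) changes, and transversions are everything else. Here is a minimal Python sketch of the ratio for a list of single-nucleotide variants. The function names and the toy variant list are invented for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(variants):
    """Transition/transversion ratio for (ref, alt) single-base pairs."""
    transitions = sum(1 for ref, alt in variants if is_transition(ref, alt))
    transversions = len(variants) - transitions
    return transitions / transversions


toy_variants = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(toy_variants))  # 4 transitions, 2 transversions
```

A call set whose ratio is far from the expected value is a common warning sign of false positives.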

SAM flags

4 - unaligned

12 - read and its mate both unaligned (paired reads).
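
SAM flags are bit fields, so a value like 12 is just 4 + 8. Here is a minimal Python sketch that decodes a flag into its component bits. The bit values come from the SAM specification; the short names and the function name are labels chosen for this example.

```python
# Bit values from the SAM specification; the names are informal labels.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def explain_flag(flag):
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(explain_flag(4))    # ['unmapped']
print(explain_flag(12))   # ['unmapped', 'mate_unmapped']
print(explain_flag(99))   # a typical properly paired first read
```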

ADD COMMENT
1

I guess you're counting haploid number of autosomes + one of X/Y for chromosomes. Not to be confused with "all possible chromosomes": for humans 1-22, X, Y + M = 25.

ADD REPLY
0

And if you are talking about a diploid organism, for human: (1-22) * 2 + (XX or XY) + M = 47

ADD REPLY
1

Handy tool for explaining all SAM flags and their combinations: http://picard.sourceforge.net/explain-flags.html

ADD REPLY
0

Use it almost every day.

ADD REPLY
4
11.3 years ago
Woa ★ 2.9k

Check also the database of useful biological numbers: http://bionumbers.hms.harvard.edu/

ADD COMMENT
4
11.3 years ago
koukougogo ▴ 60

Taxonomy ID for human: 9606, for mouse: 10090

ADD COMMENT
0

E. coli K12 511145

ADD REPLY
4
11.3 years ago

Reference genome build/version numbers according to UCSC and NCBI (or sequencing consortium for the species of interest).

For example, hg19 = GRCh37.

The UCSC releases FAQ has a lot of these.

ADD COMMENT
3
11.3 years ago

Some ideas:

  1. In the Sanger encoding, runs of quality characters that look like censored swearing in a comic (e.g. $#&%!) correspond to very low qualities. Runs composed of readable letters correspond to very high qualities.
  2. One always has to be aware of the genome size of the organism that is being sequenced. That is usually the first question I ask when someone starts talking about a genome I haven't heard of before.
ADD COMMENT
1

Addendum to the genome size: Number of chromosomes is also important

ADD REPLY
3
11.3 years ago
Adam ★ 1.0k

I always found this book very useful for various 'should know' facts and figures: A Short Guide to the Human Genome

ADD COMMENT
2
3.9 years ago
ATpoint 85k

Funny to dig in old threads, when HiSeq2000 was state of the art :)

January 2021: NovaSeq 6000 S4 flowcell for €16,000, yielding 8 billion (yes, 8,000 million) 2x150bp spots.

ADD COMMENT
1
11.3 years ago
Ryan Thompson ★ 3.6k

Roughly 15% of all human genetic variation is attributable to differences between populations, and the other 85% exists between people of the same population.

ADD COMMENT
1

http://en.wikipedia.org/wiki/Human_Genetic_Diversity:_Lewontin's_Fallacy http://westhunt.wordpress.com/2012/01/26/lewontins-argument/ See Sesardic's "Race: a social destruction of a biological concept". You'll find a PDF by googling.

ADD REPLY
0

The opposite is true: "race" is a purely social or political construction, coming from a time when nothing was known about population genetics. Race is a concept used to promote segregation, to exercise power to rule and exploit, and to justify oppression and genocide, based on mostly irrelevant phenotypes, coarse geographic borders, and other arbitrary differences. It is comparable to phrenology. Its deconstruction does nothing but good for the scientific community and society in general.

ADD REPLY
0

Reading up on it, I might agree that there are good reasons not to use the term about humans, even though it seems hard from a biological perspective to contend that race is any less real in humans than in, e.g., dogs. gnxp has a good write-up that I agree with: http://blogs.discovermagazine.com/gnxp/2011/11/on-structure-variation-and-race/#.UhZEr2SshJU

Main points: 1) Human populations can be easily separated into plausible clusters using a random set of genetic markers

2) The differences between human populations are not trivial

ADD REPLY
0
  1. Dog races are a product of artificial selection. They have been produced by dog breeders, by gradually selecting individuals over many generations. Since this is not the case for humans, I don't think the concept of race can be applied to humans and dogs in the same way.
  2. Into how many clusters can human individuals be grouped? What is the logic for choosing the correct number of clusters?
  3. I think that social and environmental differences are far more important than genetics. In any case, it has been shown that the differences between individuals of the same population are only slightly higher in number than the differences between individuals of two continental groups. See the case of Jim Watson, Craig Venter and Seong-Jin Kim: the latter is more similar to each of the other two than they are to each other.
ADD REPLY
0

I'm not going to continue this boring discussion, but some clever people have argued that many human groups are the products of artificial selection too: see the http://en.wikipedia.org/wiki/The_10,000_Year_Explosion (which got a favorable review in Scientific American)

And comparing in-group and between-group differences is like comparing apples and oranges and seems to be a deceitful way of using statistics: http://evoandproud.blogspot.no/2011/11/apples-oranges-and-genes.html

Henry Harpending's comment is worth noting too: "Notice also that there is a subtle sleight-of-mind in the formulations of Lewontin's 'finding' (It was published previously by Luca Cavalli). Given that 15% of the diversity is among groups and 85% within, we are diploids so of the 85% half is between individuals and half is between alleles within people."

ADD REPLY
0

This was an interesting discussion, thank you for participating in it.

ADD REPLY
0

Where did you get these numbers? It is dangerous to play with genetics and races, without references.

ADD REPLY
3

I think it's a misinterpretation of the wikipedia article on "Race" (http://en.wikipedia.org/wiki/Race_(human_classification)#Genetically_differentiated_populations):

Population geneticist Sewall Wright developed one way of measuring genetic differences between populations known as the Fixation index, which is often abbreviated to FST. This statistic is often used in taxonomy to compare differences between any two given populations by measuring the genetic differences among and between populations for individual genes, or for many genes simultaneously.[79] It is often stated that the fixation index for humans is about 0.15. This translates to an estimated 85% of the variation measured in the overall human population is found within individuals of the same population, and about 15% of the variation occurs between populations. These estimates imply that any two individuals from different populations are almost as likely to be more similar to each other than either is to a member of their own group.[6][62] Richard Lewontin, who affirmed these ratios, thus concluded neither "race" nor "subspecies" were appropriate or useful ways to describe human populations.[80] However, others have noticed that group variation was relatively similar to the variation observed in other mammalian species.[81][82][83]

Note how it says "population" not "race". Imho the notion "race" should be avoided in a scientific context because I agree with:

In an ongoing debate, some geneticists[who?] argue that race is neither a meaningful concept nor a useful heuristic device,[84] and even that genetic differences among groups are biologically meaningless,[85] because more genetic variation exists within such races than among them, and that racial traits overlap without discrete boundaries.[86]

ADD REPLY
0

Thank you very much for your answer, Michael. If you are interested in a discussion of the concept of race in scientific terms, I can recommend the book "Fatal Invention" by Dorothy Roberts.

ADD REPLY
0

"Fatal" does not sound like a scientific term to me. I read about the book on Amazon and it seems like Reductio ad Hitlerum for a few hundred pages to me. "Those who subscribe to the opinion that there are no human races are obviously ignorant of modern biology." Ernst Mayr.

ADD REPLY
0

I've read the book and it is not bad. It is true that it doesn't put scientists in a good position, but it does so in an objective way. We are talking about population genetics, a field which 100 years ago was called eugenics, and whose fathers studied how to improve the white race by means of applying laws. I liked the book because it exposes the history of population genetics in an objective (not scientific) way.

ADD REPLY
0

Well, that's why I originally put "race" in quotes, but on further consideration, I realize that even that was not good enough. I've edited my post.

ADD REPLY
