Towards the end, the article mentions that advances like paired-end and longer reads can improve performance and alleviate some of these problems by reducing their incidence.
It seems like the idea in the end was to reduce the 'risk' of having problematic samples by increasing the number of replicates. However, just as better technology doesn't totally negate these risks, neither does increasing the number of replicates.
I guess the best approach then is to use better sequencing technology with increased numbers of replicates. If one doesn't have the money to do both, which ends up being the better route: better sequencing tech or more replicates? Which ends up being more cost-effective?
Can it be assumed that paired-end and longer reads will have the same dynamics when dealing with "bad" replicates?
I noticed that in the w/t samples the "bad" replicates seemed to occur within a more or less contiguous region. I wonder if the authors had balanced their culture plate(s). Hopefully they didn't just use a 96-well plate with one side having w/t and the other having their deletion.
I am one of the authors on the paper. It's great to see this discussion; shame I'm three months late.
Sequencing technology doesn't obviate the need for biological replicates; variability is a fundamental feature of biology and needs to be measured in all experiments. See this paper. The comment regarding paired-end data is that we may have been able to remove some of the artifacts had it not been single-end (SE) data; no guarantee that it would 'rescue' the bad samples, however.
No, we didn't use a 96-well plate. All 96 cultures were grown separately and the libraries were prepared in batches of 24, with 12 of each condition per batch, randomly assigned. The bad replicates were not consistent with batch or lane.
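For readers who want to picture that kind of layout, here is a minimal Python sketch of a balanced, randomized batch assignment: 48 samples per condition split into 4 batches of 24, with 12 of each condition per batch. The sample labels ("WT", "KO"), the function name and the seed are hypothetical, and this is only one way to get such a layout, not necessarily how the authors did it.

```python
import random

def assign_batches(n_per_condition=48, n_batches=4, seed=0):
    """Randomly split each condition evenly across batches (hypothetical helper)."""
    rng = random.Random(seed)
    per_batch = n_per_condition // n_batches  # 12 of each condition per batch
    batches = [[] for _ in range(n_batches)]
    for condition in ("WT", "KO"):  # labels are placeholders
        ids = [f"{condition}_{i:02d}" for i in range(1, n_per_condition + 1)]
        rng.shuffle(ids)  # random assignment of samples to batches
        for b in range(n_batches):
            batches[b].extend(ids[b * per_batch:(b + 1) * per_batch])
    for batch in batches:
        rng.shuffle(batch)  # randomize handling order within each batch
    return batches

batches = assign_batches()
for i, batch in enumerate(batches, start=1):
    n_wt = sum(s.startswith("WT") for s in batch)
    print(f"batch {i}: {len(batch)} samples, {n_wt} WT, {len(batch) - n_wt} KO")
```

Because every batch holds equal numbers of both conditions, a batch-level problem affects the two conditions equally instead of being confounded with the comparison of interest.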
Based on the description in this paper, quoted below, may I say that edgeR is one of the best tools for differential expression analysis of RNA-Seq data?
A key finding of this work is the demonstration that the read-count distribution of the majority of genes is consistent with the negative binomial model. Reassuringly, many of the most widely used RNA-seq DGE tools (e.g., edgeR, DESeq, cuffdiff, ...
Our findings favor the approach implemented in edgeR, where variance for one gene is squeezed towards a common dispersion calculated across all genes.
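To make "squeezed towards a common dispersion" concrete, here is a toy Python sketch. It is not edgeR's estimator (edgeR uses a likelihood-based empirical Bayes procedure that this does not reproduce); it only shows the idea of pulling noisy per-gene estimates toward a shared value. The counts, the method-of-moments estimate and the fixed `weight` are all assumptions made for illustration.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion from var = mu + phi * mu**2, floored at 0."""
    mu = mean(counts)
    return max((variance(counts) - mu) / (mu * mu), 0.0)

def squeeze(gene_disp, common_disp, weight=0.5):
    """Weighted average pulling a gene-wise estimate toward the common value."""
    return weight * common_disp + (1.0 - weight) * gene_disp

# Hypothetical counts for three genes across four replicates.
counts_by_gene = {
    "geneA": [12, 30, 7, 25],
    "geneB": [100, 140, 90, 115],
    "geneC": [55, 60, 52, 58],
}

gene_disp = {g: moment_dispersion(c) for g, c in counts_by_gene.items()}
common = mean(gene_disp.values())  # crude stand-in for a common dispersion
squeezed = {g: squeeze(d, common) for g, d in gene_disp.items()}

for g in counts_by_gene:
    print(f"{g}: gene-wise={gene_disp[g]:.3f}  squeezed={squeezed[g]:.3f}")
```

With only a handful of replicates the gene-wise estimates are very noisy, which is why borrowing information across genes helps; with many replicates the gene-wise estimate can carry more weight.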
Written 9.6 years ago by Gary; updated 22 months ago by Ram.
I agree that replicates are extremely important, especially for capturing biological variability.
However, I think a lot of people do experiments with no replicates or only duplicates. In other words, I think the proportion of researchers using triplicates is less than 80%.
Also, I found the paper interesting, but I think 48 replicates was a bit excessive. I think the point probably could have been made just as well with fewer than half that number.
In the above experiment, the replicates should have been termed technical replicates. There are always going to be biological differences due to the nature of biology (more like noise), but the linked paper was showing the impact of technical variability on the outcome of the experiment. The yeast were genetically homogeneous, so really the only major source of variance would have been plating effects or other batch effects. This is the whole point of a cell line: if you can't assume that two cultures of a cell line are supposed to be biologically equivalent, then why have cell lines? The only reason there will be differences is variance in how the protocol was performed. The only biological difference was the deletion of a single gene; aside from that, differences within genotypes are either natural noise or variance introduced by the experimentalist.
Say I had an experiment where I compared infected and uninfected cells from a cell line. If I had 3 replicates of each condition per timepoint, these would be more like technical replicates, NOT biological replicates. The cells and agent are the same, so the reason for having replicates is to determine how the protocol may have added variance to the experiment. In this case I need replicates to determine/mitigate the impact of any variation in how the experiment was performed (e.g. I had to use the restroom partway through). In other words, the replicates are there to capture/mitigate the variance the experimenter may have introduced.
If I'm getting PBMCs from infected and healthy patients, the situation changes. If I have three sick patients and three healthy ones, I have three biological replicates. These replicates allow me to capture the impact a given individual will make. In other words, I can tell if a response might be general to all infected individuals, or if it might be due to something common to only one of the three people.
However, in this case, because I still have to collect blood, process it and so on, there's plenty of risk of problems that might lead to bias through variance in performing the protocol. So although I'm able to capture my biological variance, I am still missing the technical variance. In this case the best thing would be to perform the processing/extraction/etc. three times on a single blood draw.
I don't think these differences are always explained clearly.
Written 9.6 years ago by pld; updated 22 months ago by Ram.
Gene expression patterns change over time, so there will be biological variation even if the genomes are 100% identical (similarly, I would not expect different cultures of the same cell line to have identical expression). I would define technical replicates as different libraries prepared from the same sample of extracted RNA.
Nevertheless, I agree that it is important to distinguish between biological and technical replicates. My understanding is that they did both: "The sample libraries were sequenced in seven of the eight lanes on an Illumina HiSeq2000, to give seven sequencing (technical) replicates for each biological replicate"
Exactly, and because you're a human and not a robot, there will be fluctuations in the time it took at each step to perform the protocol, leading to differences in gene expression. Not because there's biologically anything going on, but because of variance in the technical aspects of the protocol. We all do the bioinformatics here, but we can't forget that technical variation can occur long before the RNA is even extracted. Because of that, you need technical replicates to capture variance in other places as well.
I am arguing that there were no biological replicates in the article. I'm saying you can't have biological replicates from a single instance of a cell line; again, that's the point of a cell line. These replicates just measured technical variability at other places, such as their cell culture methods/practices, plating effects, operator error and so on.
Written 9.5 years ago by pld; updated 22 months ago by Ram.
The terms biological vs technical replicates are ones we struggled a lot with in the study. This is a very grey area.
Our experimental design was to replicate a typical cell-culture/line study: grow samples from the same stock and perturb some of them somehow (drug treatment, gene knock-out, siRNA, etc). All the 96 samples were from individual growths of yeast, so are biological replicates of the yeast strain and gene KO under study. We controlled for 'technical variation' as much as possible by using a block design at all steps of the experiment.
Comparing different yeast strains is a different type of experiment and not one that is typical in gene expression studies.
Most differential gene expression tools assume a model for the distribution of gene expression, typically either negative binomial (e.g. in DESeq, edgeR, cuffdiff) or log-normal (e.g. limma). One of the main aims of the experiment was to answer the question of what the true distribution of gene expression in RNA-seq data is, and how appropriate the assumed models are.
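As a rough illustration of what "assuming a model" means for a single gene, the Python sketch below compares how well a Poisson model and a negative binomial model (mean `mu`, dispersion `phi`, variance `mu + phi * mu**2`) explain one gene's replicate counts. The counts are made up, the dispersion is a simple method-of-moments estimate, and comparing log-likelihoods this way is only a sketch, not the goodness-of-fit procedure used in the paper.

```python
import math
from statistics import mean, variance

def poisson_logpmf(k, mu):
    """Log-probability of count k under a Poisson with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def nb_logpmf(k, mu, phi):
    """Log-probability of count k under a negative binomial (mean mu, dispersion phi)."""
    r = 1.0 / phi  # "size" parameter
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

# Hypothetical replicate counts for one gene (clearly overdispersed).
counts = [10, 25, 3, 40, 12, 30]
mu = mean(counts)
phi = (variance(counts) - mu) / mu**2  # moment estimate; assumes variance > mean

pois_ll = sum(poisson_logpmf(k, mu) for k in counts)
nb_ll = sum(nb_logpmf(k, mu, phi) for k in counts)
print(f"mean={mu}, dispersion={phi:.3f}")
print(f"Poisson log-likelihood:           {pois_ll:.2f}")
print(f"negative binomial log-likelihood: {nb_ll:.2f}")
```

A higher log-likelihood means the model explains the counts better. With only a few replicates this comparison is very noisy for any single gene, which is part of why testing the distributional assumption gene by gene needs many replicates.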
From some preliminary calculations we determined that we needed a large number of replicates to be able to measure the expression distribution for each gene in an experiment and then test the goodness of fit against the assumed models. We settled on 48 per condition as we could multiplex 96 per lane using standard protocols. I don't think we could have done the experiment reliably with fewer than 24 replicates, especially after losing a few 'bad' replicates.
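To give a feel for why measuring a per-gene distribution needs so many replicates, here is a back-of-the-envelope Python sketch. It uses the textbook result that, for normally distributed data, the sample variance from n replicates has a relative standard error of sqrt(2 / (n - 1)). This is an assumption-laden approximation (expression values are not exactly normal) and is not the authors' preliminary calculation.

```python
import math

def relative_se_of_variance(n):
    """Relative standard error of the sample variance for n normal observations."""
    if n < 2:
        raise ValueError("need at least two replicates to estimate a variance")
    return math.sqrt(2.0 / (n - 1))

for n in (3, 6, 12, 24, 48):
    print(f"n={n:2d}  relative SE of variance ~ {relative_se_of_variance(n):.2f}")
```

Under this approximation, the uncertainty on a variance estimated from triplicates is about as large as the variance itself, and it shrinks only slowly as replicates are added.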
I think I'm going to tape that preprint (and your article) to the door to our core facility.
Yeah, we hope this article gets more attention. There need to be more studies like this for other applications.