Question

Gene Expression Experiment Using NGS Data

3

15.0 years ago

Eric Normandeau 11k

Hi,

I am toying with a new Next Gen Sequencing dataset in which each sequence is tagged according to the individual from which it was extracted. In this 454 experiment, we received about 1.8 million sequences in total. cDNA was the starting material for this experiment so, in each contig (or gene), the number of reads from an individual is correlated to the level of expression of that gene in that individual.

What normalization steps should be applied to the per-individual sequence counts so that they can be used as a measure of expression level?

The two that come to mind immediately are:

1. Divide by the total number of sequences in each experimental group
2. Divide by the number of sequences in each individual
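
As a minimal sketch of the second option above (dividing by the number of sequences in each individual), the counts can be rescaled to reads per million. The function name, the nested-dict layout, and the contig and individual labels are assumptions made for illustration only; they are not part of the dataset described here.

```python
def counts_per_million(counts):
    """Rescale raw read counts to counts per million within each individual.

    counts: dict mapping individual -> dict mapping contig -> raw read count
    (a hypothetical layout chosen for this sketch).
    """
    normalized = {}
    for individual, per_contig in counts.items():
        total = sum(per_contig.values())
        if total == 0:
            # No reads for this individual: avoid dividing by zero.
            normalized[individual] = {contig: 0.0 for contig in per_contig}
            continue
        normalized[individual] = {
            contig: n * 1_000_000 / total for contig, n in per_contig.items()
        }
    return normalized


# Made-up counts: ind_B was sequenced ten times deeper than ind_A.
raw = {
    "ind_A": {"contig_1": 30, "contig_2": 70},
    "ind_B": {"contig_1": 300, "contig_2": 700},
}
cpm = counts_per_million(raw)
```

After rescaling, both individuals show the same value for contig_1, so a difference in sequencing depth alone no longer looks like a difference in expression.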

What else do you think should be done?

Thanks!

next-gen-sequencing gene-expression • 5.5k views

ADD COMMENT • link updated 17 months ago by Ram 45k • written 15.0 years ago by Eric Normandeau 11k

Is this a SAGE experiment (Serial analysis of gene expression)?

ADD REPLY • link 15.0 years ago by Paulo Nuin ★ 3.7k

Sounds like DGE (digital gene expression).

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

This is a 454 experiment. We are doing a few things with these data, including this exploration of gene expression differences between the study groups. I added a clarification about the NGS method to the question.

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

The technique you're using is one thing; what your data represent is another. As Jeremy pointed out, it looks like DGE, which is very similar to SAGE. My answer is below.

ADD REPLY • link 15.0 years ago by Paulo Nuin ★ 3.7k

Ram · Answer 1 · 2010-04-23

6

15.0 years ago

Paulo Nuin ★ 3.7k

I think your questions are very broad and there's no simple answer, especially because they involve more statistics than computation. SAGE/DGE data are very different from microarray data when it comes to analysis, and the straightforward methods used in microarray analysis sometimes cannot be applied here.

For this type of data, the best option that I found was edgeR, an R/Bioconductor package. Be sure to read the docs and the extra information that comes with the package.

http://www.bioconductor.org/packages/bioc/html/edgeR.html

ADD COMMENT • link updated 17 months ago by Ram 45k • written 15.0 years ago by Paulo Nuin ★ 3.7k

Thank you @nuin. What were the other options that you surveyed? Why was edgeR the best? Cheers

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

If I'm not mistaken, I tried another Bioconductor package that is no longer available (or no longer updated). edgeR is very simple to use, its manual is well written, and it gives good results, including nice graphs.

ADD REPLY • link 15.0 years ago by Paulo Nuin ★ 3.7k

@nuin, the link seems to be broken. The following link seems to be the right one now: http://bioconductor.org/packages/2.6/bioc/html/edgeR.html

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

score 2 · Answer 2 · 2010-04-23

15.0 years ago

Istvan Albert 102k

Some options that come to mind:

1. Use housekeeping genes (genes with stable, unchanged expression levels) to estimate variability.
2. As in cross-slide microarray normalization methods, you may want to assume that the average expression level is the same for each individual.
3. Use spiked controls (maybe a little late for that).
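
One way the housekeeping-gene idea above could be used for scaling is sketched below: each individual's counts are multiplied by a factor that makes its housekeeping total equal to the mean housekeeping total across individuals. This is only one possible reading of the suggestion, and the function name, data layout, and gene labels are hypothetical.

```python
def housekeeping_scale(counts, housekeeping):
    """Scale counts so every individual has the same housekeeping total.

    counts: dict mapping individual -> dict mapping gene -> raw read count.
    housekeeping: iterable of gene names assumed to be stably expressed.
    Both layouts are assumptions made for this sketch.
    """
    housekeeping = list(housekeeping)
    hk_totals = {
        individual: sum(per_gene.get(gene, 0) for gene in housekeeping)
        for individual, per_gene in counts.items()
    }
    if any(total == 0 for total in hk_totals.values()):
        raise ValueError("an individual has no housekeeping reads")
    # Reference level: mean housekeeping total across individuals.
    reference = sum(hk_totals.values()) / len(hk_totals)
    return {
        individual: {
            gene: n * reference / hk_totals[individual]
            for gene, n in per_gene.items()
        }
        for individual, per_gene in counts.items()
    }


# Made-up counts with two hypothetical housekeeping genes.
raw = {
    "ind_A": {"hk_1": 10, "hk_2": 10, "gene_X": 5},
    "ind_B": {"hk_1": 20, "hk_2": 20, "gene_X": 40},
}
scaled = housekeeping_scale(raw, ["hk_1", "hk_2"])
```

The result depends entirely on the housekeeping genes really being stable, so the chosen set should be checked before trusting the scaled values.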

Definitely look for artifacts introduced by PCR amplification.

ADD COMMENT • link 15.0 years ago by Istvan Albert 102k

Nice suggestions. Points 1 and 2 will be done. Too late, as you say, for spiked controls, but I'll keep the idea in mind. How would you look for artifacts introduced by PCR amplification?

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Unusually high counts are one indication. Basically, look for neighbouring or overlapping regions that have wildly different read coverage.
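
A rough sketch of that check, assuming coverage has already been summarized as read counts in consecutive windows along a contig. The window representation, the ratio threshold of 10, and the pseudocount are arbitrary assumptions, not values recommended in this thread.

```python
def flag_coverage_jumps(coverage, max_ratio=10.0, pseudocount=1.0):
    """Return indices i where windows i and i+1 differ by more than max_ratio.

    coverage: list of read counts for consecutive windows along one contig.
    max_ratio and pseudocount are arbitrary defaults chosen for this sketch.
    """
    flagged = []
    for i in range(len(coverage) - 1):
        # The pseudocount keeps the ratio defined when a window is empty.
        left = coverage[i] + pseudocount
        right = coverage[i + 1] + pseudocount
        ratio = max(left, right) / min(left, right)
        if ratio > max_ratio:
            flagged.append(i)
    return flagged


# Made-up coverage: the third window is far deeper than its neighbours.
windows = [12, 15, 400, 14, 11]
suspicious = flag_coverage_jumps(windows)
```

Here the boundaries on both sides of the deep window are flagged (indices 1 and 2). A flagged window is only a candidate for closer inspection, since a real expression difference between overlapping transcripts can produce a similar jump.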

ADD REPLY • link 15.0 years ago by Istvan Albert 102k

score 2 · Answer 3 · 2010-04-23

15.0 years ago

Jeremy Leipzig 23k

I would not reinvent the wheel here, since DGE has been around for 3 years or so. First do a literature search, starting with Ali Mortazavi's articles.

ADD COMMENT • link 15.0 years ago by Jeremy Leipzig 23k

Yeah, I do recommend "Mapping and quantifying mammalian transcriptomes by RNA-Seq" in Nature Methods. It deals with this kind of data and with many of the possible problems (segmental duplications, gene duplications, etc.). But since the samples are from different subjects without known genomes, these problems will be considerably amplified. Carefully chosen reference sequences are priority one.

ADD REPLY • link 15.0 years ago by Jarretinha 3.5k

Thanks for the references guys.

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Thanks for the references both of you!

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Ram · Answer 4 · 2010-04-28

Following the readings suggested in your replies, I stumbled upon a very recent R package called DESeq, which seems tailored to my application. Specifically, as they mention in the documentation, DESeq:

provides a powerful tool to estimate the variance in such data [RNA-seq and others] and test for differential expression.

Starting from a table of sequence counts (one line per gene, one column per sample, including proper treatment of replicates), it outputs (among many things) a list of p-values for the differential expression of genes between samples, compared pairwise. The documentation is quite complete and very clear.
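
For illustration, a count table of the shape described above (one line per gene or contig, one column per sample) could be tallied from per-read assignments roughly as follows. This is plain Python and not part of DESeq or its API; the function name and the (contig, sample) pair format are assumptions.

```python
from collections import Counter


def build_count_table(assignments):
    """Tally reads into a contig-by-sample count table.

    assignments: iterable of (contig, sample) pairs, one pair per read
    (a hypothetical input format chosen for this sketch).
    Returns (contigs, samples, table), where table[i][j] is the number of
    reads from samples[j] assigned to contigs[i].
    """
    tallies = Counter(assignments)
    contigs = sorted({contig for contig, _ in tallies})
    samples = sorted({sample for _, sample in tallies})
    # Counter returns 0 for (contig, sample) pairs that were never observed.
    table = [[tallies[(contig, sample)] for sample in samples] for contig in contigs]
    return contigs, samples, table


# Made-up read assignments for two contigs and two tagged individuals.
reads = [
    ("contig_1", "ind_A"),
    ("contig_1", "ind_A"),
    ("contig_1", "ind_B"),
    ("contig_2", "ind_B"),
]
contigs, samples, table = build_count_table(reads)
```

The table holds raw integer counts; writing it out as a tab-separated file would give something a count-based package can read.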

Just wanted to share! Here are the links to the DESeq package download and information pages and the 'companion paper':

Cheers!