Hi,
I am toying with a new next-generation sequencing (NGS) dataset in which each sequence is tagged according to the individual from which it was extracted. In this 454 experiment, we received about 1.8 million sequences in total. cDNA was the starting material, so for each contig (or gene), the number of reads from an individual is correlated with the level of expression of that gene in that individual.
What normalization steps should be applied to the per-individual sequence counts so that they can be used as a 'level of expression'?
The two that come to mind immediately are:
- Divide by the total number of sequences in each experimental group
- Divide by the number of sequences in each individual
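To make these two ideas concrete, here is a minimal Python sketch of both divisions. The data layout (a dict of contig -> individual -> read count), the function name `normalize_counts`, and the toy numbers are all made up for illustration; they are not taken from the real dataset.

```python
def normalize_counts(counts, groups):
    """Scale raw read counts by library size.

    counts: {contig: {individual: read_count}}   (hypothetical layout)
    groups: {individual: experimental_group}
    Returns (per_individual, per_group), both as fractions of total reads.
    """
    # Total number of reads obtained from each individual
    ind_totals = {}
    for per_ind in counts.values():
        for ind, n in per_ind.items():
            ind_totals[ind] = ind_totals.get(ind, 0) + n

    # Total number of reads obtained from each experimental group
    group_totals = {}
    for ind, total in ind_totals.items():
        g = groups[ind]
        group_totals[g] = group_totals.get(g, 0) + total

    per_individual = {}
    per_group = {}
    for contig, per_ind in counts.items():
        # Reads in this contig divided by that individual's total reads
        per_individual[contig] = {
            ind: (n / ind_totals[ind] if ind_totals[ind] else 0.0)
            for ind, n in per_ind.items()
        }
        # Pool reads by group, then divide by that group's total reads
        pooled = {}
        for ind, n in per_ind.items():
            g = groups[ind]
            pooled[g] = pooled.get(g, 0) + n
        per_group[contig] = {
            g: (n / group_totals[g] if group_totals[g] else 0.0)
            for g, n in pooled.items()
        }
    return per_individual, per_group


# Toy example: two contigs, three individuals, two groups (invented numbers)
toy_counts = {
    "contig1": {"A": 10, "B": 30, "C": 5},
    "contig2": {"A": 90, "B": 70, "C": 45},
}
toy_groups = {"A": "control", "B": "control", "C": "treated"}
per_individual, per_group = normalize_counts(toy_counts, toy_groups)
```

Multiplying the resulting fractions by 1,000,000 would give counts per million, which is easier to read than very small fractions. This sketch only corrects for sequencing depth and nothing else.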
What else do you think should be done?
Thanks!
Is this a SAGE experiment (Serial analysis of gene expression)?
Sounds like DGE (digital gene expression).
This is a 454 experiment. We are doing a few things with these data, including this exploration of gene expression differences between the study groups. I added a clarification about the NGS method to the question.
The technique you're using is one thing; what your data represent is another. As Jeremy pointed out, this looks like DGE, which is very similar to SAGE. My answer is below.