Hi all, just wondering if anyone has an idea of how to judge how complete an RNA-seq data set is? Of course, this should depend on which genome it comes from. Thanks in advance.
You can look at gene representation in some fraction of your reads (say 50%) and compare changes in coverage as you add another 10% or 25% of the reads, for example. You can do this in terms of the total number of genes or mRNA isoforms observed, as well as the representation of a few select genes that are expressed at high, moderate and low levels. Basically, you would do this to see where discovery (of expressed genes) starts to plateau.
I have seen this approach presented at genome conferences.
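To make the idea concrete, here is a minimal sketch of such a saturation curve in plain Python. It assumes you already have one gene assignment per mapped read (for example, from a read-counting step); the function name `saturation_curve`, the `min_reads` threshold and the toy data are all hypothetical and only there for illustration.

```python
import random
from collections import Counter


def saturation_curve(read_genes, fractions=(0.1, 0.25, 0.5, 0.75, 1.0),
                     min_reads=1, seed=0):
    """Return [(fraction, n_genes_detected)] for nested random subsamples.

    read_genes: one gene ID per mapped read (assumed input format).
    min_reads:  reads needed before a gene counts as detected (assumption).
    """
    rng = random.Random(seed)
    shuffled = list(read_genes)
    rng.shuffle(shuffled)  # any prefix of a shuffled list is a random subsample
    curve = []
    for frac in sorted(fractions):
        n = int(round(frac * len(shuffled)))
        counts = Counter(shuffled[:n])
        detected = sum(1 for c in counts.values() if c >= min_reads)
        curve.append((frac, detected))
    return curve


# Toy data: one highly, one moderately and one lowly expressed gene.
read_genes = ["geneA"] * 900 + ["geneB"] * 90 + ["geneC"] * 10
curve = saturation_curve(read_genes)
for frac, n_genes in curve:
    print(f"{frac:.0%} of reads: {n_genes} genes detected")
```

If the number of detected genes barely changes between the last few fractions, discovery has probably plateaued at your current depth. Because the subsamples are nested, the curve can only stay flat or go up.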
Edit (6 Oct 2011): I don't recall seeing data from the group who authored the paper Istvan mentioned, but the results are indeed similar to those I have heard and observed others discuss. I suggest taking a good look at their figure 1, showing saturation curves. There is, however, much more to this paper that should be explored for those facing similar issues of gene coverage and saturation.
For some ideas, consult the paper titled "Differential expression in RNA-seq: A matter of depth".
Ken, the GenePattern software has a tool that can help you to determine coverage by gene, locus, transcript, etc. - it is called RNAseqMetrics and is available on the GenePattern server at http://genepattern.broadinstitute.org. A publication on this tool is in process. For general information you can go to http://www.genepattern.org.
Best, Michael
Perhaps you could give a definition of what you mean by 'complete'. All known loci covered by some number of reads? All splice variants represented? If that's your definition, you'll require billions of reads for a mammalian genome. What are you actually looking for?
This question makes no sense as is. Please clarify.
Hi seidel and neilfws, by 'complete' I mean 'all known loci covered by some number of reads', as seidel pointed out. Thanks.
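With that definition, completeness could be summarized as the fraction of known loci that reach some minimum read count. The sketch below assumes you have a per-locus read count table and a list of annotated loci; the function name `fraction_loci_covered`, the threshold of 5 reads and the toy values are hypothetical, so pick a cutoff that suits your data.

```python
def fraction_loci_covered(counts_per_locus, known_loci, min_reads=5):
    """Return (fraction covered, sorted list of loci below the threshold).

    counts_per_locus: dict mapping locus ID -> read count (assumed input).
    known_loci:       all annotated locus IDs for the genome (assumed input).
    min_reads:        arbitrary cutoff for calling a locus covered.
    """
    loci = set(known_loci)
    if not loci:
        raise ValueError("known_loci must not be empty")
    # Loci absent from the count table are treated as having zero reads.
    covered = {l for l in loci if counts_per_locus.get(l, 0) >= min_reads}
    return len(covered) / len(loci), sorted(loci - covered)


# Toy example: four annotated loci, one of which has no reads at all.
known = ["locus1", "locus2", "locus3", "locus4"]
counts = {"locus1": 120, "locus2": 7, "locus3": 2}
fraction, missing = fraction_loci_covered(counts, known)
print(f"{fraction:.0%} of known loci covered; below threshold: {missing}")
```

Keep in mind that loci not expressed in your sample will never be covered, whatever the depth, so this number is best read alongside a saturation analysis.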