I am working on several datasets downloaded from NCBI SRA. For example, the zebrafish dataset accessible from here is used for my analysis. However, I am not sure how I can obtain the sequencing depth of this experiment from the metadata (say, the SraRunInfo.csv file). I summarized the data into a read count matrix and calculated the column sums:
Gata4.1   Gata4.2   Gata4.3   WT.1      WT.2      WT.3
16779626  37906530  12574486  19009161  21119399  15302189
It seems that different samples have different sequencing depths... is that correct? So, what is (are) the sequencing depth(s) for this experiment?
Thanks!
Thank you, Devon, for your comments. I think sequencing depth in RNA-Seq is an important aspect that biologists and statisticians must be aware of. For example, in this paper: http://www.ncbi.nlm.nih.gov/pubmed/24319002, the authors discuss sequencing depth and replication in RNA-Seq experiments. The data they analyzed can be accessed here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51403.
Now, for that dataset, I summarized the read counts, and the data having 30 million sequencing depth have column sums:
... and the data having 5 million sequencing depth have column sums:
Obviously we see the difference. My question is: how can we find the sequencing depth from NCBI, that is, from some metadata? It seems that different samples may have different depths (at least in the other datasets I checked). Thanks again for your suggestions.
They discuss read number, which is only vaguely related to depth. Read number is good to know and you already calculated it. Read depth is a largely meaningless concept in RNAseq.
Also, while you can get raw read counts from the metadata, that's not what's interesting. The interesting number is the number of aligned reads, which generally won't be available in SRA or GEO, since they contain only the raw reads.
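For the raw numbers, here is a minimal sketch of reading them out of a RunInfo table. It assumes the file has the usual `Run` and `spots` columns; the sample rows, accessions, and the function name `raw_read_counts` are made up for illustration. Note that a spot is one sequenced fragment, so for paired-end runs each spot corresponds to two reads.

```python
import csv
import io

# Hypothetical excerpt of an SraRunInfo.csv file.
# The accessions and numbers below are invented for illustration.
SAMPLE_RUNINFO = """Run,spots,bases,avgLength,LibraryLayout
SRR0000001,20000000,2000000000,100,SINGLE
SRR0000002,15000000,3000000000,200,PAIRED
"""


def raw_read_counts(handle):
    """Return {run accession: number of spots} from a RunInfo CSV handle."""
    counts = {}
    for row in csv.DictReader(handle):
        # Skip blank rows and repeated header rows, which can appear
        # when several RunInfo downloads are concatenated.
        if not row.get("Run") or row["Run"] == "Run":
            continue
        counts[row["Run"]] = int(row["spots"])
    return counts


counts = raw_read_counts(io.StringIO(SAMPLE_RUNINFO))
for run, spots in counts.items():
    print(run, spots)
```

With a real file you would pass `open("SraRunInfo.csv", newline="")` instead of the `StringIO` object. These are still raw counts, not aligned reads.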
Thanks for your reply. I processed the SRA files from NCBI following a published pipeline (e.g. http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html ) and used HTSeq to summarize the counts. I think that is the typical kind of data statisticians work with for methodological development. Correct me if I am wrong, but I think count-based approaches built on a read count table are, in some sense, suitable for DE analysis in RNA-Seq.
Yes, the counts from htseq-count are directly useable.
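If you want the per-sample totals from those files, a rough sketch follows. It assumes the two-column, tab-separated htseq-count output in which the special counters at the end are prefixed with `__` (for example `__no_feature`); the sample content and the function name `summarize_htseq` are hypothetical. The special counters are kept out of the gene total so that the sum reflects only reads assigned to features.

```python
import io

# Hypothetical htseq-count output: feature ID, tab, count.
# Gene names and numbers are invented for illustration.
SAMPLE_COUNTS = (
    "geneA\t120\n"
    "geneB\t0\n"
    "geneC\t35\n"
    "__no_feature\t40\n"
    "__ambiguous\t5\n"
    "__too_low_aQual\t0\n"
    "__not_aligned\t10\n"
    "__alignment_not_unique\t20\n"
)


def summarize_htseq(handle):
    """Return (reads assigned to features, dict of special counters)."""
    assigned = 0
    special = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, count = line.split("\t")
        if feature.startswith("__"):
            # Special counters are not genes; report them separately.
            special[feature] = int(count)
        else:
            assigned += int(count)
    return assigned, special


assigned, special = summarize_htseq(io.StringIO(SAMPLE_COUNTS))
print("assigned to features:", assigned)
print("not assigned:", sum(special.values()))
```

Running this once per sample gives the same column sums as summing the count matrix, provided the special rows were dropped before the matrix was built.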