RNA-Seq sequencing depth information from NCBI
10.7 years ago
alittleboy ▴ 220

I am working on several datasets downloaded from NCBI SRA. For example, I am using the zebrafish dataset accessible from here for my analysis. However, I am not sure how I can obtain the sequencing depth of this experiment from the metadata (say, the SraRunInfo.csv file). I summarized the data into a read count matrix and calculated the column sums:

Gata4.1    Gata4.2    Gata4.3    WT.1       WT.2       WT.3
16779626   37906530   12574486   19009161   21119399   15302189

It seems that different samples have different sequencing depths... is that correct? So, what is (are) the sequencing depth(s) for this experiment?

Thanks!

RNA-Seq seqencing-depth • 4.3k views
10.7 years ago

It's RNAseq, so why do you care about depth? It will vary by orders of magnitude between different genes. Yes, there will be different numbers of reads for each sample; that's completely normal.

A good estimate of the average depth (if you really wanted to calculate it) would be the numbers you already calculated, times the average read length (I'm assuming you did some trimming), divided by the size of whatever transcriptome you used. Realistically, you'd probably whittle down the transcripts by excluding those with no or very few reads in all of the samples, so perhaps use an "effective transcriptome size" instead.
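
As a rough illustration of that estimate, here is a minimal sketch using the column sums from the question. The read length and the effective transcriptome size below are placeholder assumptions, not values from this dataset, so the resulting numbers are only meaningful once you substitute your own.

```python
def average_depth(n_reads, read_length, transcriptome_size):
    """Rough average depth: total sequenced bases / transcriptome length."""
    if transcriptome_size <= 0:
        raise ValueError("transcriptome_size must be positive")
    return n_reads * read_length / transcriptome_size


# Column sums of the count matrix, copied from the question.
counts = {
    "Gata4.1": 16779626,
    "Gata4.2": 37906530,
    "Gata4.3": 12574486,
    "WT.1": 19009161,
    "WT.2": 21119399,
    "WT.3": 15302189,
}

# ASSUMPTIONS: neither value comes from the thread; replace with your own.
READ_LENGTH = 50                 # average read length after trimming, in bases
TRANSCRIPTOME_SIZE = 50_000_000  # effective transcriptome size, in bases

depths = {
    sample: average_depth(n, READ_LENGTH, TRANSCRIPTOME_SIZE)
    for sample, n in counts.items()
}

for sample, depth in depths.items():
    print(f"{sample}: ~{depth:.1f}x")
```

Keep in mind that this is a single average per sample; the per-gene depth will still differ by orders of magnitude.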


Thank you, Devon, for your comments. I think sequencing depth in RNA-Seq is an important aspect that biologists and statisticians must be aware of. For example, in this paper: http://www.ncbi.nlm.nih.gov/pubmed/24319002, the authors discuss sequencing depth and replication in RNA-Seq experiments. The data they analyzed can be accessed here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51403.

Now, for that dataset, I summarized the read counts. The samples with a sequencing depth of 30 million have column sums:

25578855 25820716 25741936 25179460 25896403 25558230 26002218 25561605 25318388 26253440 24026599 25906691 26154717 25869477

... and the samples with a sequencing depth of 5 million have column sums:

4270832 4301926 4287752 4189579 4314886 4260306 4336277 4261486 4217501 4372894 4002337 4312077 4360152 4317814

The difference is obvious. My question is: how can we find the sequencing depth from NCBI, that is, from the metadata? It seems that different samples may have different depths (at least in the other datasets I checked). Thanks again for your suggestions.


They discuss read number, which is only vaguely related to depth. Read number is good to know and you already calculated it. Read depth is a largely meaningless concept in RNAseq.


Also, while you can get raw read counts from the metadata, that's not what's interesting. The interesting number is the number of aligned reads, which generally won't be available in SRA or GEO, since they contain only the raw reads.
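
If you do want the raw numbers from the metadata, something like the sketch below could work. It assumes the RunInfo CSV has `Run` and `spots` columns, which is what such files typically contain, but check your own header. The example rows are made up for illustration and are not from any real run.

```python
import csv
import io


def raw_read_counts(handle):
    """Map each run accession to its spot count from a RunInfo-style CSV."""
    counts = {}
    for row in csv.DictReader(handle):
        run = row.get("Run")
        spots = row.get("spots")
        # Skip incomplete rows and header lines repeated inside the file.
        if not run or run == "Run" or not spots:
            continue
        counts[run] = int(spots)
    return counts


# Hypothetical two-run excerpt; accessions and numbers are invented.
example = io.StringIO(
    "Run,spots,bases,avgLength\n"
    "SRR0000001,20000000,1000000000,50\n"
    "SRR0000002,15000000,750000000,50\n"
)
runinfo_counts = raw_read_counts(example)
print(runinfo_counts)
```

For a real file you would pass an open handle instead, e.g. `with open("SraRunInfo.csv", newline="") as fh: raw_read_counts(fh)`. Note that for paired-end runs one spot corresponds to one read pair, and that none of this tells you how many reads actually align.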


Thanks for your reply. I processed the SRA files from NCBI following a published pipeline (e.g. http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html ) and used HTSeq to do the count summarization. I think that's the typical kind of data statisticians work with for methodological development. Correct me if I am wrong, but I think count-based approaches built on a read count table are, in some sense, suitable for DE analysis in RNA-Seq.


Yes, the counts from htseq-count are directly usable.
