Dear All,
This may be a trivial question, but what would be the best way to know whether resequencing a transcriptomic library at a higher depth will generate extra results? Let's assume I have a library that was sequenced at a depth of 5 million reads. The fraction of alignments with non-unique start sites (putative PCR duplicates) is around 40%. Now I want to know whether resequencing the same library at a depth of 20 million reads will add new results beyond the ones I already generated from the 5-million-read run; in other words, whether it is worth paying extra money if the deeper run doesn't add any new information. As a first test, I could run the same library at a depth of 10 million reads and perform the following comparative analyses on the results of the two runs:
1) Compare the number of expressed genes (>10 RPKM) between the 5-million-read and 10-million-read samples. If I find a substantial increase in the number of expressed genes, then sequencing the library at a higher depth makes sense. Similarly, I could compare the number of differentially expressed genes between conditions 1 and 2 at 5 million versus 10 million reads.
2) The same analysis as above, but for splice junctions. A substantial increase in the number of reads aligning across exon-exon junctions would suggest that deeper sequencing is worthwhile.
3) Combine the two runs and check whether the PCR-duplicate rate stays roughly the same (~40%) rather than shooting up dramatically; if it stays flat, the extra depth is likely adding new reads.
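In case it helps frame the comparison, here is a minimal Python sketch of the subsampling ("saturation") idea behind analyses 1 and 3, run on the existing 5-million-read data instead of paying for a deeper run. All data and names below are made up, and the >10 RPKM cutoff is approximated by a raw read-count threshold, since RPKM would also require gene lengths and library size. Dedicated tools (e.g. the RSeQC modules for junction saturation and read duplication) do this properly on real BAM files.

```python
# Hypothetical sketch: estimate saturation by subsampling per-read gene
# assignments from an existing alignment. Names and data are invented.
import random
from collections import Counter

def genes_detected(read_gene_ids, min_reads=10):
    """Count genes with at least `min_reads` supporting reads
    (a crude stand-in for an expression cutoff such as >10 RPKM)."""
    counts = Counter(read_gene_ids)
    return sum(1 for c in counts.values() if c >= min_reads)

def saturation_curve(read_gene_ids, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Subsample the reads at several fractions and report how many
    genes pass the detection cutoff at each depth. A curve that is
    already flat near 1.0 suggests extra depth adds little."""
    rng = random.Random(seed)
    curve = []
    for f in fractions:
        n = int(len(read_gene_ids) * f)
        sample = rng.sample(read_gene_ids, n)
        curve.append((f, genes_detected(sample)))
    return curve

def duplicate_rate(alignment_starts):
    """Fraction of alignments sharing a start site with another one
    (the simple duplicate proxy used in the post)."""
    return 1 - len(set(alignment_starts)) / len(alignment_starts)

# Toy data standing in for per-read gene assignments from the 5M run:
# five highly expressed genes plus a long tail of single-read genes.
reads = ["gene_hi_%d" % (i % 5) for i in range(5000)] + \
        ["gene_lo_%d" % i for i in range(500)]
for frac, n_genes in saturation_curve(reads):
    print(frac, n_genes)
```

The same subsampling loop applies to analysis 2 by counting distinct exon-exon junctions instead of genes, and `duplicate_rate` can be compared between the single run and the combined runs for analysis 3.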
Feel free to comment or add your suggestions. Also, if there are good reviews on this topic, please post them here.
Just want to add that we are interested in splice junction discovery too.