Question

How To Handle Replicates With Huge Differences In Number Of Reads?

2

13.3 years ago

Rayna ▴ 280

Hey everyone,

I have a question about how to handle Illumina single-end RNA-seq data. I have a few samples, each with 2 biological replicates. Nothing extraordinary so far, except for one sample: one of its replicates yielded a low number of reads (~13 million), whereas the other libraries average about double that (27-30 million). So GATC reran this particular sample, but in a very weird way (my guess is, alone in one lane), which yielded 150 million reads...

This is of course far too many and thus introduces a discrepancy in the data. I am running out of ideas for how to handle it, so I'd really appreciate it if someone could help.

Thanks a lot in advance :)

rna-seq next-gen analysis • 5.2k views

ADD COMMENT • link updated 13.0 years ago by Rm 8.3k • written 13.3 years ago by Rayna ▴ 280

0

Why not work with 20% of your new data? Assuming that there is no specific bias in your new run, it doesn't seem like a bad idea.

ADD REPLY • link 13.3 years ago by Leonor Palmeira 3.9k

0

Strong assumption! ;-)

ADD REPLY • link 13.3 years ago by Manu Prestat 4.1k

0

Thanks for your suggestion. Hmm, if you pick them at random, this should be fairly acceptable. I thought of picking ~15 million at random, but it sounded very arbitrary...

ADD REPLY • link 13.3 years ago by Rayna ▴ 280

score 3 · Answer 1 · 2012-04-27

3

13.3 years ago

Stefano Berri 4.4k

I am not an expert on RNA-seq, but I guess you should normalise the data so that the signal takes into account the overall number of reads (as RPKM does). The difference will be in the internal (technical) variation: the more reads, the more accurate the estimate. However, it is likely that most of the variation comes from the biological replicates, independent of the number of reads you have. You could randomly select a percentage of the reads, but that would mean losing information and precious data. You could do it to check that the results are consistent, but it should not be a major problem.
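To make the RPKM idea concrete, here is a minimal sketch of the calculation (reads per kilobase of transcript per million mapped reads). The function name `rpkm`, the 2,000 bp gene length, and the counts are hypothetical values chosen for illustration, not data from this thread.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # count / (length in kb) / (library size in millions)
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene of 2,000 bp seen in a 30 M-read and a 150 M-read library.
# The deeper library has 5x the raw count for the same underlying expression.
shallow = rpkm(count=300, gene_length_bp=2000, total_mapped_reads=30_000_000)
deep = rpkm(count=1500, gene_length_bp=2000, total_mapped_reads=150_000_000)
print(shallow, deep)
```

Both libraries give the same RPKM here because the raw count scales with sequencing depth. The precision of the estimate still differs, which is the technical variation mentioned above.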

What is the "discrepancy" in the data?

ADD COMMENT • link 13.3 years ago by Stefano Berri 4.4k

0

Thanks for your suggestions.

By discrepancy, I meant the fact that I can have up to 5 times more reads for one of the replicates (150 million) compared to the other (~30 million).

RPKM gives an estimate of the quantities, but if you want to use DESeq/edgeR, you need the raw counts, not the RPKM. Hence my question, since I use DESeq for the analysis.

ADD REPLY • link 13.3 years ago by Rayna ▴ 280

score 3 · Answer 2 · 2012-04-27

3

13.3 years ago

HMZ Gheidan ▴ 30

In principle, negative-binomial-based tests such as those in DESeq and edgeR should be able to deal with even vastly different library sizes. Of course, the huge sample will not add much power, because you have so many reads on only one side of the comparison.
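As an illustration of why differing library sizes are manageable, below is a rough sketch of the median-of-ratios idea that DESeq uses to estimate per-sample size factors from raw counts. It is a simplified stand-in, not the package's code; the function name `size_factors` and the toy count table are assumptions made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of per-gene rows; each row holds raw counts, one per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # the geometric mean is zero, so the gene is skipped
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    return [math.exp(median(r)) for r in log_ratios]


# Toy table: two replicates, the second sequenced 5x deeper (hypothetical).
counts = [
    [100, 500],
    [20, 100],
    [300, 1500],
    [0, 4],      # skipped: zero count in one sample
    [50, 250],
]
factors = size_factors(counts)
# Dividing by the size factor puts both replicates on a common scale.
normalised = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
```

In this toy table the second factor comes out five times the first, so the depth difference is absorbed by the scaling rather than by discarding reads. The raw counts are still what goes into the test itself.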

If you do a binomial test, it does not matter how you deal with it because the result will be wrong anyway. (See the numerous earlier threads on why Poisson-based tests, and that includes the binomial test, are inadmissible because they ignore biological variability.)

ADD COMMENT • link 13.3 years ago by HMZ Gheidan ▴ 30

0

This is something I hadn't thought about, thanks! So, using DESeq should be ok :) (even without downsampling as proposed by Rm below)

ADD REPLY • link 13.3 years ago by Rayna ▴ 280

score 2 · Answer 3 · 2012-04-27

2

13.3 years ago

Rm 8.3k

DESeq and edgeR should take care of it. If not, try downsampling your FASTQ files.
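If downsampling turns out to be needed, one possible way to do it is sketched below. It assumes uncompressed, single-end FASTQ with exactly four lines per record; the function name `downsample_fastq`, the 20% fraction, and the in-memory demo reads are illustrative assumptions.

```python
import random
from itertools import islice


def downsample_fastq(lines, fraction, seed=0):
    """Yield each 4-line FASTQ record with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            break
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        if rng.random() < fraction:
            yield record


# Demo on 1,000 made-up reads held in memory instead of a real file.
demo = []
for i in range(1000):
    demo += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]

kept = list(downsample_fastq(demo, fraction=0.2, seed=42))
print(len(kept), "of 1000 reads kept")
```

An open file handle can be passed as `lines` in place of the demo list. Because each read is kept independently, the output size is only approximately 20% of the input. If an exact number of reads is required, reservoir sampling would be the usual alternative.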

ADD COMMENT • link 13.3 years ago by Rm 8.3k