Question

Will variance between the number of reads mapped in samples impact my differential analysis?

0

7.6 years ago

a.rex ▴ 350

I have mapped fastq files to my transcriptome of interest in Kallisto, and performed a differential expression analysis in Sleuth (control versus sample). Although I obtain a list of differentially expressed loci after performing a Wald test, a look at the read mapping metrics makes me unsure about the data. The mappings I obtain are as follows:

    sample            reads_mapped  reads_proc frac_mapped       bootstraps 
    control_replicate1  39333462    48943096    0.8037              100 
    control_replicate2  41571482    51237738    0.8113              100 
    sample_replicate1   15284000    21768028    0.7021              100 
    sample_replicate2   9515367     21270730    0.4473              100

As you can see, sample_replicate2 has considerably fewer reads mapped than the other sample replicate and the controls. Will this impact my differential analysis? Since the values are scaled in sleuth (i.e. sleuth-normalised TPM), will this be an issue?

RNA-Seq kallisto sleuth • 1.7k views

ADD COMMENT • link updated 7.6 years ago by Istvan Albert 102k • written 7.6 years ago by a.rex ▴ 350

score 1 · Answer 1 · 2017-05-03

1

7.6 years ago

Istvan Albert 102k

You don't need to worry unless you devise your own method.

Typical differential expression detection methods will account for differences in the number of reads across samples.
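To make "account for the differences" concrete, here is a minimal sketch of one widely used scheme, median-of-ratios size factors (DESeq-style). It is a generic illustration and not the code of any particular tool; the function name `size_factors` and the toy counts are made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (DESeq-style sketch).

    counts: dict mapping sample name -> list of per-gene counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Keep only genes with a positive count in every sample, so logs are defined.
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    # Size factor = median ratio of a sample's counts to the geometric means.
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g] for g in usable))
        for s in samples
    }


# Hypothetical toy data: "deep" was sequenced 4x deeper than "shallow",
# but the underlying expression profile is identical.
toy = {"deep": [400, 80, 1200, 40], "shallow": [100, 20, 300, 10]}
factors = size_factors(toy)
normalized = {s: [c / factors[s] for c in toy[s]] for s in toy}
print(factors)
print(normalized)
```

After dividing by the size factors, the two toy samples end up on the same scale even though their raw depths differ fourfold, which is the sense in which depth differences alone do not create spurious differential expression.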

ADD COMMENT • link 7.6 years ago by Istvan Albert 102k

0

Thanks for your reply, Istvan. Typically, though, I have heard that RNA-seq needs at least 20 million reads per replicate. Is this true, and do you know what it is based on? Or can I get away with the above figures?

ADD REPLY • link 7.6 years ago by a.rex ▴ 350

1

As Albert mentioned, the software accounts for this factor, for example by normalizing between samples. However, the biological results from samples with few reads are less convincing. For example, samples with few reads show more variation by chance, which results in lower significance. More often, though, genes in these samples will appear to have lower expression than in samples with many reads: in the extreme case where a gene has no reads at all, normalization cannot help. Any scaling factor multiplied by zero is still zero.
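A rough sketch of both points, assuming read counts for a transcript behave approximately like Poisson draws (a simplification; real RNA-seq data are usually overdispersed). The depths are the reads_mapped values from the table in the question; the transcript fraction of one in a million is a hypothetical value chosen for illustration.

```python
import math


def expected_count(fraction, depth):
    """Expected reads for a transcript making up `fraction` of mapped reads."""
    return fraction * depth


def poisson_cv(mean_count):
    """Coefficient of variation (sd / mean) of a Poisson count."""
    return 1.0 / math.sqrt(mean_count)


# reads_mapped for control_replicate1 and sample_replicate2 (from the question)
deep, shallow = 39_333_462, 9_515_367
fraction = 1e-6  # hypothetical low-abundance transcript

for name, depth in (("deep", deep), ("shallow", shallow)):
    mu = expected_count(fraction, depth)
    print(f"{name}: expected count {mu:.1f}, "
          f"relative noise {poisson_cv(mu):.2f}, "
          f"chance of zero reads {math.exp(-mu):.2e}")

# Scaling up the shallow sample cannot recover a transcript with no reads.
scale = deep / shallow
rescaled_zero = 0 * scale
print(rescaled_zero)
```

Under this assumption the same transcript has roughly twice the relative sampling noise in the shallow sample and a much higher chance of being missed entirely, and no scaling factor turns an observed zero into a nonzero estimate.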

ADD REPLY • link 7.6 years ago by huwenhuo ▴ 40