I have mapped fastq files to my transcriptome of interest in Kallisto, and performed a differential expression analysis in Sleuth (control versus sample). Although I obtain a list of differentially expressed loci after performing a Wald test, a look at the read mapping metrics makes me unsure about the data. The mappings I obtain are as follows:
sample              reads_mapped  reads_proc  frac_mapped  bootstraps
control_replicate1  39333462      48943096    0.8037       100
control_replicate2  41571482      51237738    0.8113       100
sample_replicate1   15284000      21768028    0.7021       100
sample_replicate2   9515367       21270730    0.4473       100
As you can see, sample_replicate2 has considerably fewer mapped reads than the other sample and the controls. Will this impact my differential analysis? Since the values are scaled in sleuth (i.e. sleuth-normalised TPM), will this be an issue?
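To put numbers on the difference, the frac_mapped column and the depth of each library relative to the deepest one can be recomputed from the table. This is only a minimal Python sketch: the dictionary and function names are my own and are not part of kallisto or sleuth output.

```python
# Mapping statistics copied from the table above: (reads_mapped, reads_proc).
stats = {
    "control_replicate1": (39333462, 48943096),
    "control_replicate2": (41571482, 51237738),
    "sample_replicate1": (15284000, 21768028),
    "sample_replicate2": (9515367, 21270730),
}


def frac_mapped(mapped, processed):
    """Fraction of processed reads that were mapped."""
    return mapped / processed


# Fraction mapped per library, recomputed from the raw counts.
fractions = {name: frac_mapped(m, p) for name, (m, p) in stats.items()}

# Mapped depth of each library relative to the deepest library.
max_mapped = max(m for m, _ in stats.values())
relative_depth = {name: m / max_mapped for name, (m, _) in stats.items()}

for name in stats:
    print(f"{name}\t{fractions[name]:.4f}\t{relative_depth[name]:.2f}")
```

The relative depth is what any between-sample normalisation has to compensate for, so it is a useful figure to have next to the mapping fraction.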
Thanks for your reply, Istvan. Typically, though, I have heard that for RNA-seq you need at least 20 million reads for each replicate. Is this true, and do you know what it is based on? Or can I get away with the above figures?
As mentioned by Albert, the software accounts for this through normalization between samples. However, the biological results from samples with few reads are less convincing. For example, samples with few reads show larger variation by chance, which results in lower significance. More often, though, genes in these samples will appear to have lower expression than in samples with many reads. In the extreme case where a gene has no reads at all, normalization cannot help: any scaling factor multiplied by zero is still zero.
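A rough way to see both effects is to treat the read count of one transcript as Poisson distributed. That is a simplifying assumption of mine, not a description of what kallisto or sleuth actually model, and the transcript abundance below is hypothetical. Under that assumption the chance of seeing zero reads is exp(-lambda) and the relative spread is 1/sqrt(lambda), where lambda is the expected count.

```python
import math

# Mapped reads per library, copied from the question.
mapped_reads = {
    "control_replicate1": 39333462,
    "control_replicate2": 41571482,
    "sample_replicate1": 15284000,
    "sample_replicate2": 9515367,
}

# Hypothetical low-abundance transcript: 1 in 10 million mapped reads.
transcript_fraction = 1e-7


def expected_count(depth, fraction):
    """Expected number of reads for the transcript at a given depth."""
    return depth * fraction


def prob_zero(lam):
    """Poisson probability of observing zero reads when lam are expected."""
    return math.exp(-lam)


def coeff_variation(lam):
    """Poisson coefficient of variation (standard deviation / mean)."""
    return 1.0 / math.sqrt(lam)


for name, depth in mapped_reads.items():
    lam = expected_count(depth, transcript_fraction)
    print(f"{name}\tlambda={lam:.2f}\tP(0)={prob_zero(lam):.3f}\t"
          f"CV={coeff_variation(lam):.2f}")

# Scaling the shallow library up to the deepest one cannot rescue a zero.
scale = max(mapped_reads.values()) / mapped_reads["sample_replicate2"]
zero_after_scaling = 0 * scale
```

Under these assumptions the shallowest library is much more likely to miss the transcript entirely, and once the count is zero, the scaling factor leaves it at zero.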