Hi everyone!
I'm new here, but thought I'd ask for some advice. I'm a college student working on a bioinformatics project involving time course (TC) differential expression (DE) analysis of an RNA-seq dataset. By simulating TC RNA-seq data while controlling the number of differentially expressed genes (DEGs), the direction of DE, and so on, I hope to benchmark combinations of normalization methods and TC DE tools before applying them to our actual dataset.
I'm using Polyester to simulate the read count data. However, given the unique design of our experiment (10 time points, with 3 biological replicates per time point), I can't use the standard simulate_experiment() function. Instead, I used the simulate_experiment_countmat() function, which gives greater flexibility and lets me specify DE at individual time points.
Here is where I run into trouble: the resulting count matrix poorly reflects real RNA-seq data, because non-DE genes have constant read counts over time. I want to add noise to the read counts but am unsure how much to add. Should I obtain gene-wise dispersion estimates (using DESeq2) and then calculate the variance from the formula for the variance of a negative binomial (NB) distribution? Or is there a better way to go about this?
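To make the idea concrete, here is a rough sketch of what I have in mind. It is written in Python/NumPy purely for illustration and does not call Polyester or DESeq2 (both are R packages). It assumes the NB parameterization used by DESeq2, Var = mu + alpha * mu^2, where mu is the expected count and alpha is the gene-wise dispersion. The function name add_nb_noise and all numbers are made up for the example.

```python
import numpy as np

def add_nb_noise(mean_counts, dispersion, seed=0):
    """Draw noisy counts around a matrix of expected counts.

    mean_counts: genes x samples array of expected counts (mu > 0).
    dispersion:  one dispersion value (alpha > 0) per gene.
    Assumes Var = mu + alpha * mu**2 (hypothetical helper, not a Polyester API).
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mean_counts, dtype=float)
    alpha = np.asarray(dispersion, dtype=float)[:, None]  # broadcast over samples

    # NumPy's negative_binomial takes (n, p); with n = 1/alpha and
    # p = n / (n + mu) the draws have mean mu and variance mu + alpha * mu**2.
    n = 1.0 / alpha
    p = n / (n + mu)
    return rng.negative_binomial(n, p)

# Two non-DE genes, 10 time points x 3 replicates = 30 samples,
# each with a constant expected count over time.
expected = np.tile(np.array([[50.0], [500.0]]), (1, 30))
noisy = add_nb_noise(expected, dispersion=[0.2, 0.05])
print(noisy.shape)
```

In this sketch the noise level is controlled entirely by the per-gene dispersion, so the open question is still where those dispersion values should come from.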
I would greatly appreciate any feedback from those with experience in Polyester or TC RNA-seq analysis!
Many thanks,
Ethan