Dear all, we have RNA-seq data for 7 timepoints from one species, and this species has a reference genome. I have found papers that report the number of expressed genes at each timepoint, but they use different methods. 1. Some use a raw read count cutoff: "an ad hoc cutoff for detectable expression was set at >=2 reads per transcript. Using this cutoff, 11187 gene transcripts could be detected in the RNA-Seq data set." 2. Others use a cutoff on normalized values: "we counted the number of clean reads aligned to litchi gene sequences and performed normalization using the RPKM method. After lowly expressed genes (< 5 RPKM) were filtered, we identified 17572 genes in all samples."
So one method applies a cutoff to the read counts directly, and the other works on normalized data. My first question is: which one should I use? (My species is channel catfish.) My second question is: both methods remove the lowly expressed genes, right? But for counting the expressed genes at different timepoints, I think there is no need to filter out the lowly expressed genes, since these genes are also expressed.
I would appreciate any help with this problem.
Thank you!!
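To make the difference between the two quoted cutoffs concrete, here is a minimal Python sketch. The gene names, counts, and lengths are invented toy values, and the helper names (rpkm, detected_by_count, detected_by_rpkm) are hypothetical; only the thresholds (>=2 reads, >=5 RPKM) come from the quotes above. RPKM is computed as reads * 1e9 / (gene length in bp * total mapped reads).

```python
# Toy data (invented for illustration): raw read counts and gene lengths in bp.
counts = {"geneA": 2, "geneB": 50, "geneC": 1, "geneD": 999947}
lengths = {"geneA": 5000, "geneB": 1000, "geneC": 1200, "geneD": 2000}


def rpkm(count, length_bp, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_bp * total_reads)


def detected_by_count(counts, min_reads=2):
    """Genes with at least min_reads raw reads (first quoted method)."""
    return {gene for gene, c in counts.items() if c >= min_reads}


def detected_by_rpkm(counts, lengths, min_rpkm=5.0):
    """Genes with RPKM of at least min_rpkm (second quoted method)."""
    total = sum(counts.values())  # total mapped reads in this sample
    return {
        gene
        for gene, c in counts.items()
        if rpkm(c, lengths[gene], total) >= min_rpkm
    }


by_count = detected_by_count(counts)
by_rpkm = detected_by_rpkm(counts, lengths)
print("raw-count cutoff:", sorted(by_count))
print("RPKM cutoff:     ", sorted(by_rpkm))
print("only by count:   ", sorted(by_count - by_rpkm))
```

In this toy sample geneA passes the raw-count cutoff with 2 reads but has an RPKM of only 0.4, so the two methods disagree about whether it is "expressed". The number of detected genes therefore depends on the cutoff chosen and on sequencing depth, not only on biology.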
It depends on what you want to do. I do not think that RNA-seq alone allows a confident statement about which genes are truly and reliably expressed, especially when it comes to drawing a border between lowly expressed genes and transcriptional "background noise". If your aim is statistical (DEG) analysis, I would focus on those genes that have sufficient statistical power to come out as significantly different at the given replicate number and read depth. This could be done using the filterByExpr function from edgeR. Note that longer genes will be favored, since no length correction is performed by default.
Personal opinion in this context: in the end, I think that most high-throughput experiments are only truly informative in a comparative analysis where one has sufficient replicates, the replicates have been treated and processed identically, and the analysis is statistically valid. Without that, any changes one might see can be due to different read lengths, library prep technologies, signal/noise ratio (= data quality), in silico processing, etc. So with NGS experiments it is better to make relative (comparative) statements than absolute ones such as "exactly this number of genes is expressed in this sample".
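For readers who want to see the idea behind such a filter, below is a simplified Python sketch of a CPM-based expression filter. It is not edgeR's implementation: the function name keep_genes, the toy counts, and the thresholds (min_count=10, min_total_count=15) are assumptions chosen for illustration. The rule sketched here is: keep a gene if its counts-per-million value reaches a cutoff in at least as many samples as the smallest group contains, and if its total count across samples is not tiny.

```python
from statistics import median


def keep_genes(counts, groups, min_count=10, min_total_count=15):
    """Return genes passing a simplified CPM-based filter.

    counts: dict mapping gene -> list of raw counts, one per sample
    groups: list of group labels (e.g. timepoints), one per sample
    Thresholds are illustrative assumptions, not taken from any library.
    """
    n_samples = len(groups)
    # Library size = total reads counted in each sample.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    # A gene only needs to be expressed in the smallest group to be kept.
    min_group_size = min(groups.count(label) for label in set(groups))
    # Translate the raw-count threshold into a CPM threshold.
    cpm_cutoff = min_count / median(lib_sizes) * 1e6

    kept = []
    for gene, row in counts.items():
        cpms = [c / lib * 1e6 for c, lib in zip(row, lib_sizes)]
        n_above = sum(cpm >= cpm_cutoff for cpm in cpms)
        if n_above >= min_group_size and sum(row) >= min_total_count:
            kept.append(gene)
    return kept


# Toy data (invented): 4 samples, 2 timepoints with 2 replicates each.
groups = ["t1", "t1", "t2", "t2"]
toy_counts = {
    "high": [500, 520, 480, 510],
    "low": [1, 0, 2, 1],
    "one_group": [0, 0, 30, 40],
    "filler": [999499, 999480, 999488, 999449],  # pads each library to 1e6 reads
}
print(keep_genes(toy_counts, groups))
```

Note that the gene expressed only at t2 is kept while the gene with one or two reads per sample is dropped. Gene length never enters the calculation, which is why, at equal expression, longer genes collect more reads and pass more easily.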