6.7 years ago
qwzhang0601
We are studying repeat elements in mouse, comparing samples between 2 conditions with 3 replicates each. We plan to do RNA-seq covering all coding genes and repeat elements, using 150 bp paired-end reads, but we are not sure how many reads are enough for the differential analysis. Does anyone have experience with this and can offer some suggestions?
Many thanks.
Here is a link for general recommendations about number of reads for RNAseq.
While that will work fine for normal genes, repeat elements are going to be inherently problematic. Longer reads (150 bp) allow more specific alignments, but you would still have the problem of how to handle multi-mappers. Remember that most counting programs will not count multi-mapped reads by default; you will need to force them to do so.
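To make the multi-mapper issue concrete, here is a minimal Python sketch comparing "unique only" counting with one possible alternative, splitting each read equally across all of its hits. The read names, repeat copies, family labels, and the `count_by_family` helper are all made up for illustration; this is not how any particular counting program is implemented.

```python
from collections import defaultdict

# Hypothetical toy alignments: read name -> repeat copies it aligns to.
alignments = {
    "read1": ["L1Md_A_copy1"],                    # uniquely mapped
    "read2": ["L1Md_A_copy1", "L1Md_A_copy2"],    # multi-mapper within one family
    "read3": ["L1Md_A_copy2", "B1_Mus1_copy1"],   # multi-mapper across two families
    "read4": ["B1_Mus1_copy1"],                   # uniquely mapped
}

# Hypothetical annotation: repeat copy -> family it belongs to.
family_of = {
    "L1Md_A_copy1": "L1Md_A",
    "L1Md_A_copy2": "L1Md_A",
    "B1_Mus1_copy1": "B1_Mus1",
}


def count_by_family(alignments, family_of, mode="unique"):
    """Aggregate read counts per repeat family.

    mode="unique":     only reads with exactly one hit are counted.
    mode="fractional": each read contributes 1/n to each of its n hits.
    """
    counts = defaultdict(float)
    for hits in alignments.values():
        if mode == "unique":
            if len(hits) == 1:
                counts[family_of[hits[0]]] += 1.0
        elif mode == "fractional":
            weight = 1.0 / len(hits)
            for hit in hits:
                counts[family_of[hit]] += weight
        else:
            raise ValueError(f"unknown mode: {mode}")
    return dict(counts)


unique_counts = count_by_family(alignments, family_of, mode="unique")
fractional_counts = count_by_family(alignments, family_of, mode="fractional")
```

In this toy case, unique-only counting keeps 2 of the 4 reads, while fractional counting keeps the total at 4. Note that read2 contributes a full count to its family even though its exact copy is ambiguous, which is part of why aggregating over families is more forgiving than counting individual elements.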
There are a couple of pipelines to analyse TE / repeat regions, like TEToolkit or SalmonTE. I guess the more problematic issue is the annotation, and both pipelines include custom annotation - at least for human.
The number of reads will depend on whether you want to perform the analysis on individual repeat elements - analogous to transcript expression - or to aggregate counts over families of repeats - analogous to gene expression. For regular RNAseq analyses, the recommendation for differential gene expression is 10-20 million reads per sample; for differential transcript expression, I think the recommendation is 50 million reads. I guess one would need to increase the read count per sample for repeats, as most reads will map to regular genes. Maybe the papers describing TEToolkit or SalmonTE have more precise recommendations.
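As a back-of-the-envelope sketch of that last point: if only a fraction of the reads land in repeats, the total depth needed to reach a given number of repeat-derived reads scales as target / fraction. The function name and the numbers used with it are assumptions for illustration, not measured values or a published recommendation.

```python
import math


def total_reads_needed(target_repeat_reads, repeat_fraction):
    """Total reads per sample so that roughly `target_repeat_reads`
    of them come from repeats, given the (assumed) fraction of reads
    that map to repeat elements."""
    if not 0 < repeat_fraction <= 1:
        raise ValueError("repeat_fraction must be in (0, 1]")
    return math.ceil(target_repeat_reads / repeat_fraction)


# Illustration only: aim for 20 million repeat-derived reads and assume,
# hypothetically, that a quarter of all reads map to repeats.
depth = total_reads_needed(20_000_000, 0.25)
```

The real repeat fraction depends on the library prep and the tissue, so it is best estimated from a pilot run or an existing dataset before fixing the sequencing depth.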
Keep in mind that increasing the number of biological replicates also increases the power of the statistical analysis, in general more than increasing sequencing depth per sample - provided you don't go below the above guidelines for reads per sample.
Thank you for your suggestion. I plan to use TEtranscripts, which aggregates counts over sub-families.