I am trying to analyze human endogenous retrovirus RNA-seq data between two groups. I am running a differential gene expression analysis to see which genes are upregulated and downregulated in control vs. disease. I am having a hard time getting any significant differences, since many genes have very low or no expression. I found this paper https://www.frontiersin.org/journals/aging-neuroscience/articles/10.3389/fnagi.2023.1186470/full#B65 where they first filter out lowly expressed genes. I just want to make sure I am not manipulating the data. How should I filter the data (e.g. filter on the average raw gene count, on the mean, etc.)?
Note that if you are using DESeq2, prefiltering is not required for statistical purposes; it mainly helps with memory efficiency and computation speed.
It's important to understand why you aren't getting significant differential expression results. Have you visualised the data using basic approaches like PCA? Do you see your samples separating along the components according to your two groups? Have you taken the top 50 genes by logFC and plotted a heatmap to see how expression differs across the groups? Do you know what kind of expression differences you could expect, for example what are the typical differences reported for endogenous retroviruses in other groups?
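As a rough sketch of the "top 50 genes by logFC" step, assuming you have already exported a gene-to-log2 fold change table from your results: rank genes by absolute fold change so that both up- and downregulated genes are included, then use that list to subset the matrix you plot as a heatmap. The function name `top_genes_by_lfc` and the values below are hypothetical, and `None` stands in for genes where no fold change could be estimated.

```python
def top_genes_by_lfc(log2fc, n=50):
    """Return the n gene names with the largest absolute log2 fold change."""
    # Drop genes with a missing fold change (represented here as None).
    usable = {gene: lfc for gene, lfc in log2fc.items() if lfc is not None}
    # Sort by magnitude so strongly down- and upregulated genes both rank high.
    ranked = sorted(usable, key=lambda gene: abs(usable[gene]), reverse=True)
    return ranked[:n]


# Hypothetical log2 fold changes (disease vs. control).
log2fc = {"HERV_A": 0.2, "HERV_B": -3.1, "HERV_C": 1.4, "HERV_D": None}
top = top_genes_by_lfc(log2fc, n=2)
print(top)  # ['HERV_B', 'HERV_C']
```

Keep in mind that very lowly expressed genes can show large but noisy fold changes, so it may be worth ranking only the genes that survive your expression filter.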