Dear all,
I'm working on an RNA-seq analysis of case and control samples and have identified some differentially expressed coding genes and long non-coding RNAs (lncRNAs) with edgeR. I would like to do an integrative lncRNA-mRNA analysis. The library size of the cases is small (about 2 million raw counts) compared to the controls (about 60 million raw counts), so I filtered out genes with a CPM value of less than 5 during the edgeR analysis. Given that lncRNAs usually have low expression values, I'm concerned about the CPM threshold, as some lncRNAs may be missed in the analysis. Could you please share your thoughts on the analysis?
Thanks a lot
Many thanks, Kevin, for your help as always. Actually, I'm not concerned about the large difference in library size. My issue is the CPM cutoff for the analysis. As far as I know, genes with a read count of less than 10 should usually be removed, which is equivalent to a CPM of 0.5 for a library of 20 million reads. As I mentioned in the post, the library size of my patient samples is about 2 million reads, so I am forced to set a high CPM cutoff (5) to filter out the low counts (less than 10). However, many lncRNAs may then be lost from the analysis, and that is my problem. Could you please let me know if you have any suggestions?
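For reference, the count-to-CPM arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `count_to_cpm` and `cpm_to_count` are made up for illustration and are not part of edgeR, and the library sizes are the approximate figures quoted in this thread.

```python
def count_to_cpm(count, lib_size):
    """Counts per million for a raw count in a library of lib_size reads."""
    return count * 1e6 / lib_size


def cpm_to_count(cpm, lib_size):
    """Raw count that corresponds to a given CPM in a library of lib_size reads."""
    return cpm * lib_size / 1e6


# A raw count of 10 in libraries of different depth (approximate sizes from the thread)
for lib_size in (20_000_000, 2_000_000, 60_000_000):
    print(lib_size, round(count_to_cpm(10, lib_size), 3))

# The reverse question: how many reads does a CPM of 1 mean at 2 million reads?
print(cpm_to_count(1, 2_000_000))
```

So a count of 10 is a CPM of 0.5 at 20 million reads and a CPM of 5 at 2 million reads, while a CPM of 1 at 2 million reads corresponds to only 2 reads.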
Apart from trying different cut-offs, I have no more suggestions.
Thanks. Sorry, would you suggest a CPM cutoff of 1 for a library size of 2 million reads? Also, could you please let me know what I should look for in the output of the different cutoffs?
I still don't know how you plan to integrate these datasets, which is important to understand, so I am limited in how I can advise on specifics. There is no right or wrong here: you can apply the same cut-off for both, or use different cut-offs. Then proceed with your analysis, with the view that you can always go back and modify certain parameters. Having many low-expressed genes in your dataset will affect things like p-value correction, the amount of RAM required, fold-change calculations, PCA, clustering, etc.
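To illustrate the point about p-value correction, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. The p-values are invented: three small "signal" p-values plus evenly spaced filler p-values that stand in for uninformative low-expressed genes. That is a simplifying assumption, not a model of real data, and `bh_adjust` is a hypothetical helper rather than the edgeR implementation.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


signal = [0.001, 0.004, 0.01]  # invented p-values for three genes of interest


def with_filler(n_filler):
    """Append n_filler evenly spaced p-values in (0, 1) to the signal p-values."""
    return signal + [(k + 0.5) / n_filler for k in range(n_filler)]


adj_small = bh_adjust(with_filler(7))    # 10 tests in total
adj_large = bh_adjust(with_filler(97))   # 100 tests in total
print(adj_small[0], adj_large[0])
```

In this toy input the smallest p-value (0.001) adjusts to about 0.01 among 10 tests and to about 0.1 among 100 tests, so the same raw evidence looks weaker when many more genes are kept.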
You just have to make an 'executive' decision with your own project, and then move forward with your analysis. Again, you can always later go back to modify things.
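One simple thing to look at when trying different cut-offs is how many genes of each biotype survive the filter. The sketch below does this in plain Python on a tiny invented count table. The gene names, counts, and the rule "CPM at or above the cut-off in at least 2 samples" are all assumptions for illustration, and the library sizes are just the approximate figures from the question, not the column sums of the toy table.

```python
# Toy data, invented for illustration only
lib_sizes = [2_000_000, 2_000_000, 60_000_000, 60_000_000]  # case1, case2, ctrl1, ctrl2
genes = {
    # name: (biotype, raw counts per sample)
    "mRNA_A": ("mRNA", [200, 180, 6000, 5500]),
    "mRNA_B": ("mRNA", [12, 9, 330, 280]),
    "lnc_A": ("lncRNA", [6, 8, 150, 200]),
    "lnc_B": ("lncRNA", [2, 3, 70, 65]),
    "lnc_C": ("lncRNA", [0, 1, 10, 12]),
}


def kept_by_biotype(genes, lib_sizes, cutoff, min_samples=2):
    """Count genes per biotype whose CPM is >= cutoff in at least min_samples samples."""
    kept = {"mRNA": 0, "lncRNA": 0}
    for biotype, counts in genes.values():
        cpms = [c * 1e6 / n for c, n in zip(counts, lib_sizes)]
        n_pass = sum(1 for value in cpms if value >= cutoff)
        if n_pass >= min_samples:
            kept[biotype] += 1
    return kept


for cutoff in (5, 1, 0.1):
    print(cutoff, kept_by_biotype(genes, lib_sizes, cutoff))
```

With these made-up numbers, a cut-off of 5 keeps both mRNAs but no lncRNAs, a cut-off of 1 keeps two of the three lncRNAs, and a cut-off of 0.1 keeps all of them. The same kind of tally on the real count table would show how many lncRNAs each cut-off costs.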
My goal is to do an integrative lncRNA-mRNA analysis to find the lncRNAs and their target genes that are related to a given disease, as well as to understand the regulatory role of these lncRNAs. Yes, Kevin, I usually go ahead with the analysis and then go back and redo it. However, consulting experienced people like you is always valuable to me.
Oh I know, but how you do that integration is important. Anyway, feel free to ask more questions!