I've performed RNA-seq on 30 cell lines and am trying to determine whether oncogenes are enriched among genes that are highly expressed (>50 RPKM) in more than 15 cell lines. Of the ~20,000 annotated mRNAs, only ~10,000 are expressed (RPKM > 1) in at least one cell line. For my Fisher's exact test, I will build a 2x2 contingency table that classifies each gene as highly expressed or not and as an oncogene or not, using all detectable genes as the background.
My question is: should I consider only detectable genes (and detectable oncogenes) when I perform my Fisher's exact test, or should I consider all annotated genes?
I think I should only consider genes that are detectable in one or more cell lines, and subset the list of oncogenes accordingly. It seems unfair to look for an enrichment among the 20,000 annotated genes when only half of them are actually expressed. Or am I overthinking this problem?
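To make the two options concrete, here is a minimal sketch of the comparison I have in mind, written in plain Python as a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution). The function name `enrichment_p` and every count other than the ~10,000 / ~20,000 background sizes are made up for illustration; they are not my real data.

```python
from math import comb

def enrichment_p(overlap, n_high, n_onco, n_background):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= overlap), where X is the number of oncogenes among
    n_high genes drawn from a background of n_background genes that
    contains n_onco oncogenes (hypergeometric upper tail).
    """
    total = comb(n_background, n_high)
    max_k = min(n_high, n_onco)
    tail = sum(
        comb(n_onco, k) * comb(n_background - n_onco, n_high - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

# Hypothetical counts, for illustration only.
n_high = 800    # genes with >50 RPKM in more than 15 cell lines
overlap = 40    # highly expressed genes that are also oncogenes

# Option 1: background = detectable genes, oncogene list subset to detectable ones.
p_detectable = enrichment_p(overlap, n_high, n_onco=300, n_background=10_000)

# Option 2: background = all annotated genes, full oncogene list.
p_all = enrichment_p(overlap, n_high, n_onco=400, n_background=20_000)

print(f"detectable background: p = {p_detectable:.3g}")
print(f"annotated background:  p = {p_all:.3g}")
```

With these made-up counts, the expected overlap is 24 genes against the detectable background but only 16 against the full annotation, so the same observed overlap looks more significant when undetectable genes are included. That is the inflation I am worried about.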
Thank you!
This looks perfect, thank you!