5.2 years ago
Pappu
I am wondering if it is standard practice to include only the ~20k protein coding genes for DE and subsequent pathway analysis?
The human GENCODE annotation contains ~60k genes.
Those ~60,000 will include protein coding, non-coding RNAs (ncRNAs), pseudogenes, and other obscure transcripts, as you already know.
Anyway, it is and it is not standard practice to focus only on the protein coding genes. One reason that we do it is that the protein coding genes are better annotated and there is more literature on them. So, practically, it is just easier to interpret the results when focusing on protein coding genes.
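If you do decide to restrict to protein coding genes, here is a minimal sketch of one way to do it, assuming your annotation is a GENCODE-style GTF in which each `gene` record carries `gene_id` and `gene_type` attributes. The function names, the toy gene IDs, and the counts below are all hypothetical and only there for illustration; this is not a prescribed workflow.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gene_types(gtf_lines):
    """Map gene_id -> gene_type using only the 'gene' records of a GTF."""
    gene_types = {}
    for line in gtf_lines:
        # Skip blank lines and header/comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "gene_type" in attrs:
            gene_types[attrs["gene_id"]] = attrs["gene_type"]
    return gene_types


def filter_counts(counts, gene_types, keep=("protein_coding",)):
    """Keep only genes whose annotated type is in `keep`."""
    return {g: c for g, c in counts.items() if gene_types.get(g) in keep}


def toy_record(feature, gene_id, gene_type):
    """Build one toy GTF line (hypothetical coordinates and IDs)."""
    attrs = f'gene_id "{gene_id}"; gene_type "{gene_type}"; level 2;'
    return "\t".join(["chr1", "TOY", feature, "1", "1000", ".", "+", ".", attrs])


EXAMPLE_GTF = [
    "##description: toy annotation, not real GENCODE content",
    toy_record("gene", "GENE_A", "protein_coding"),
    toy_record("transcript", "GENE_A", "protein_coding"),  # ignored: not a gene record
    toy_record("gene", "GENE_B", "lncRNA"),
    toy_record("gene", "GENE_C", "processed_pseudogene"),
    toy_record("gene", "GENE_D", "protein_coding"),
]

gene_types = parse_gene_types(EXAMPLE_GTF)

# Toy raw counts for a single sample (made-up numbers)
counts = {"GENE_A": 120, "GENE_B": 15, "GENE_C": 3, "GENE_D": 7}
filtered = filter_counts(counts, gene_types)

print(sorted(filtered))
```

For a real annotation you would pass an open file handle instead of the toy list (for a gzipped GTF, `gzip.open(path, "rt")`), and apply the same filter to the rows of your count matrix before running DE. Passing a different `keep` tuple, e.g. one that also includes `lncRNA`, lets you retain the ncRNAs when a group asks for them.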
As an example: as you will see from my profile, I work with many different groups. I always ask people whether they want to focus on protein coding genes and/or ncRNAs. Some may say 'yes', that they would be excited to see the ncRNA results; however, when I send back the results, they (and I) are at a loss as to how to interpret them.
In the past, I have seen people use ncRNAs for, e.g., building networks and doing very focused analyses. For example, one guy at Imperial College London had found evidence of a novel ncRNA that appeared only in ER+ breast cancer, and began exploring that specific ncRNA further. It basically became detectable because the locus in question had elevated expression in cancer, i.e., a level of expression higher than that seen in normal tissue. So, it may just have been the result of 'transcriptional noise'.