I completed a differential gene expression analysis and have close to 3000 genes that pass all filters (adjusted p-value and log2FC). I want to run some downstream analysis (GSEA or something else) on these genes, so I have tried annotating them with both biomaRt and org.Hs.eg.db. In both cases, the genes that cannot be annotated are described as "novel transcript", "pseudogene", or "antisense". About 700 of the 3000 genes have these descriptions, and I am wondering if there is any way to resolve this. Using biomaRt increased the number of annotated genes, but many still cannot be annotated. Should I throw away these genes to make downstream analysis easier? What if I throw away something important? Is this too many genes to throw away? I am stuck because I cannot find a way to recover annotation for any more genes. I am using the correct reference genome (GRCh38.p13) and my data uses Ensembl IDs, so biomaRt should give me the most annotations, but it does not. Attached is a photo of the descriptions of some of the genes that cannot be annotated. What should I do?
I think it makes little difference whether you toss them or not. Enrichment analysis is built around known pathways, and almost all genes in known pathway databases (such as Reactome or KEGG) are protein-coding, so you won't get much meaning out of these "exotic" gene types, such as pseudogenes and antisense transcripts, either way. Often one simply has no idea what they do.
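If it helps, the split can be made explicit instead of deleting anything. Below is a minimal Python sketch, assuming you have already exported a table of gene IDs with a biotype column (for example the `gene_biotype` attribute from Ensembl). The IDs are placeholders, not real genes, and `split_by_biotype` is a hypothetical helper name, not part of any package.

```python
from collections import Counter

# Placeholder annotation table: (gene ID, biotype) pairs.
# The IDs are made up for illustration; the biotype strings follow
# Ensembl-style naming, which is an assumption about your export.
annotated = [
    ("ENSG_PLACEHOLDER_01", "protein_coding"),
    ("ENSG_PLACEHOLDER_02", "processed_pseudogene"),
    ("ENSG_PLACEHOLDER_03", "lncRNA"),
    ("ENSG_PLACEHOLDER_04", "protein_coding"),
    ("ENSG_PLACEHOLDER_05", "unprocessed_pseudogene"),
]


def split_by_biotype(rows, keep=("protein_coding",)):
    """Split gene IDs into those used for enrichment and those set aside."""
    kept, set_aside = [], []
    for gene_id, biotype in rows:
        if biotype in keep:
            kept.append(gene_id)
        else:
            set_aside.append(gene_id)
    return kept, set_aside


# Tally the biotypes first, to see what would be left out.
biotype_counts = Counter(biotype for _, biotype in annotated)
for_enrichment, set_aside = split_by_biotype(annotated)

print(biotype_counts)
print(len(for_enrichment), "genes for enrichment;", len(set_aside), "set aside")
```

The genes in `set_aside` are not discarded. They stay in your differential expression results and can be looked at individually later, they are just left out of the pathway enrichment input.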