Hi everyone, I have a list of differentially expressed lncRNAs (from RNA-seq) and I would like to run GO and KEGG pathway analysis on them to find out which GO terms and pathways are most upregulated/downregulated in my condition vs. control. The only identifier available for all of the lncRNAs is the ENSEMBL gene ID. Other identifiers such as SYMBOL or ENTREZID can be mapped to only some of the lncRNAs (3724 out of a total list of 13868 lncRNAs, with 1685 being differentially expressed).
I have tried clusterProfiler, which works with the log2FC of all the lncRNA genes and uses the ENSEMBL ID as the identifier. It worked well with mRNA genes, and I got a result table with the top upregulated and downregulated GO terms.
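library(clusterProfiler)
library(org.Hs.eg.db)
# suppose I have a data frame 'df' with 2 columns, ENSEMBLID and log2fc, for all lncRNAs
gene_list <- df$log2fc
names(gene_list) <- df$ENSEMBLID
gene_list <- sort(gene_list, decreasing = TRUE) # sort by decreasing log2fc; sort() keeps the names, order() would return only indices
gse <- gseGO(geneList = gene_list, ont = "ALL", OrgDb = org.Hs.eg.db, keyType = "ENSEMBL") # keyType defaults to "ENTREZID"; the rest I keep at defaults
gse@result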
For mRNA genes, this gives me a result table with GO terms, p-values, etc., but for lncRNA genes it gives me a table with 0 rows.
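One thing worth ruling out before blaming the annotation is the shape of the ranked list itself. Below is a minimal base-R sketch of how such a list could be built. The helper name `make_ranked_list` and the toy IDs are made up for illustration. It also assumes the IDs may carry GENCODE-style version suffixes (e.g. `.5`), which OrgDb packages do not store, so it strips them; if your IDs are already unversioned, that step does nothing.

```r
# Hypothetical helper (not part of clusterProfiler): build a named log2FC
# vector sorted in decreasing order, as GSEA-style functions expect.
make_ranked_list <- function(df, id_col = "ENSEMBLID", fc_col = "log2fc") {
  # Assumption: IDs may look like "ENSG00000000001.5"; drop the ".5" part.
  ids <- sub("\\.[0-9]+$", "", df[[id_col]])
  fc <- df[[fc_col]]
  # Keep the first occurrence of each ID and drop missing values.
  keep <- !is.na(ids) & !is.na(fc) & !duplicated(ids)
  ranked <- fc[keep]
  names(ranked) <- ids[keep]
  # sort() preserves names; order() would return positions only.
  sort(ranked, decreasing = TRUE)
}

# Toy data with made-up IDs and fold changes.
toy <- data.frame(
  ENSEMBLID = c("ENSG00000000001.5", "ENSG00000000002.12",
                "ENSG00000000003", "ENSG00000000001.5"),
  log2fc = c(-1.2, 2.5, 0.3, 4.0),
  stringsAsFactors = FALSE
)
ranked <- make_ranked_list(toy)
print(ranked)
```

The result can be passed as `geneList` in the call above. Keeping the first occurrence of a duplicated ID is an arbitrary choice made here for simplicity; averaging or keeping the largest absolute fold change would be equally defensible.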
Similarly, I tried the gage package, which uses ENTREZID as the identifier. This worked well with mRNA genes, and once again I got the top enriched GO terms and KEGG pathways. For lncRNA genes, I did get separate results tables for GO terms (BP, MF and CC), but the lowest p-value among BP terms is 0.02 and only 12 BP terms have p < 0.05. Similarly, MF and CC have very few terms with p < 0.05. The KEGG pathway results table is all NA. Also, since only 3724/13868 ENTREZIDs could be mapped, I am not sure whether using this as the identifier is a good idea.
So, I am looking for tools (R-based or web-based) that can be used for GO term and KEGG pathway analysis of lncRNA genes with ENSEMBLID as the identifier. Any help will be very much appreciated. Thanks :)
I think that the problem is that lncRNA <-> GO/KEGG mapping is pretty poor (especially KEGG). In other words, I don't think it's a technical issue, I think it's a real issue.
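To put a number on how poor the mapping is, you could count how many of your IDs have at least one annotated term. This is only a rough base-R sketch: the helper `annotation_coverage`, the column names `ENSEMBL` and `GO`, and the toy table are assumptions for illustration, not output from any real database.

```r
# Hypothetical helper: fraction of query IDs with at least one annotated term.
# 'mapping' is assumed to have one row per (gene, term) pair; unannotated
# genes may appear with NA in the term column or be absent altogether.
annotation_coverage <- function(ids, mapping,
                                id_col = "ENSEMBL", term_col = "GO") {
  ids <- unique(ids)
  annotated <- unique(mapping[[id_col]][!is.na(mapping[[term_col]])])
  n_hit <- sum(ids %in% annotated)
  c(total = length(ids), annotated = n_hit, fraction = n_hit / length(ids))
}

# Toy mapping with made-up gene/term pairings.
toy_mapping <- data.frame(
  ENSEMBL = c("ENSG00000000001", "ENSG00000000001",
              "ENSG00000000002", "ENSG00000000003"),
  GO = c("GO:0000001", "GO:0000002", NA, "GO:0000003"),
  stringsAsFactors = FALSE
)
toy_ids <- c("ENSG00000000001", "ENSG00000000002",
             "ENSG00000000003", "ENSG00000000004")
coverage <- annotation_coverage(toy_ids, toy_mapping)
print(coverage)
```

For real data, one way to get such a table would be `AnnotationDbi::select(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL", columns = "GO")`. If the fraction comes out very low for the lncRNAs and high for the mRNAs, that would support the idea that the empty result reflects missing annotation and not a problem with the code.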
Yeah, I also think that's the main problem. But two papers, 10.1186/s12865-019-0297-9 and 10.3233/CBM-190215, used the DAVID web server to find GO terms and KEGG pathways for lncRNA genes. They used a slightly different annotation GTF (Homo_sapiens.GRCh38.83.gtf), whereas I used GENCODE v30. So I thought it might be a problem with the ENSEMBL IDs, but when I intersect my ENSEMBLID list with the full ENSEMBLID list from that GTF, almost all of my IDs match. So it's not a problem with the ENSEMBL IDs. Also, when I try the DAVID web server myself, I don't get any results.