Hello all,
I want to cluster a group of genes with the dist and hclust functions in R. The data matrix contains raw counts. I know the gene symbols for the genes I want to cluster, but because the matrix uses Ensembl IDs, I used BioMart to translate the gene symbols into Ensembl IDs. At that point I realized that several genes are duplicated, with different Ensembl IDs for the same symbol. I checked those duplicates one by one and decided to remove the alternative sequence genes, so I now have a gene list of Ensembl IDs with no duplicates and no alternative sequences. The problem came up when I compared the hierarchical clustering results for the list before and after removing the alternative sequence genes. The results are so different that they would probably change the interpretation of my analysis. My question is: should I filter out the alternative sequences or not? When I check the Entrez IDs, only one Entrez ID corresponds to each gene symbol, so the problem is on the Ensembl ID side. Thank you all.
Well, what I want to do is get the genes from a certain GO term or pathway, like those in KEGG. So I start with Entrez IDs from KEGG and try to map them to Ensembl IDs. What I don't know is which definition of a gene GO or KEGG uses. What is the usual way to do this? I think mapping genes to GO terms or pathways is a common step in RNA-seq analysis, right?
In this case, the genes stand for their products (most often proteins), so you don't care about variants. You should then summarize the data for each gene, i.e. collapse all the variants of a gene into a single gene. For example, all Ensembl IDs that map to the same Entrez ID/gene symbol could be considered the same gene.
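As a rough sketch of what collapsing could look like in base R: the matrix, the IDs, and the mapping table below are all made up for illustration, and summing the counts is only one possible way to summarize (an assumption on my part, not the only valid choice). The idea is to give every Ensembl ID its gene symbol, add up the rows that share a symbol with rowsum, and then cluster the collapsed matrix.

```r
# Toy raw-count matrix: rows are Ensembl IDs, columns are samples.
# All IDs and counts are hypothetical placeholders.
counts <- matrix(
  c( 10,  12,   8,
      5,   7,   6,
    100,  90, 110,
     40,  35,  50),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("ENSG_A1", "ENSG_A2", "ENSG_B1", "ENSG_C1"),
    c("sample1", "sample2", "sample3")
  )
)

# Hypothetical ID mapping, shaped like a table you might get back from BioMart.
# ENSG_A1 and ENSG_A2 both map to the same gene symbol.
id_map <- data.frame(
  ensembl = c("ENSG_A1", "ENSG_A2", "ENSG_B1", "ENSG_C1"),
  symbol  = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)

# Look up the gene symbol for each row of the matrix, in row order.
groups <- id_map$symbol[match(rownames(counts), id_map$ensembl)]

# Sum the counts of all Ensembl IDs that share a symbol: one row per gene.
gene_counts <- rowsum(counts, group = groups)

# Cluster the collapsed matrix the same way as before.
hc <- hclust(dist(gene_counts))
```

Here gene_counts has one row each for GENE_A, GENE_B and GENE_C, so the duplicated IDs no longer enter the distance matrix as separate genes. If some Ensembl IDs have no symbol in your mapping, decide what to do with them before calling rowsum.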