7.1 years ago
maria.traka
Hi, I'm having an issue with adding gene names to the gene expression table of a bg (ballgown) object. texpr(bg)$gene_id has multiple entries for the same MSTRGx, which I expect since they are different isoforms of the same gene. Unfortunately, for some genes the first texpr entry has "." as its gene_name instead of the actual gene name. Because match() returns only the first hit, many of my gene names end up as ".". How can I work around this? I am losing a lot of genes from all my downstream functional analysis. Can you help? Thanks, Maria
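# Gene-level expression; row names are gene IDs (e.g. MSTRG.x)
gene_expression_ESC <- gexpr(bg_ESC_89)
# Transcript-level table, fetched once instead of on every lookup
tx_ESC <- texpr(bg_ESC_89, 'all')
# match() returns only the first transcript per gene_id, whose gene_name may be "."
indicesG <- match(rownames(gene_expression_ESC), tx_ESC$gene_id)
gene_names_F <- tx_ESC$gene_name[indicesG]
gene_names_T <- tx_ESC$t_name[indicesG]
gene_expression_ESC_N <- data.frame(geneNames=gene_names_F, ensIDs=gene_names_T, gene_expression_ESC)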
Are there any genes/transcripts in the reference GTF with "." as their name? Validate the reference GTF first. If the GTF has no issues, you can filter out the entries whose gene name is "." from the texpr table before matching.
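A minimal sketch of that filtering idea, in base R. It uses a small toy data frame as a stand-in for texpr(bg, 'all') and a toy vector in place of rownames(gexpr(bg)); the gene and transcript names are made up for illustration. The fallback to the MSTRG ID for genes with no named transcript is an assumption about what you might want, not something ballgown does for you.

```r
# Toy stand-in for texpr(bg, 'all'); in real use: tx <- texpr(bg_ESC_89, 'all')
tx <- data.frame(
  gene_id   = c("MSTRG.1", "MSTRG.1", "MSTRG.2", "MSTRG.3"),
  gene_name = c(".", "GENE_A", "GENE_B", "."),
  t_name    = c("MSTRG.1.1", "TX_A1", "TX_B1", "MSTRG.3.1"),
  stringsAsFactors = FALSE
)

# Toy stand-in for rownames(gexpr(bg))
gene_ids <- c("MSTRG.1", "MSTRG.2", "MSTRG.3")

# Drop transcripts whose gene_name is ".", so match() can only hit named ones
named <- tx[tx$gene_name != ".", ]
idx <- match(gene_ids, named$gene_id)
gene_names <- named$gene_name[idx]

# Genes with no named transcript at all give NA; fall back to the gene ID
no_name <- is.na(gene_names)
gene_names[no_name] <- gene_ids[no_name]

print(gene_names)
```

With this approach MSTRG.1 picks up the name of its known isoform even though the novel isoform is listed first, and truly novel genes keep their MSTRG ID rather than becoming ".".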
I'm using the Ensembl Homo_sapiens.GRCh38.89.gtf downloaded from their FTP site, so it's not that. I suspect these are putative novel isoforms of known genes, and because they happen to be listed before the known transcripts, match() is hitting those first. I have now managed a workaround where, as you suggest, I remove the "." entries from the texpr table, but it seems very convoluted to me. Anyhow, here it is:
Has anyone else had the same problem? I have to say I bumped into this when I was looking for something completely different. I can't think why this would be unique to my data.