I am working with TCGA data downloaded from the GDC portal. For the mRNA data I selectively downloaded the star_gene_counts.tsv files. However, the gene identifiers are in the format GeneSymbol___EnsemblID, e.g. SCYL3___ENSG00000000457.14. I need to keep only the gene symbols, which produces duplicated gene symbols. As a first thought, I have written the code below to keep the most variable entry per gene symbol. Any thoughts?
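library(magrittr)  # provides the %>% pipe
# Gene symbol of every row, e.g. "SCYL3___ENSG00000000457.14" -> "SCYL3"
row_symbols <- sub("___ENS.*", "", rownames(mRNA_data))
mRNA_genes <- unique(row_symbols)
# Group rows by gene symbol and keep the most variable row per symbol
mRNA_gene_data <- pbmcapply::pbmclapply(mRNA_genes, mc.cores = 10, function(gene){
  # Exact match: grep(paste0("^", gene), ...) would also pick up longer symbols sharing the prefix
  gene_rows <- which(row_symbols == gene)
  gene_data <- mRNA_data[gene_rows, , drop = FALSE]
  if(length(gene_rows) == 1){
    gene_data
  }else{
    # Index of the row with the highest variance across samples
    max_variance <- apply(gene_data, 1, function(x) var(x, na.rm = TRUE)) %>% which.max(.)
    gene_data[max_variance, , drop = FALSE]
  }
}) %>% do.call(what = "rbind", .) %>% as.data.frame() %>% `rownames<-`(mRNA_genes)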
I second what is said above: the Ensembl ID is the unique identifier. As a good bioinformatics habit, always use the unique identifier for your analysis, and add the gene symbol at the end of the analysis.
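A minimal sketch of that habit, in base R and on toy data. The two GENEX rows and their Ensembl IDs are made up for illustration, and the per-row variance is only a stand-in for whatever the real analysis is. The point is that the rows stay keyed by Ensembl ID throughout, and the symbol is attached only in the final table:

```r
# Toy count matrix; row names follow the "SYMBOL___ENSEMBLID" pattern from the question.
counts <- matrix(
  c(10, 20, 30,
     5,  5,  5,
     1,  9,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("SCYL3___ENSG00000000457.14",
      "GENEX___ENSG00000000001.1",   # made-up ID, for illustration only
      "GENEX___ENSG00000000002.1"),  # made-up ID, same symbol as the row above
    c("sample1", "sample2", "sample3")
  )
)

# Split each row name into its symbol and its Ensembl ID.
parts <- strsplit(rownames(counts), "___", fixed = TRUE)
id_map <- data.frame(
  symbol     = vapply(parts, `[`, character(1), 1),
  ensembl_id = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Run the analysis with the unique Ensembl IDs as row names.
rownames(counts) <- id_map$ensembl_id
row_var <- apply(counts, 1, var)  # stand-in for the real analysis

# Attach the symbols only when reporting the results.
result <- data.frame(
  ensembl_id = names(row_var),
  symbol     = id_map$symbol[match(names(row_var), id_map$ensembl_id)],
  variance   = unname(row_var),
  stringsAsFactors = FALSE
)
```

With this layout, duplicated symbols never collide during the analysis, and the decision about which duplicate to keep can be deferred to the very last step.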
Thanks @txema.heredia. The tool I am going to use only accepts gene symbols, which is why I cannot use Ensembl IDs.
Why does the tool require gene symbols and not Ensembl IDs? Is it retrieving information from somewhere else? If so, your best bet is to find out exactly what that source of extra information is and which version of Ensembl (or whatever annotation) it uses.
Once you have that, check which Ensembl IDs it uses for some of the genes with duplicated symbols. Then you have two options: either filter your data, keeping only one entry per duplicated symbol (the one the tool's database has), or redo your mapping using that exact Ensembl version.
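The filtering option could look roughly like the sketch below. It assumes you can export the tool's symbol / Ensembl ID pairs into a table; `tool_ref` is a hypothetical name for that export, and all IDs are made up. It also assumes the tool's IDs carry no version suffix, so the `.N` suffix is stripped from your IDs before comparing:

```r
# Symbol / Ensembl ID pairs parsed from your data (toy values, made-up IDs).
id_map <- data.frame(
  symbol     = c("GENEX", "GENEX", "GENEY"),
  ensembl_id = c("ENSG00000000001.1", "ENSG00000000002.1", "ENSG00000000003.2"),
  stringsAsFactors = FALSE
)

# Hypothetical export of the pairs used by the tool's database (unversioned IDs assumed).
tool_ref <- data.frame(
  symbol     = c("GENEX", "GENEY"),
  ensembl_id = c("ENSG00000000002", "ENSG00000000003"),
  stringsAsFactors = FALSE
)

# Drop the version suffix (".1", ".14", ...) so both sides are comparable.
id_map$ensembl_base <- sub("\\.[0-9]+$", "", id_map$ensembl_id)

# Symbols that occur more than once in your data.
dup_symbols <- unique(id_map$symbol[duplicated(id_map$symbol)])

# Keep a row if its symbol is unique, or if the tool uses exactly this symbol / ID pair.
my_pairs   <- paste(id_map$symbol, id_map$ensembl_base, sep = "___")
tool_pairs <- paste(tool_ref$symbol, tool_ref$ensembl_id, sep = "___")
keep       <- !(id_map$symbol %in% dup_symbols) | my_pairs %in% tool_pairs
filtered   <- id_map[keep, ]
```

A duplicated symbol with no matching pair in `tool_ref` is dropped entirely here; whether that is acceptable depends on the tool.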
Or, make a list of all the mismatched symbols between your list and the tool's DB, and try to find a way to convert/rename the symbols from one version to the other.
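As a rough illustration of that last idea, the mismatches can be listed with `setdiff()` and then renamed through a lookup vector. All symbols below are placeholders, and `alias_map` is a hypothetical old-to-new table that you would have to build yourself, for example from a gene nomenclature alias list:

```r
my_symbols   <- c("GENEX", "OLDNAME1", "GENEY")   # symbols in your data (toy values)
tool_symbols <- c("GENEX", "NEWNAME1", "GENEZ")   # symbols the tool knows (hypothetical)

# Symbols present on one side only.
only_mine <- setdiff(my_symbols, tool_symbols)
only_tool <- setdiff(tool_symbols, my_symbols)

# Hypothetical old -> new lookup for symbols that were renamed between versions.
alias_map <- c(OLDNAME1 = "NEWNAME1")

# Rename where a mapping exists; leave every other symbol untouched.
renamed <- unname(ifelse(my_symbols %in% names(alias_map),
                         alias_map[my_symbols],
                         my_symbols))

# Whatever is still unmatched needs manual inspection.
still_unmatched <- setdiff(renamed, tool_symbols)
```

Inspecting `only_mine` and `only_tool` side by side is usually the quickest way to see whether the mismatches are renames or genes that are simply absent from one of the two versions.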