How to handle duplicated genes in TCGA data?
6 months ago
Ngrin • 0

I am working with TCGA data downloaded from the GDC portal. For mRNA data I selectively downloaded the star_gene_counts.tsv files. However, the gene identifiers are in the format GeneSymbol___EnsemblID, e.g. SCYL3___ENSG00000000457.14. I need to have only gene symbols, which results in duplicated gene symbols. As a first thought, I have written the code below to keep the most variable row per gene symbol. Any thoughts?

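library(magrittr)  # provides %>%

# Strip the Ensembl suffix once; symbols[i] is the gene symbol of row i
symbols <- gsub("___ENS.*", "", rownames(mRNA_data))
mRNA_genes <- unique(symbols)

# For each gene symbol, keep the row with the highest variance across samples
mRNA_gene_data <- pbmcapply::pbmclapply(mRNA_genes, mc.cores = 10, function(gene){
  # Exact match on the symbol; grep(paste0("^", gene)) would also match any
  # symbol that merely starts with `gene` and treats the symbol as a regex
  gene_rows <- which(symbols == gene)
  gene_data <- mRNA_data[gene_rows, , drop = FALSE]
  if(length(gene_rows) == 1){
    gene_data
  }else{
    max_variance <- apply(gene_data, 1, function(x) var(x, na.rm = TRUE)) %>% which.max(.)
    gene_data[max_variance, , drop = FALSE]
  }
}) %>% do.call(what = "rbind", .) %>% as.data.frame() %>% `rownames<-`(mRNA_genes)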
TCGA GDC mRNA • 793 views
6 months ago
txema.heredia ▴ 190

Having duplicate gene symbols is pretty common. Why do you need to work with gene symbols? To compare them to external datasets? Or just for representation/readability?

You could use the ensembl_id as the gene index and have an additional metadata column with the symbol. Or you could make a list of genes with duplicated symbols, and use just the symbol for the unique genes, and symbol_ensembl_id for the duplicated ones.
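
Both suggestions can be sketched in a few lines of R, assuming the row names follow the GeneSymbol___EnsemblID pattern from the question. The toy matrix `counts`, the GENEA/GENEB entries, and the names `gene_info` and `labels` are made up for illustration only.

```r
# Toy count matrix with the row-name format from the question (values are made up)
counts <- matrix(
  c(10, 12, 5, 7, 0, 1, 3, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("SCYL3___ENSG00000000457.14",
      "GENEA___ENSG00000000001.1",   # hypothetical duplicated symbol
      "GENEA___ENSG00000000002.1",
      "GENEB___ENSG00000000003.1"),
    c("sample1", "sample2")
  )
)

# Split each row name into symbol and Ensembl ID
parts <- strsplit(rownames(counts), "___", fixed = TRUE)
gene_info <- data.frame(
  symbol     = vapply(parts, `[`, character(1), 1),
  ensembl_id = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Option 1: index by Ensembl ID, keep the symbol as a metadata column
rownames(counts) <- gene_info$ensembl_id
rownames(gene_info) <- gene_info$ensembl_id

# Option 2: symbol alone when unique, symbol_ensembl_id when duplicated
is_dup <- gene_info$symbol %in% gene_info$symbol[duplicated(gene_info$symbol)]
labels <- ifelse(is_dup,
                 paste(gene_info$symbol, gene_info$ensembl_id, sep = "_"),
                 gene_info$symbol)
```

With option 1 the symbol can be looked up at any point via `gene_info[rownames(counts), "symbol"]`, so no rows have to be discarded.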


I second what is said above: ensembl_id is the unique identifier. As a good bioinformatics habit, always use a unique identifier for your analysis, and add the gene symbol at the end of the analysis.


Thanks @txema.heredia. The tool I am going to use only accepts gene symbols, which is why I cannot use Ensembl IDs.


Why does the tool require gene symbols and not Ensembl IDs? Is it retrieving information from somewhere else? If so, your best bet is to find out exactly what that source of extra information is and which version of Ensembl (or whatever) it uses.

Once you have that, check the ensembl_id of some of those genes with duplicate symbols. Then you'll have two options: either filter your data, keeping only one of each duplicated symbol (according to what the tool's database has), or redo your mapping using that exact Ensembl version.

Or, make a list of all the mismatched symbols between your list and the tool's DB, and try to find a way to convert/rename the symbols from one version to the other.
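
The options above could be sketched roughly as follows. `tool_db` is a hypothetical table standing in for whatever symbol-to-Ensembl mapping the tool actually uses; its contents and the gene names are invented for illustration, and version suffixes are stripped on the assumption that the tool stores unversioned IDs.

```r
# Row annotation parsed from the GDC row names (toy values)
gene_info <- data.frame(
  symbol     = c("GENEA", "GENEA", "GENEB", "GENEC"),
  ensembl_id = c("ENSG00000000001.1", "ENSG00000000002.1",
                 "ENSG00000000003.1", "ENSG00000000004.1"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping used by the downstream tool
tool_db <- data.frame(
  symbol     = c("GENEA", "GENEB", "GENEC_OLD"),
  ensembl_id = c("ENSG00000000002", "ENSG00000000003", "ENSG00000000004"),
  stringsAsFactors = FALSE
)

# Drop the version suffix (e.g. ".1") so the IDs are comparable
gene_info$ensembl_base <- sub("\\.[0-9]+$", "", gene_info$ensembl_id)

# Symbols that occur more than once in the data
dup_symbols <- unique(gene_info$symbol[duplicated(gene_info$symbol)])

# Keep unique symbols as they are; for duplicated ones keep only the
# row whose Ensembl ID is the one the tool knows
keep <- !(gene_info$symbol %in% dup_symbols) |
  gene_info$ensembl_base %in% tool_db$ensembl_id
filtered <- gene_info[keep, ]

# Symbols present in the data but absent from the tool's mapping
mismatched <- setdiff(filtered$symbol, tool_db$symbol)

# Rename via the shared Ensembl ID wherever the tool has an entry
idx <- match(filtered$ensembl_base, tool_db$ensembl_id)
filtered$tool_symbol <- ifelse(is.na(idx), filtered$symbol, tool_db$symbol[idx])
```

Matching on the Ensembl ID rather than on the symbol is the design choice here: the ID stays stable across annotation releases far more often than the symbol does, so `mismatched` mostly lists renamed genes that `tool_symbol` then resolves.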
