Question

How do I update the gene names of TCGA data correctly?

5 months ago

Ivana • 0

Dear all,

I want to work with TCGA RNA-seq data. Because my list of genes of interest is annotated according to the latest HGNC update (04.06.2024), I wanted to bring the TCGA gene names to the same version. However, when I do this using the Ensembl gene ID, there are roughly 3,800 genes that I cannot match. I also tried matching by gene name, but even more genes fail to match that way.

I am still a beginner in bioinformatics and I would be grateful for any tips or suggestions on how to annotate/update the TCGA gene names!

Thank you!

Best,
Ivana

HGNC TCGA • 738 views

ADD COMMENT • link updated 5 months ago by Zhenyu Zhang ★ 1.2k • written 5 months ago by Ivana • 0

What do you mean by "cannot match"? Can you give us an example?

ADD REPLY • link 5 months ago by Ram 44k

This is a classic bioinformatics question, and there is no standard way to do it. You are trading off false positives against false negatives in your mappings.

I normally combine all of the following mappings:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA//GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
org.Hs.eg.db gene
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/kgAlias.txt.gz (excluding ENST transcript and uc transcript b/c I don't trust them)

And then you can set up a rule. My rule is that if a mapping is not unique, I inspect it manually.
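The combine-then-flag rule described above could be sketched roughly as follows. This is only an illustration using the standard library: the function names are hypothetical, and the `(alias, symbol)` pairs are invented placeholders standing in for rows you would parse from the three sources, not real gene names.

```python
from collections import defaultdict


def combine_mappings(*sources):
    """Merge several lists of (alias, symbol) pairs into {alias: set of symbols}."""
    combined = defaultdict(set)
    for source in sources:
        for alias, symbol in source:
            combined[alias].add(symbol)
    return combined


def split_unique_ambiguous(combined):
    """Separate aliases with exactly one target from those needing manual review."""
    unique = {a: next(iter(s)) for a, s in combined.items() if len(s) == 1}
    ambiguous = {a: sorted(s) for a, s in combined.items() if len(s) > 1}
    return unique, ambiguous


# Placeholder pairs (NOT real gene names), one list per mapping source.
ncbi_pairs = [("OLD1", "GENE_A"), ("OLD2", "GENE_B")]
orgdb_pairs = [("OLD1", "GENE_A"), ("OLD3", "GENE_C")]
ucsc_pairs = [("OLD2", "GENE_D")]

unique, ambiguous = split_unique_ambiguous(
    combine_mappings(ncbi_pairs, orgdb_pairs, ucsc_pairs)
)
print(unique)     # aliases that all sources agree on
print(ambiguous)  # aliases that point to more than one symbol
```

Only the entries in `ambiguous` would need a manual look; everything in `unique` can be applied automatically.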

ADD REPLY • link 5 months ago by Zhenyu Zhang ★ 1.2k

score 0 · Answer 1 · 2024-06-06

5 months ago

ATpoint 85k

Updates to gene annotations can deprecate certain gene names while adding others. If you use existing TCGA quantifications (without starting from FASTQ files), then I would really just focus on the gene IDs that currently have a match in the recent annotation you want to use, and give the others some dummy name, like missing_[0-9]+. There is no point trying to force-match them somehow. If they're gone in recent HGNC then they're gone. The only truly "good" workaround would be to reprocess TCGA from the FASTQ files onward, but that is access-restricted and tedious. You cannot expect old annotations to perfectly match recent ones. If that were the case, the new annotations would be pointless, no?

Or just use the existing annotations in the TCGA databases, without updating. Is it really critical to "update" here?
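A minimal sketch of the dummy-name idea, assuming you already have a dictionary from Ensembl gene ID to current HGNC symbol. The function name `assign_symbols` is hypothetical, and the last ID in the example is a made-up placeholder for a gene without a current match.

```python
def assign_symbols(ensembl_ids, id_to_symbol):
    """Map each ID to its current symbol, or to a missing_<n> placeholder."""
    names = {}
    missing = 0
    for gene_id in ensembl_ids:
        symbol = id_to_symbol.get(gene_id)
        if symbol is None:
            # No match in the recent annotation: give it a dummy name.
            missing += 1
            symbol = f"missing_{missing}"
        names[gene_id] = symbol
    return names


# Current ID -> symbol table (tiny excerpt for illustration).
id_to_symbol = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
}

# IDs from the expression table; the last one is a hypothetical unmatched ID.
tcga_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG99999999999"]

names = assign_symbols(tcga_ids, id_to_symbol)
print(names)
```

The placeholders keep every row of the expression table uniquely named, so nothing is silently dropped or force-matched.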

ADD COMMENT • link 5 months ago by ATpoint 85k

Hi ATpoint, thank you for your suggestion and feedback! What I want to achieve is basically just to check expression levels for my genes of interest, and since they have been annotated with the latest HGNC update, I thought I needed to do the same with the TCGA gene names. My concern was that I might not be able to find my genes of interest just because the TCGA gene names still use some "previous" names. Going with dummy names is an interesting idea. I will give it a try! Thank you!

ADD REPLY • link 5 months ago by Ivana • 0

Can't you check by Ensembl ID? These should be constant.
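A rough sketch of looking up genes of interest by Ensembl ID. It assumes the TCGA table carries versioned IDs (a suffix such as `.17`) while the gene list uses unversioned ones, which is a common reason for ID lookups to fail; check whether that applies to your files. The function names, version numbers and expression values here are illustrative only.

```python
def strip_version(gene_id):
    """Drop everything after the first dot, e.g. 'ENSG00000141510.17' -> 'ENSG00000141510'."""
    return gene_id.split(".", 1)[0]


def find_genes(expression, genes_of_interest):
    """Look up unversioned IDs in a table keyed by (possibly versioned) IDs."""
    by_base_id = {strip_version(g): value for g, value in expression.items()}
    found = {g: by_base_id[g] for g in genes_of_interest if g in by_base_id}
    not_found = [g for g in genes_of_interest if g not in by_base_id]
    return found, not_found


# Made-up expression values keyed by versioned IDs.
expression = {"ENSG00000141510.17": 12.5, "ENSG00000012048.23": 3.1}

# Genes of interest with unversioned IDs.
wanted = ["ENSG00000141510", "ENSG00000139618"]

found, not_found = find_genes(expression, wanted)
print(found)
print(not_found)
```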

ADD REPLY • link 5 months ago by ATpoint 85k

Yes, they might be constant, but the problem is rather that a gene can be linked to multiple Ensembl IDs.

ADD REPLY • link 5 months ago by Ivana • 0

"the problem is rather that a gene can be linked to multiple Ensembl IDs."

That's by design. Restrict yourself to the canonical chromosomes and you should see a 1<->1 mapping for the most part (except for pseudogenes, miRNAs, PAR genes, etc.)
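One way to check this on your own annotation table, sketched with the standard library. It assumes each record is a `(gene_id, symbol, chromosome)` tuple with UCSC-style chromosome names; the function names are hypothetical and the records below are invented placeholders, not real annotations.

```python
from collections import defaultdict

# chr1..chr22 plus chrX, chrY and chrM; adjust if your file uses "1", "2", ...
CANONICAL = {f"chr{c}" for c in [*range(1, 23), "X", "Y", "M"]}


def ids_per_symbol(records, canonical_only=True):
    """Group gene IDs by symbol, optionally skipping non-canonical contigs."""
    grouped = defaultdict(set)
    for gene_id, symbol, chrom in records:
        if canonical_only and chrom not in CANONICAL:
            continue
        grouped[symbol].add(gene_id)
    return grouped


def multi_mapped(grouped):
    """Return only the symbols linked to more than one gene ID."""
    return {s: sorted(ids) for s, ids in grouped.items() if len(ids) > 1}


# Placeholder records: GENE_A has a second ID on a hypothetical alternate contig.
records = [
    ("ID_A1", "GENE_A", "chr6"),
    ("ID_A2", "GENE_A", "chr6_hypothetical_alt"),
    ("ID_B1", "GENE_B", "chr1"),
]

print(multi_mapped(ids_per_symbol(records, canonical_only=False)))
print(multi_mapped(ids_per_symbol(records)))
```

Whatever is still listed after the canonical filter is the small set (pseudogenes, miRNAs, PAR genes and similar) that needs a decision by hand.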

ADD REPLY • link 5 months ago by Ram 44k