Question

How to deal with duplicated gene IDs in TCGA RNA expression data?

18 months ago

Camilo Andres ▴ 40

Hi! I am trying to find genes that are differentially expressed between two groups of breast cancer patients in the TCGA data, but I noticed that some genes are duplicated (by Gene Entrez ID), with completely different expression levels between the duplicate entries. How can I deal with this? Should I just remove the duplicates, or is there a more sophisticated way to handle them without losing data?

Thanks in advance!

TCGA Expression mRNA • 1.5k views

ADD COMMENT • link updated 18 months ago by LauferVA 4.5k • written 18 months ago by Camilo Andres ▴ 40

Are they distinct transcript IDs but the same gene ID? Can you provide any additional context?

ADD REPLY • link 18 months ago by LauferVA 4.5k

Sorry, it is data from the TCGA project. The only columns it has are Hugo symbols, Gene Entrez IDs, and Tumor Sample Barcodes. I don't really know how to deal with those duplicates.

ADD REPLY • link 18 months ago by Camilo Andres ▴ 40

OK, interesting.

Do all, most, some, or only a few of the genes have repeated lines? Is there a handy link to the exact file you are looking at?

ADD REPLY • link 18 months ago by LauferVA 4.5k

A few of them, about 25, according to the Gene Entrez IDs. The data are from the PanCancer project, corresponding to breast invasive carcinoma, and were downloaded from cBioPortal.

The file I am looking at is "data_mrna_seq_v2_rsem_zscores_ref_all_samples". From what I read on other forums, I think the duplicates could be due to different transcripts, but since the file has no transcript IDs, it is impossible to know whether that is the case.
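For anyone wanting to look at the repeated entries side by side, here is a minimal sketch that lists every line whose Entrez ID occurs more than once. It assumes the file is tab-delimited with a header row and that the ID column is named `Entrez_Gene_Id`; both are assumptions about the download, so adjust `id_col` if your header differs. The function name is just illustrative.

```python
import csv
from collections import Counter


def duplicated_entrez_rows(path, id_col="Entrez_Gene_Id"):
    """Return {entrez_id: [rows]} for every ID found on more than one line.

    Assumes a tab-delimited file with a header row; `id_col` is an assumed
    column name. Rows with an empty ID are ignored.
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))

    # Count how many lines carry each non-empty ID.
    counts = Counter(row[id_col] for row in rows if row[id_col])

    duplicates = {}
    for row in rows:
        entrez_id = row[id_col]
        if entrez_id and counts[entrez_id] > 1:
            duplicates.setdefault(entrez_id, []).append(row)
    return duplicates
```

Calling it on the downloaded file and checking `len(result)` should give the number of affected genes, which can be compared against the roughly 25 mentioned above.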

Many thanks for your help

ADD REPLY • link 18 months ago by Camilo Andres ▴ 40

I also said that on this forum xD.

But anyway, it's probably not actually impossible. What I would do is first read through the FAQ, paying particular attention to every part that mentions RSEM.

Doing that actually leads you back to Biostars, specifically to this post. Note the bit about:

.junction_quantification.txt

.rsem.genes.results

.rsem.isoforms.results

.rsem.genes.normalized_results

.rsem.isoforms.normalized_results

.bt.exon_quantification.txt

What I would do is find two files from an identical source, one being the .rsem.genes.results file and the other being the file with the suffix .rsem.isoforms.results. Compare these files. If they differ by the same number of lines as the number of redundant gene entries, the isoform idea becomes a highly likely explanation that can then be verified.
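A rough sketch of that line-count comparison, assuming both files have a single header line and one record per line (the function names and the paths you pass in are placeholders, not anything defined by TCGA or RSEM):

```python
def count_data_rows(path):
    """Count non-empty lines after the first (header) line."""
    with open(path) as handle:
        next(handle, None)  # skip the assumed single header line
        return sum(1 for line in handle if line.strip())


def isoform_gap(genes_path, isoforms_path, n_redundant):
    """Return (gap, matches), where gap is isoform rows minus gene rows.

    `matches` is True when the gap equals the number of redundant gene
    entries seen in the expression file.
    """
    gap = count_data_rows(isoforms_path) - count_data_rows(genes_path)
    return gap, gap == n_redundant
```

A matching gap is only suggestive, not proof; it just tells you whether the isoform explanation is worth checking further.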

Another way to go about the same thing would be to go to the RSEM documentation and read the file descriptions to see whether this issue is mentioned.

Finally, another approach is to email cBioPortal support itself with this exact issue. Link this post, and provide links to the exact file. They should be able to tell you.

ADD REPLY • link 18 months ago by LauferVA 4.5k

Where did you get the data from? Which reference build and gene model is it using? There are many versions of TCGA expression data around. If Gene Entrez IDs are not unique in the particular file you have, have you looked at other columns, such as Hugo symbols, to see whether they are unique?
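One quick way to check the other column, as a sketch: for each Entrez ID that appears on more than one line, collect the distinct Hugo symbols attached to it. The column names `Entrez_Gene_Id` and `Hugo_Symbol` and the tab delimiter are assumptions about the file, and the function name is made up for illustration.

```python
import csv
from collections import defaultdict


def symbols_for_repeated_ids(path, id_col="Entrez_Gene_Id",
                             symbol_col="Hugo_Symbol"):
    """Map each repeated Entrez ID to the sorted distinct symbols on its lines.

    Assumes a tab-delimited file with a header row; both column names are
    assumed defaults and can be overridden.
    """
    symbols = defaultdict(set)
    line_counts = defaultdict(int)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            symbols[row[id_col]].add(row[symbol_col])
            line_counts[row[id_col]] += 1
    # Keep only IDs that occur on more than one line.
    return {entrez_id: sorted(symbols[entrez_id])
            for entrez_id, n in line_counts.items() if n > 1}
```

If a repeated ID comes back with two or more symbols, the Hugo symbols are distinguishing the lines; if it comes back with a single symbol, both columns are duplicated and the rows differ only in their values.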