Which gene name should I choose?
5.0 years ago

I am working on a data set in which the "gene_symbol" column has multiple symbols in a single cell. For example: "DDX4 , SLC38A9" "CTD-2517M22.17, RECQL4" "CCDC183 , CCDC183-AS1 , RP11-216L13.18 , RP11-216L13.19" "AC108004.1 , DOC2B"

Some even have four names.

My question is: how can I find a single standard symbol to replace the multiple symbols in each cell? Since I am working with huge data sets, an automated approach such as a Python script would be a huge help.


See these posts; they may help you. The second post shows that it's not an easy question at all. Good luck!

Converting BLAST Alignments (NCBI database) to Gene ID

Finding Gene Symbol Synonyms

Using biomaRt to convert gene symbols to entrez id in dataframe of gene-sets

Sometimes the gene name depends on the species, approach, or database.

Batch query obsolete gene names to get current HGNC symbol

Python Code to standardize gene name in CSV file

5.0 years ago

The issue here is not that the symbols represent the same gene and that you need to choose one over the other. The issue, at least in the one example that you provide, is that DDX4 and SLC38A9 are genes that happen to have at least one transcript that overlaps in GTEx:

[Screenshot of the genome browser view; see the UCSC link below]

[UCSC direct link (will expire): https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtMo... ]

As you can see, evidently, in one or more tissues from GTEx, SLC38A9 was found to have a transcript that extends into DDX4.

I would review the analysis pipeline that produced this data in order to see why you have annotation like this, and then you will be better informed about how to proceed.

Kevin


Since I am new to genomics, I do not understand a lot of the terminology at the moment. However, it would be great if you could let me know how I should proceed. I have also edited the question to add a few more cases where multiple gene names appear in the same cell, so you can see whether they also have overlapping transcripts or just happen to be aliases. In any case, thank you, Kevin.


I just looked at one other pairing and it's the same idea, i.e., they overlap: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&las...

You can search for the others in the UCSC Genome Browser to check. In some cases, they may be on opposite strands, as genomax points out, while in others they may be on the same strand.
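
If you would rather check pairs programmatically than by eye, something like the following might work. It is only a sketch: the `Gene` record is a hypothetical structure, and the coordinates are made-up placeholders, not the real positions of DDX4 or SLC38A9. You would fill them in from whatever annotation (e.g., a GTF file) your data was built on.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical record; fields mirror the usual GTF columns."""
    symbol: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the two genes share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def describe(a: Gene, b: Gene) -> str:
    """Summarise whether two genes overlap and how their strands relate."""
    if not overlaps(a, b):
        return "no overlap"
    if a.strand == b.strand:
        return "overlap, same strand"
    return "overlap, opposite strands"


# Placeholder coordinates for illustration only (NOT real gene positions).
gene_a = Gene("GENE_A", "chr1", 1_000, 5_000, "+")
gene_b = Gene("GENE_B", "chr1", 4_500, 9_000, "-")
gene_c = Gene("GENE_C", "chr1", 9_500, 12_000, "+")

print(describe(gene_a, gene_b))  # overlap, opposite strands
print(describe(gene_a, gene_c))  # no overlap
```

Note that this compares whole-gene spans; two genes can overlap by span without sharing any exon, so the browser view is still worth a look.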

5.0 years ago
GenoMax 147k

The gene names in all capital letters in the gene_name field are official human gene names (it appears that the data you are using was probably annotated with a GENCODE or Ensembl GTF file). The HUGO Gene Nomenclature Committee (HGNC) assigns these names.

The examples above are also transcribed in opposite directions.

Alright. I just want to know whether they are aliases. Can I choose one over the other, or is there a standard where the names are updated that I can refer to? Clearly the link you provided has both DDX4 and SLC38A9 in it, each with its own detailed description.


Those are not aliases. They are two separate genes that are transcribed on opposite strands in the same region.

Can you tell us how the file you are working with was generated? It is possible that your analysis was looking at genome regions (based on coordinates) and as a result multiple gene names may be included in that interval and thus in your file.
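
If that turns out to be the case, one option is to keep every symbol and give each gene its own row, rather than collapsing the cell to a single name. Here is a minimal sketch using only the standard library. It assumes a comma-separated `gene_symbol` column as in your examples; the CSV layout and the `probe_id` column are my assumptions, so adjust them to your actual file.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def split_symbols(cell: str) -> List[str]:
    """Split a cell like 'DDX4 , SLC38A9' into ['DDX4', 'SLC38A9']."""
    return [s.strip() for s in cell.split(",") if s.strip()]


def explode_rows(rows: Iterable[Dict[str, str]],
                 column: str = "gene_symbol") -> Iterator[Dict[str, str]]:
    """Yield one output row per gene symbol, copying the other columns."""
    for row in rows:
        for symbol in split_symbols(row[column]):
            yield {**row, column: symbol}


# Tiny in-memory stand-in for the real file; 'probe_id' is a hypothetical column.
example = io.StringIO(
    'probe_id,gene_symbol\n'
    'p1,"DDX4 , SLC38A9"\n'
    'p2,"AC108004.1 , DOC2B"\n'
)
long_rows = list(explode_rows(csv.DictReader(example)))
for row in long_rows:
    print(row["probe_id"], row["gene_symbol"])
```

For a real file, you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory example.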


I took the file from the TCGA data repository; I did not generate it myself.

Program: TCGA, Project: HNSC, Site: Floor of mouth, Data Category: DNA Methylation.


Those are most likely intervals of co-ordinates and thus cover multiple genes.
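
To illustrate how an interval ends up with several names: if a region is labelled with every gene it touches, any region that falls where two genes overlap gets both symbols. A rough sketch follows, using a made-up annotation table. The gene names and coordinates are placeholders, and joining with " , " simply mimics the format shown in the question; I do not know how the TCGA pipeline actually builds that column.

```python
def genes_in_interval(chrom, start, end, annotation):
    """Return symbols of all genes overlapping chrom:start-end (inclusive)."""
    return [
        symbol
        for symbol, g_chrom, g_start, g_end in annotation
        if g_chrom == chrom and g_start <= end and start <= g_end
    ]


# Made-up annotation: (symbol, chromosome, start, end). Not real coordinates.
annotation = [
    ("GENE_A", "chr1", 1_000, 5_000),
    ("GENE_B", "chr1", 4_500, 9_000),
    ("GENE_C", "chr2", 1_000, 2_000),
]

# A region inside the stretch shared by GENE_A and GENE_B picks up both names.
label = " , ".join(genes_in_interval("chr1", 4_800, 4_900, annotation))
print(label)  # GENE_A , GENE_B
```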
