I am working on a data set in which the "gene_symbol" column has multiple symbols in a single cell. For example: "DDX4 , SLC38A9" "CTD-2517M22.17, RECQL4" "CCDC183 , CCDC183-AS1 , RP11-216L13.18 , RP11-216L13.19" "AC108004.1 , DOC2B"
Some even have four names.
My question is: how can I find a single standard gene symbol to replace the multiple symbols in each cell? Since I am working with huge data sets, an automated approach such as a Python script would be a huge help.
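To make the cell format concrete, here is a minimal sketch of splitting such a cell into individual symbols. It assumes the separator is always a comma with optional surrounding whitespace, as in the examples above; the function name `split_symbols` is made up for illustration.

```python
def split_symbols(cell):
    """Split a comma-separated cell into a list of clean gene symbols."""
    # Strip whitespace around each piece and drop any empty pieces.
    return [part.strip() for part in cell.split(",") if part.strip()]


# The example cells quoted in the question.
cells = [
    "DDX4 , SLC38A9",
    "CTD-2517M22.17, RECQL4",
    "CCDC183 , CCDC183-AS1 , RP11-216L13.18 , RP11-216L13.19",
    "AC108004.1 , DOC2B",
]

for cell in cells:
    print(split_symbols(cell))
```

This only separates the symbols; deciding which one is the standard symbol still needs a reference table.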
See these posts; they may help you. The second post shows that it's not an easy question at all. Good luck!
Converting BLAST Alignments (NCBI database) to Gene ID
Finding Gene Symbol Synonyms
Using biomaRt to convert gene symbols to entrez id in dataframe of gene-sets
Sometimes the gene name depends on the species, approach, or database.
Batch query obsolete gene names to get current HGNC symbol
Python Code to standardize gene name in CSV file
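As a rough illustration of what such a standardization script might look like, here is a sketch that checks each symbol in a cell against a set of approved symbols and a table of old or alias names. Everything in it is an assumption: the function names are hypothetical, and the `approved` and `aliases` tables hold placeholder entries, not real gene data. In practice they would have to be built from a reference for the right species (for human data, for example, a table downloaded from HGNC).

```python
def split_symbols(cell):
    """Split a comma-separated cell into a list of clean symbols."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def standardize(cell, approved, aliases):
    """Return the approved symbols found in a cell, in order, without duplicates.

    approved: set of current standard symbols (assumed, user-supplied)
    aliases:  dict mapping old or alias names to a current symbol (assumed)
    """
    found = []
    for symbol in split_symbols(cell):
        # Keep the symbol if it is already approved, else try the alias table.
        current = symbol if symbol in approved else aliases.get(symbol)
        if current is not None and current not in found:
            found.append(current)
    return found


# Placeholder tables for illustration only; these are not real gene symbols.
approved = {"GENE1", "GENE2"}
aliases = {"OLDNAME1": "GENE1"}

print(standardize("OLDNAME1 , GENE1", approved, aliases))   # ['GENE1']
print(standardize("CLONE-1.2 , GENE2", approved, aliases))  # ['GENE2']
print(standardize("GENE1 , GENE2", approved, aliases))      # ['GENE1', 'GENE2']
```

A cell that resolves to exactly one symbol can be replaced automatically. A cell that resolves to none or to several (the third case) is ambiguous and probably needs a manual decision, which is part of why this is not an easy question.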