Python Code to standardize gene name in CSV file
7.7 years ago

Hi all, I have a CSV file that includes gene names in different formats. For example, PD-L1 might be written as PDL1, PD-L1, or PDL-1, but I want to standardize all of these to HGNC symbols using Python code. Can anyone tell me how I should go about it? I know how to do it for just one gene name (I can manually look up the symbol in HGNC and replace it), but my issue is that I might have many gene names. So I want the code to look up each gene name in my CSV file, automatically fetch the HGNC symbol for it, and replace the existing value with the HGNC symbol. Any help would be deeply appreciated. Thank you :)

Python Coding HGNC gene symbol • 3.4k views

Sounds like you're just going to need lots of regexes.


To elaborate, I would start broad. It depends on what sort of input you have, as you only gave us one example, but you might be able to do some really simple 'space reduction' first: for example, a regex to remove all hyphens, then a regex to remove any spaces, newlines, punctuation, etc. Once you've got all the gene IDs in some kind of consistent format (e.g. purely alphanumeric), you might be able to work with specific formats more easily.
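
A minimal sketch of that 'space reduction' step, assuming the gene names are already available as plain Python strings. The helper name `normalize_gene_token` and the choice to upper-case the result are illustrative assumptions, not anything HGNC prescribes:

```python
import re

def normalize_gene_token(raw: str) -> str:
    """Drop everything except letters and digits, then upper-case the rest."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()

# The spelling variants from the question, plus one with stray whitespace.
variants = ["PD-L1", "PDL1", "PDL-1", " pd l1\n"]
normalized = {normalize_gene_token(v) for v in variants}
print(normalized)
```

All four variants collapse to the same key here. Note that stripping punctuation this aggressively could also merge names that are meant to be distinct, so the result is probably best used as a lookup key rather than as the final symbol.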


I can use regexes for symbols. But in general I would want to replace all gene names with the ones given in HGNC. PD-L1 was just an example.


Have you tried http://www.genenames.org/cgi-bin/symbol_checker as a half-way solution? You could screen scrape this resource with various Python libraries, since I don't think there is a set API.


I did not know about the symbol checker. Thank you for the info.


Are you sure you want HGNC symbols? Because for PD-L1 the symbol is CD274, not just PDL1 or similar. What exactly do you want, and why? Could you please provide more examples of gene names in your input file and the desired output? What species is used (human?), and do you have any idea where these names are coming from?


Yes, HGNC symbols. For example, my CSV file may have "TS" as an input, but the HGNC symbol for that is TYMS, so I want TYMS as the output. But I do not want to replace each gene symbol manually. The species is human, and the gene names are coming from PubMed abstracts (different authors refer to genes differently, and I want all of them to be the HGNC ones). I hope this is clear.
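
One way this could look in Python is a dictionary lookup from known aliases to approved symbols, applied to one column of the CSV. The sketch below is only an illustration: the `ALIASES` table is hand-made from the two examples in this thread (a real table would have to be built from an HGNC export), and the column names `pmid` and `gene`, the sample rows, and the function names are all hypothetical.

```python
import csv
import io
import re

# Hypothetical, hand-made alias table: approved symbol -> known aliases.
ALIASES = {
    "CD274": ["PD-L1", "PDL1"],
    "TYMS": ["TS"],
}

def _key(name):
    """Punctuation-free, upper-case form used only for matching."""
    return re.sub(r"[^A-Za-z0-9]", "", name).upper()

def build_lookup(aliases):
    """Map the matching key of every alias (and symbol) to the symbol."""
    lookup = {}
    for symbol, names in aliases.items():
        for name in [symbol, *names]:
            lookup[_key(name)] = symbol
    return lookup

def standardize_csv(text, column, lookup):
    """Return CSV text with `column` rewritten where an alias is known."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Unknown names are left untouched so they can be reviewed by hand.
        row[column] = lookup.get(_key(row[column]), row[column])
        writer.writerow(row)
    return out.getvalue()

sample = "pmid,gene\n1,PDL-1\n2,TS\n3,unknown-gene\n"
result = standardize_csv(sample, "gene", build_lookup(ALIASES))
print(result)
```

An alias can belong to more than one gene, and this sketch silently keeps whichever symbol was added last, so ambiguous names would still need manual review.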


This is clear. This is an interesting task to solve.

I am not sure there is a good standard solution if the input can be any name used in a PubMed abstract.

Could you please upload an input file somewhere and share it here? I will take a look over the weekend.

Do you have any means to validate, at least manually (but by a professional), whether the generated output is correct? Is there a possibility that some of the genes mentioned are from model organisms rather than human? If so, can you extract from each abstract which species was used for each gene name? Do you want the human ortholog name or the original gene's HGNC from that model organism? By any chance, do you have access to the full publication text, or to transcript IDs extracted from it?

Is this a research task for a public university/institute, or a commercial application?


If your example is at all literal (if the only difference ends up being a dash in different places in the gene name), I would just run that column of the CSV through string.replace('-', ''), so that all of the instances you gave become "PDL1" and you can match your gene to that.

I'm guessing it's probably not that simple, so regexes would probably be the way to go.
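
For reference, a small sketch of the hyphen-stripping idea using only the standard library. The column name `gene`, the sample rows, and the helper `strip_hyphens` are made up for illustration:

```python
import csv
import io

def strip_hyphens(text, column):
    """Return the values of `column` with every '-' removed."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [row[column].replace("-", "") for row in rows]

sample = "gene\nPD-L1\nPDL-1\nPDL1\nPD L1\n"
cleaned = strip_hyphens(sample, "gene")
print(cleaned)
```

The three hyphen variants all become "PDL1", but the space-separated spelling is left as "PD L1", which is the kind of case where a regex would be needed instead.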
