Python script to query GeneCards to get EntrezID, symbol from Ensembl geneID

4.2 years ago

Shred ★ 1.6k

UPDATE: GeneCards recently introduced an explicit policy asking users not to use automated scraping tools. While the provided solution will (probably) still work for a single entry, multiple queries may be blocked and represent a policy violation.

Hi guys, GeneCards is one of the most comprehensive repositories for gene information. In RNA-seq analysis, the common task of converting Ensembl gene IDs to Entrez IDs with tools like biomaRt may leave many genes without a corresponding Entrez number.

Although most of these losses are pseudogenes or very poorly characterized genes, you may want to evaluate the loss, especially if some of the lost genes appear in the DEG list.

I've written this quick Python script to query GeneCards and retrieve information about these uncharacterized genes. Please keep in mind that it is not intended as a replacement for biomaRt or other tools: GeneCards implements protection against automated queries, so it would be impossible to assign a corresponding ID to every gene in your analysis. I've tested it with lists of 150-200 genes and it seems to work. If you keep getting the same error, consider running a smaller gene list.

It can be run in gene list mode, --list, where a file containing one Ensembl gene ID per line is submitted, or in single gene mode, --gene, passing a single Ensembl gene ID.
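As a side note on the list format, here is a minimal sketch of how such a file could be read defensively before any query is sent. It is not part of the script below: the helper name `read_ensembl_ids` is hypothetical, and it assumes human gene IDs of the form ENSG plus 11 digits, optionally followed by a ".N" version suffix.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optionally with a ".N" version
# suffix (e.g. ENSG00000018607.5). The suffix is dropped on purpose.
ENSG_PATTERN = re.compile(r"^(ENSG\d{11})(?:\.\d+)?$")


def read_ensembl_ids(lines):
    """Return unique, unversioned Ensembl gene IDs in input order.

    `lines` is any iterable of strings (an open file works). Blank lines and
    lines that do not look like a human Ensembl gene ID are skipped.
    """
    seen = set()
    ids = []
    for line in lines:
        match = ENSG_PATTERN.match(line.strip())
        if match is None:
            continue
        gene_id = match.group(1)
        if gene_id not in seen:
            seen.add(gene_id)
            ids.append(gene_id)
    return ids


if __name__ == "__main__":
    sample = ["ENSG00000018607\n", "\n", "ENSG00000042304.11\n", "not_an_id\n"]
    print(read_ensembl_ids(sample))
```

Skipping blank or malformed lines avoids sending empty queries, and dropping duplicates keeps the number of requests as low as possible.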

I'm working to bypass the website protection while minimizing the impact on users. Any suggestions are welcome.

Code -->

	#!/usr/bin/python3

	from bs4 import BeautifulSoup
	import requests
	import html
	import time
	import argparse

	'''
	Made in a boring day by @danilotat. Enjoy

	'''
	def get_symbol(html_file, ensgid):
	    soup = BeautifulSoup(html_file, 'html.parser')
	    try:
	        symbol_tag=soup.find('title').contents[0]
	        symbol = symbol_tag.split(' ')[0]
	        if symbol == ensgid:
	            return 'NaN'
	        else:
	            return symbol
	    except AttributeError:
	        # soup.find('title') returned None: no usable page was served
	        print('Genecards is blocking your requests. Please try again later with less geneid')
	        exit()


	def gene_card_request(ensgid):
	    '''
	    Passing Ensembl gene id, return its webpage on Genecards
	    '''
	    url_to_request='https://www.genecards.org/cgi-bin/carddisp.pl?gene=' + ensgid
	    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
	    r = requests.get(url_to_request, headers=headers)
	    gene_card_html = html.unescape(r.text)
	    return gene_card_html

	def get_entrez(html_file):
	    '''
	    Given a valid Ensembl gene id, returns Entrez gene id
	    '''
	    soup = BeautifulSoup(html_file, 'html.parser')
	    entrez_href = soup.find_all('a', {'title':'NCBI Entrez Gene'})
	    # usually it will return redundant refer: we'll take just the first of them
	    try:
	        link = entrez_href[0]
	        ncbi_url=link.get('href')
	        entrez_id = ncbi_url.split('/')[-1]
	        return entrez_id
	    except IndexError:
	        # empty list: no Entrez ID
	        return "NaN"

	def fill_dict(gene_list):
	    ''' Fill a dict with Ensembl geneid, entrez, symbol
	    '''
	    conv_dict={}
	    with open(gene_list, 'r') as engid_list:
	        for line in engid_list:
	            ensgid = line.rstrip()
	            gcard_page = gene_card_request(ensgid)
	            entrez_id = get_entrez(gcard_page)
	            symbol = get_symbol(gcard_page, ensgid)
	            conv_dict.setdefault(ensgid, []).extend([entrez_id,symbol])
	            # pause between requests
	            time.sleep(5)
	    return conv_dict

	if __name__ == '__main__':
	    parser = argparse.ArgumentParser(description="Giving a list of valid Ensembl gene ID, query GeneCards to get Entrez and symbols.")
	    parser.add_argument("--list", help="Input list of Ensembl gene ID, one per line.")
	    parser.add_argument("--gene", help="Single gene mode")
	    args = parser.parse_args()
	    if (args.list == None and args.gene == None):
	        parser.print_help()
	        exit()
	    if args.gene == None:
	        # was args.input, which argparse never defines (the option is --list)
	        res_dict = fill_dict(gene_list=args.list)
	        for ensgid in res_dict.keys():
	            print('{},{},{}'.format(ensgid,res_dict[ensgid][0], res_dict[ensgid][1]))
	    else:
	        gcard_page = gene_card_request(args.gene)
	        print('{},{},{}'.format(args.gene,get_entrez(gcard_page), get_symbol(gcard_page, args.gene)))

Requires BeautifulSoup

Ensembl biomart annotation • 4.3k views

ADD COMMENT • link 16 months ago by Shred ★ 1.6k

16 months ago

Anya • 0

Here's a slightly changed version of this script.

I changed line 42 (in get_entrez) from 'title':'NCBI Entrez Gene' to 'title':'NCBI Gene'.

And line 76 from res_dict = fill_dict(gene_list=args.input) to res_dict = fill_dict(gene_list=args.list), because the option is defined as --list.

Here's an example of how it works:

python3 get_from_gene_cards.py --list ensembl_ids.txt

Where ensembl_ids.txt looks like this:

ENSG00000018607
ENSG00000042304
ENSG00000064489

#!/usr/bin/python3

from bs4 import BeautifulSoup
import requests
import html
import time
import argparse

'''
Made in a boring day by @danilotat. Enjoy
'''
def get_symbol(html_file, ensgid):
    soup = BeautifulSoup(html_file, 'html.parser')
    try:
        title_tag = soup.find('title')
        if title_tag and title_tag.contents:
            symbol_tag = title_tag.contents[0]
            symbol = symbol_tag.split(' ')[0]
            if symbol == ensgid:
                return 'NaN'
            else:
                return symbol
        else:
            return 'NaN'
    except AttributeError:
        print('Genecards is blocking your requests. Please try again later with less geneid')
        exit()

def gene_card_request(ensgid):
    ''' 
    Passing Ensembl gene id, return its webpage on Genecards
    '''
    url_to_request = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=' + ensgid
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url_to_request, headers=headers)
    gene_card_html = html.unescape(r.text)
    return gene_card_html

def get_entrez(html_file):
    ''' 
    Given a valid Ensembl gene id, returns Entrez gene id
    '''
    soup = BeautifulSoup(html_file, 'html.parser')
    entrez_href = soup.find_all('a', {'title': 'NCBI Gene'})
    # usually it will return redundant refer: we'll take just the first of them
    try:
        link = entrez_href[0]
        ncbi_url = link.get('href')
        entrez_id = ncbi_url.split('/')[-1]
        return entrez_id
    except IndexError:
        # empty list: no Entrez ID 
        return "NaN"

def fill_dict(gene_list):
    ''' Fill a dict with Ensembl geneid, entrez, symbol
    '''
    conv_dict = {}
    with open(gene_list, 'r') as engid_list:
        for line in engid_list:
            ensgid = line.rstrip()
            gcard_page = gene_card_request(ensgid)
            entrez_id = get_entrez(gcard_page)
            symbol = get_symbol(gcard_page, ensgid)
            conv_dict.setdefault(ensgid, []).extend([entrez_id, symbol])
            time.sleep(5)
    return conv_dict

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Giving a list of valid Ensembl gene ID, query GeneCards to get Entrez and symbols.")
    parser.add_argument("--list", help="Input list of Ensembl gene ID, one per line.")
    parser.add_argument("--gene", help="Single gene mode")
    args = parser.parse_args()

    if args.list is None and args.gene is None:
        parser.print_help()
        exit()

    if args.gene is None:
        res_dict = fill_dict(gene_list=args.list)
        for ensgid in res_dict.keys():
            print('{},{},{}'.format(ensgid, res_dict[ensgid][0], res_dict[ensgid][1]))
    else:
        gcard_page = gene_card_request(args.gene)
        print('{},{},{}'.format(args.gene, get_entrez(gcard_page), get_symbol(gcard_page, args.gene)))

ADD COMMENT • link 16 months ago by Anya • 0

GeneCards added a footer saying:

SCRAPING AND OTHER AUTOMATED DOWNLOAD AND USE OF GENECARDS DATA STRICTLY PROHIBITED

While this is just a crappy script and surely LLM competitors already scraped the whole database, consider using other sources.

ADD REPLY • link 16 months ago by Shred ★ 1.6k

Do you have any recommendations for such sources, or for how to search for them? I'm a newbie and I need to convert GRCh38 Ensembl IDs to Entrez IDs. I thought that such a list should already exist somewhere (it's the human genome after all), but to my surprise I can't find one.

I already tried g:Profiler, BioMart and gget, but they only give me partial results.

ADD REPLY • link 16 months ago by Anya • 0

In this location there is an "entrez" file which maps ensembl IDs to Entrez. It's probably as definitive as you're going to find:

https://ftp.ensembl.org/pub/release-111/tsv/homo_sapiens/
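
For readers who want to turn that file into a lookup table, here is a minimal sketch. It assumes the file is tab-separated with a header row, and that the Ensembl gene ID and the Entrez ID sit in columns named `gene_stable_id` and `xref`; those column names are an assumption to check against the header of the file you actually download. The function name `load_entrez_map` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_entrez_map(handle, gene_col="gene_stable_id", entrez_col="xref"):
    """Map each Ensembl gene ID to the sorted list of Entrez IDs found for it.

    `handle` is an open text handle on a tab-separated file with a header.
    The default column names are assumptions, not confirmed by this thread.
    """
    mapping = defaultdict(set)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        gene_id = row.get(gene_col)
        entrez_id = row.get(entrez_col)
        # Skip rows where either value is missing or empty.
        if gene_id and entrez_id:
            mapping[gene_id].add(entrez_id)
    return {gene: sorted(ids) for gene, ids in mapping.items()}


if __name__ == "__main__":
    # Tiny made-up table in the assumed layout, not real Ensembl data.
    demo = io.StringIO(
        "gene_stable_id\ttranscript_stable_id\txref\n"
        "ENSG_DEMO_1\tENST_DEMO_1\t111\n"
        "ENSG_DEMO_1\tENST_DEMO_2\t111\n"
        "ENSG_DEMO_2\tENST_DEMO_3\t222\n"
    )
    print(load_entrez_map(demo))
```

If the downloaded file is gzip-compressed, `gzip.open(path, "rt")` yields a suitable text handle. Values are kept as lists because one Ensembl gene can carry more than one Entrez cross-reference, and genes absent from the result simply have no mapping in that release.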

ADD REPLY • link 16 months ago by Mike Smith ★ 2.1k

"only give me partial results"

It is possible that not every Ensembl ID is going to convert to Entrez ID.

Have you tried:

Ensembl ID to ENTREZ best converter

ADD REPLY • link 16 months ago by GenoMax 154k

Tried it just recently with almost the same results. 456 out of 496 IDs for protein-coding genes returned as "NA" even though I can see NCBI IDs for these genes on GeneCards T_T

I think I'll try to convert Ensembl IDs to HGNC IDs and then use the HGNC API to convert them further.
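
A rough sketch of what that lookup could look like, assuming the public HGNC REST service at rest.genenames.org exposes a `fetch/ensembl_gene_id/<ID>` endpoint that returns JSON with records under `response.docs`, each carrying `hgnc_id`, `symbol` and `entrez_id` fields. The URL and field names are assumptions to verify against the HGNC documentation, and the function names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint of the HGNC REST service; verify before relying on it.
HGNC_FETCH_URL = "https://rest.genenames.org/fetch/ensembl_gene_id/{}"


def parse_hgnc_response(payload):
    """Extract (hgnc_id, symbol, entrez_id) tuples from a decoded HGNC reply.

    Missing fields come back as None; an empty or unexpected payload
    yields an empty list.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [
        (doc.get("hgnc_id"), doc.get("symbol"), doc.get("entrez_id"))
        for doc in docs
    ]


def fetch_hgnc(ensembl_id, timeout=30):
    """Query HGNC for one Ensembl gene ID and return the parsed records."""
    request = urllib.request.Request(
        HGNC_FETCH_URL.format(ensembl_id),
        headers={"Accept": "application/json"},  # ask for JSON, not XML
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return parse_hgnc_response(json.load(reply))


if __name__ == "__main__":
    # Made-up payload in the assumed shape; no network access needed.
    fake = {"response": {"docs": [
        {"hgnc_id": "HGNC:0", "symbol": "DEMO", "entrez_id": "123"},
    ]}}
    print(parse_hgnc_response(fake))
```

Parsing is kept separate from fetching so the extraction logic can be checked offline. Before looping `fetch_hgnc` over many IDs, check HGNC's usage guidance and add a pause between requests.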

ADD REPLY • link 16 months ago by Anya • 0

Please edit your original post and add this at the top so everyone reads that first. GeneCards politely asks people not to scrape - not respecting that will get them to use more extreme measures so most users will suffer.

ADD REPLY • link 16 months ago by Ram 45k
