Question

How to use entrezpy and Biopython Entrez libraries to access ClinVar data from genomic position of variant

1

4.4 years ago

Damianos P. Melidis ▴ 40

For one of my PhD projects, I would like to retrieve all variants in the ClinVar database that lie at the same genomic position as the variant in each row of an input GSvar file. The language constraint is Python.

So far I have used the entrezpy module entrezpy.esearch.esearcher. For more on entrezpy, see: https://entrezpy.readthedocs.io/en/master/

I followed this guide from the entrezpy docs (https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html) to retrieve UIDs using the genomic position of a variant. In code:

    # first get UIDs for ClinVar records at the same position
    # credits: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html
    # (requires: import entrezpy.esearch.esearcher)
    chrom = variants["chr"].split("chr")[1]  # "chr7" -> "7"; avoids shadowing the built-in chr()
    start, end = str(variants["start"]), str(variants["end"])

    es = entrezpy.esearch.esearcher.Esearcher('esearcher', self.entrez_email)
    genomic_pos = chrom + "[chr]" + " AND " + start + ":" + end  # + "[chrpos37]"
    entrez_query = es.inquire(
        {'db': 'clinvar',
         'term': genomic_pos,
         'retmax': 100000,
         'retstart': 0,
         'rettype': 'uilist'})  # 'usehistory': False
    entrez_uids = entrez_query.get_result().uids

Then I used the Entrez module from Biopython to fetch the available ClinVar records:

    # process each VariationArchive of each UID
    # (requires: from Bio import Entrez; import xml.etree.ElementTree as ET)
    handle = Entrez.efetch(db='clinvar', id=current_entrez_uids, rettype='vcv')
    clinvar_records = {}
    tree = ET.parse(handle)
    root = tree.getroot()
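
As a rough illustration of the "process each VariationArchive" step, the sketch below walks a parsed tree and collects the attributes of every VariationArchive element into a dictionary. The sample XML is hand-written and heavily simplified, and the attribute names (VariationID, Accession, VariationName) are assumptions that should be checked against a real efetch response; the helper name collect_variation_archives is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for an efetch rettype='vcv' response.
# Element and attribute names are assumptions; verify them on real output.
SAMPLE_VCV_XML = """<ClinVarResult-Set>
  <VariationArchive VariationID="1001" Accession="VCV000001001" VariationName="example variant A"/>
  <VariationArchive VariationID="1002" Accession="VCV000001002" VariationName="example variant B"/>
</ClinVarResult-Set>"""


def collect_variation_archives(root):
    """Map each VariationArchive's VariationID to a copy of its XML attributes."""
    records = {}
    for archive in root.iter("VariationArchive"):
        variation_id = archive.get("VariationID")
        if variation_id is None:
            continue  # skip elements without an ID instead of guessing one
        records[variation_id] = dict(archive.attrib)
    return records


root = ET.fromstring(SAMPLE_VCV_XML)
clinvar_records = collect_variation_archives(root)
print(sorted(clinvar_records))
```

With a live handle, the same helper could be applied to ET.parse(handle).getroot() instead of the sample string.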

This approach works, but it has two main drawbacks:

- entrezpy fills up my log file by recording every interaction with Entrez, which makes the log file too big to be read by my hospital collaborator, who is a variant curator.
- The entrezpy call entrez_query.get_result().uids returns all UIDs retrieved so far across all requests (one request per variant in the GSvar file), which is space-inefficient: the entrez_uids list grows quickly as I process all variants of a GSvar file. The simple solution I have implemented is to check which UIDs are new in the current request and to pass only those to Entrez.efetch(). However, I still need to keep all UIDs seen for previous variants in order to know which ones are new. In code:
```
# the first lines of the first snippet go here ...
entrez_uids = entrez_query.get_result().uids
# keep only the UIDs that were not seen for earlier variants
current_entrez_uids = [uid for uid in entrez_uids if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file += current_entrez_uids
```
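
One possible refinement of this bookkeeping, sketched under the assumption that UIDs are hashable values such as ints or strings: keeping the seen UIDs in a set makes each membership test constant time on average, whereas `uid not in some_list` scans the whole list. This does not remove the need to remember the seen UIDs; it only makes the lookup cheaper. The class name SeenUidTracker is hypothetical.

```python
class SeenUidTracker:
    """Remember UIDs across queries and report only the unseen ones."""

    def __init__(self):
        self._seen = set()

    def new_uids(self, uids):
        """Return the UIDs not seen before, in input order, and record them."""
        fresh = []
        for uid in uids:
            if uid not in self._seen:
                self._seen.add(uid)
                fresh.append(uid)
        return fresh


tracker = SeenUidTracker()
# First variant: everything is new.
first = tracker.new_uids([11, 12, 13])
# Second variant: the cumulative list repeats the earlier UIDs.
second = tracker.new_uids([11, 12, 13, 14, 15])
print(first, second)
```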

Does anyone have suggestions on how to address these two drawbacks?

entrezpy ClinVar Python3 BioPython Entrez • 3.4k views

ADD COMMENT • link 4.4 years ago by Damianos P. Melidis ▴ 40

0

Posted at StackOverflow.

ADD REPLY • link 4.3 years ago by zx8754 12k

0

Yes, three days ago I also posted on Stack Overflow, hoping to get as much input and feedback as I could.

ADD REPLY • link 4.3 years ago by Damianos P. Melidis ▴ 40

0

Why do you combine entrezpy and Biopython? According to its documentation, entrezpy is well equipped for chaining commands, so esearch | efetch should be a piece of cake (https://entrezpy.readthedocs.io/en/master/tutorials/conduit/pipeline.html). There is no need to fetch the UIDs and download the records separately.

Could you provide an example of, e.g., two genomic positions for which the returned UIDs overlap?

ADD REPLY • link 4.3 years ago by massa.kassa.sc3na ▴ 630

0

Thank you for the help. I combine them because entrezpy is developed mostly by one person, whereas Biopython is a community project, so its development is much more stable. I would prefer to use tools that will continue to be developed.

That said, I could not find any other way to get the UIDs, so I used entrezpy. However, as I explained, entrezpy returns the UIDs for a whole batch of genomic positions: if you request the UIDs for two genomic positions, it returns the UIDs for both positions. This is the issue I am discussing in my post (please see the second bullet in the last part of it); the UIDs of different genomic positions do not overlap.

I will check the pipeline page. However, running entrezpy also produces a very large log file, which makes my current log file practically unusable. That is why I was looking for an alternative to entrezpy and made this post.

ADD REPLY • link 4.3 years ago by Damianos P. Melidis ▴ 40

1

Hi, as far as I know, Biopython's Entrez module and entrezpy use the same NCBI API (https://www.ncbi.nlm.nih.gov/books/NBK25500/), so I would choose one of them. It seems to me that entrezpy implements a lot of the Entrez Direct behaviour (and that is much easier to use than Biopython's Entrez module).

The UIDs should be available from Biopython's Entrez module as well (but you would need to parse the output yourself).
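
A minimal sketch of getting the UIDs with Biopython alone, assuming Biopython is installed and the network is reachable. Bio.Entrez.esearch and Bio.Entrez.read are existing Biopython calls; the helper names build_clinvar_term and search_clinvar_uids are hypothetical, the coordinates are arbitrary, and the [chrpos37] position field is an assumption taken from the commented-out part of the question's query that should be checked against the ClinVar search-field documentation. Each call returns only the UIDs of its own query.

```python
def build_clinvar_term(chrom, start, end, position_field="chrpos37"):
    """Build an Entrez query term for a chromosome and a position range."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # "chr7" -> "7"
    return f"{chrom}[chr] AND {start}:{end}[{position_field}]"


def search_clinvar_uids(term, email, retmax=500):
    """Run one esearch against ClinVar and return this query's UIDs only."""
    # Imported lazily so the term builder works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="clinvar", term=term, retmax=retmax)
    try:
        result = Entrez.read(handle)  # parses the esearch XML
    finally:
        handle.close()
    return list(result["IdList"])


term = build_clinvar_term("chr7", 140453136, 140453136)
print(term)
# uids = search_clinvar_uids(term, "you@example.org")  # needs network access
```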

ad overlapping UIDs) Sorry, I misunderstood your question. Several things could be happening (results stored in the entrezpy object, use of the Entrez history server, ..), so a minimal example would be needed.

I'm not sure exactly what you mean by "log files" produced by entrezpy. (Are these the output of the logging module, or some result files?) I briefly checked the entrezpy sources, and there seems to be a hardcoded logging.DEBUG in esearcher.py. You could set the logging level to WARNING (less output) and/or disable logging to file (so that no output file is produced).
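
A sketch of the "set the level to WARNING" idea using only the standard logging module. It assumes that the library's loggers carry names starting with "entrezpy", which has not been verified against the entrezpy sources; the logger name in the demo is a hypothetical stand-in, and quiet_loggers is a hypothetical helper. If the library sets DEBUG when an object is constructed, the helper would have to be called after construction.

```python
import logging


def quiet_loggers(prefix, level=logging.WARNING):
    """Set `level` on every existing logger named `prefix` or `prefix.*`."""
    changed = []
    for name, obj in logging.root.manager.loggerDict.items():
        # loggerDict also contains PlaceHolder entries, which have no level
        if not isinstance(obj, logging.Logger):
            continue
        if name == prefix or name.startswith(prefix + "."):
            obj.setLevel(level)
            changed.append(name)
    return sorted(changed)


# Hypothetical stand-in for a library logger that was forced to DEBUG.
noisy = logging.getLogger("entrezpy.esearch.esearcher")
noisy.setLevel(logging.DEBUG)

changed = quiet_loggers("entrezpy")
print(changed)
```

Loggers created after the call are not affected, so this only helps for loggers that already exist.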

ADD REPLY • link 4.3 years ago by massa.kassa.sc3na ▴ 630

0

Thank you for your reply.

I see, so I will need to think about which of the two libraries to use.

overlapping UIDs) Yes, the entrezpy library keeps all fetched UIDs in its history, so I always need to filter out the new ones. My concern is what happens when that list becomes too long. The entrezpy submodule I use, esearch, does have a parameter to exclude the UIDs from the history. I set the flag to False, but I still got all UIDs from the history. In code:

entrez_query = es.inquire(
        {'db': 'clinvar',
         'term': genomic_pos,
         'retmax': 100000,
         'retstart': 0,
         'rettype': 'uilist'})  # 'usehistory': False

log files) Sure, I mean the log file produced when running the code. I tried to disable logging when calling entrezpy.esearch.esearcher.Esearcher(). I also changed the logging levels so that entrezpy's level ranks lower than my own code's logging, but entrezpy still wrote to the log file. I believe this is a current problem in the library, as there is an issue about logging in the entrezpy git repository: entrezpy_a_logging_issue

ADD REPLY • link 4.3 years ago by Damianos P. Melidis ▴ 40