Question

Python Bioservices doesn't retrieve the right data from biomart

3

Entering edit mode

10.5 years ago

rs ▴ 40

Hi,

I have a pipeline where I would previously go on ensembl biomart and retrieve data but would now like to automatize a bit more and get my data from biomart directly in a Python script using the Bioservices package. My initial input is a list of RefSeq mRNA identifiers and then I use biomart to retrieve the corresponding ensembl gene identifiers. This works fine. Then I want to give these ensembl identifiers to biomart and to get the corresponding CDS sequences, accompanied by the ensembl gene ID, the ensembl transcript ID and the chromosomal start positions of exons*. At this point, I expect to get back a bit less than 2000 sequences (because this is what I get when I do this manually through my browser). In actuality, I only get 8 sequences and they are accompanied not by ensembl IDs but rather by LRG IDs! The sequences I do get look about right though. If anybody had any idea what was going on here, I would really appreciate your help. I am using Python3.4 on Mac OS X 10.9.5. Here is my code:

input_file_name = "single_exon_pcgenes_refseq2.txt"

from bioservices import *
import re
e = ensembl.Ensembl()

bm = BioMart()
bm.datasets("ensembl")
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
refseq_ids_file = open(input_file_name)
refseq_ids = refseq_ids_file.readlines()
refseq_ids_file.close()
refseq_ids = [re.sub("\n","",i) for I in refseq_ids]
refseq_ids = ",".join(refseq_ids)
bm.add_filter_to_xml("refseq_mrna", refseq_ids)
bm.add_attribute_to_xml("ensembl_gene_id")
xml_query = bm.get_xml()
ens_ids_raw = bm.query(xml_query)

bm.new_query()
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
bm.add_filter_to_xml("ens_hs_gene", ens_ids_raw)
bm.add_attribute_to_xml("ensembl_gene_id")
bm.add_attribute_to_xml("ensembl_transcript_id")
bm.add_attribute_to_xml("exon_chrom_start")
bm.add_attribute_to_xml("coding")
xml_query_seq = bm.get_xml()
raw_sequences = bm.query(xml_query_seq)

This is one of the four sequences contained in raw_sequences at the moment (I deleted the middle part of the sequence to make it easier to read and added newlines):

ATGGAGGGGATCAGTATATACACTTCAGATAACTACACCGAGGAAATGGGCTCAGGGGACTATGACTCCAT
AACCTCTACAGCAGTGTCCTCATCCTGGCCTTCATCAGTCTGGACCGCTACCTGGCCATCGTCCACGCCA
GAGTCTTCAAGTTTTCACTCCAGCTAA    LRG_51    LRG_51t1    5001;7244

*This might seem like a strange way of doing things but I can't ask for the sequences directly in the first step using the RefSeq IDs as input because this would only give me the sequences of the particular transcripts whereas I need the sequences of all the transcripts from each corresponding gene. But I need to filter them first to make sure that each of those genes has at least one single-exon transcript, which is why I need to go through the transcript IDs in the first place (I get those from UCSC).

python biomart bioservices ensembl • 3.5k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by rs ▴ 40

0

Entering edit mode

Kudos to you for phrasing the question so well!

ADD REPLY • link 3.3 years ago by Ram 45k

Ram · Accepted Answer · 2015-02-10

2

Entering edit mode

10.5 years ago

rs ▴ 40

I got it!

Sorry, I should have kept on playing with it for a few more hours before turning to you guys but it was starting to look a bit grim there a while ago.

It turns out I was simply using the wrong filter name. So

bm.add_filter_to_xml("ens_hs_gene", ens_ids_raw)

should have been

bm.add_filter_to_xml("ensembl_gene_id, ens_ids_raw)

Sorry, that was a dumb one!

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by rs ▴ 40

1

Entering edit mode

This is like when we find typos right after hitting Send on an email. Happens to all of us :)

ADD REPLY • link 3.3 years ago by Ram 45k