Python Bioservices doesn't retrieve the right data from biomart
1
3
Entering edit mode
9.9 years ago
rs ▴ 40

Hi,

I have a pipeline where I would previously go on ensembl biomart and retrieve data but would now like to automatize a bit more and get my data from biomart directly in a Python script using the Bioservices package. My initial input is a list of RefSeq mRNA identifiers and then I use biomart to retrieve the corresponding ensembl gene identifiers. This works fine. Then I want to give these ensembl identifiers to biomart and to get the corresponding CDS sequences, accompanied by the ensembl gene ID, the ensembl transcript ID and the chromosomal start positions of exons*. At this point, I expect to get back a bit less than 2000 sequences (because this is what I get when I do this manually through my browser). In actuality, I only get 8 sequences and they are accompanied not by ensembl IDs but rather by LRG IDs! The sequences I do get look about right though. If anybody had any idea what was going on here, I would really appreciate your help. I am using Python3.4 on Mac OS X 10.9.5. Here is my code:

input_file_name = "single_exon_pcgenes_refseq2.txt"

from bioservices import *
import re
e = ensembl.Ensembl()

bm = BioMart()
bm.datasets("ensembl")
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
refseq_ids_file = open(input_file_name)
refseq_ids = refseq_ids_file.readlines()
refseq_ids_file.close()
refseq_ids = [re.sub("\n","",i) for I in refseq_ids]
refseq_ids = ",".join(refseq_ids)
bm.add_filter_to_xml("refseq_mrna", refseq_ids)
bm.add_attribute_to_xml("ensembl_gene_id")
xml_query = bm.get_xml()
ens_ids_raw = bm.query(xml_query)

bm.new_query()
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
bm.add_filter_to_xml("ens_hs_gene", ens_ids_raw)
bm.add_attribute_to_xml("ensembl_gene_id")
bm.add_attribute_to_xml("ensembl_transcript_id")
bm.add_attribute_to_xml("exon_chrom_start")
bm.add_attribute_to_xml("coding")
xml_query_seq = bm.get_xml()
raw_sequences = bm.query(xml_query_seq)

This is one of the four sequences contained in raw_sequences at the moment (I deleted the middle part of the sequence to make it easier to read and added newlines):

ATGGAGGGGATCAGTATATACACTTCAGATAACTACACCGAGGAAATGGGCTCAGGGGACTATGACTCCAT
AACCTCTACAGCAGTGTCCTCATCCTGGCCTTCATCAGTCTGGACCGCTACCTGGCCATCGTCCACGCCA
GAGTCTTCAAGTTTTCACTCCAGCTAA    LRG_51    LRG_51t1    5001;7244

*This might seem like a strange way of doing things but I can't ask for the sequences directly in the first step using the RefSeq IDs as input because this would only give me the sequences of the particular transcripts whereas I need the sequences of all the transcripts from each corresponding gene. But I need to filter them first to make sure that each of those genes has at least one single-exon transcript, which is why I need to go through the transcript IDs in the first place (I get those from UCSC).

python biomart bioservices ensembl • 3.2k views
ADD COMMENT
0
Entering edit mode

Kudos to you for phrasing the question so well!

ADD REPLY
2
Entering edit mode
9.9 years ago
rs ▴ 40

I got it!

Sorry, I should have kept on playing with it for a few more hours before turning to you guys but it was starting to look a bit grim there a while ago.

It turns out I was simply using the wrong filter name. So

bm.add_filter_to_xml("ens_hs_gene", ens_ids_raw)

should have been

bm.add_filter_to_xml("ensembl_gene_id, ens_ids_raw)

Sorry, that was a dumb one!

ADD COMMENT
1
Entering edit mode

This is like when we find typos right after hitting Send on an email. Happens to all of us :)

ADD REPLY

Login before adding your answer.

Traffic: 1035 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6