Hi,
I have a pipeline where I would previously go on ensembl biomart and retrieve data but would now like to automatize a bit more and get my data from biomart directly in a Python script using the Bioservices package. My initial input is a list of RefSeq mRNA identifiers and then I use biomart to retrieve the corresponding ensembl gene identifiers. This works fine. Then I want to give these ensembl identifiers to biomart and to get the corresponding CDS sequences, accompanied by the ensembl gene ID, the ensembl transcript ID and the chromosomal start positions of exons*. At this point, I expect to get back a bit less than 2000 sequences (because this is what I get when I do this manually through my browser). In actuality, I only get 8 sequences and they are accompanied not by ensembl IDs but rather by LRG IDs! The sequences I do get look about right though. If anybody had any idea what was going on here, I would really appreciate your help. I am using Python3.4 on Mac OS X 10.9.5. Here is my code:
input_file_name = "single_exon_pcgenes_refseq2.txt"
from bioservices import *
import re
e = ensembl.Ensembl()
bm = BioMart()
bm.datasets("ensembl")
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
refseq_ids_file = open(input_file_name)
refseq_ids = refseq_ids_file.readlines()
refseq_ids_file.close()
refseq_ids = [re.sub("\n","",i) for I in refseq_ids]
refseq_ids = ",".join(refseq_ids)
bm.add_filter_to_xml("refseq_mrna", refseq_ids)
bm.add_attribute_to_xml("ensembl_gene_id")
xml_query = bm.get_xml()
ens_ids_raw = bm.query(xml_query)
bm.new_query()
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
bm.add_filter_to_xml("ens_hs_gene", ens_ids_raw)
bm.add_attribute_to_xml("ensembl_gene_id")
bm.add_attribute_to_xml("ensembl_transcript_id")
bm.add_attribute_to_xml("exon_chrom_start")
bm.add_attribute_to_xml("coding")
xml_query_seq = bm.get_xml()
raw_sequences = bm.query(xml_query_seq)
This is one of the four sequences contained in raw_sequences
at the moment (I deleted the middle part of the sequence to make it easier to read and added newlines):
ATGGAGGGGATCAGTATATACACTTCAGATAACTACACCGAGGAAATGGGCTCAGGGGACTATGACTCCAT
AACCTCTACAGCAGTGTCCTCATCCTGGCCTTCATCAGTCTGGACCGCTACCTGGCCATCGTCCACGCCA
GAGTCTTCAAGTTTTCACTCCAGCTAA LRG_51 LRG_51t1 5001;7244
*This might seem like a strange way of doing things but I can't ask for the sequences directly in the first step using the RefSeq IDs as input because this would only give me the sequences of the particular transcripts whereas I need the sequences of all the transcripts from each corresponding gene. But I need to filter them first to make sure that each of those genes has at least one single-exon transcript, which is why I need to go through the transcript IDs in the first place (I get those from UCSC).
Kudos to you for phrasing the question so well!