I have a list of genes, I would like to get the sequences of these genes as well as the start and end positions of the genes on the genome using python, I have tried using Biopython but could only get the sequences for the mRNA transcripts of the genes.
The fastest way I have found is extracting those features directly from a genbank file containing the genes of interest:
#!/usr/bin/env python
from Bio import SeqIO
def list_of_genes():
"""
Return genes list from file
"""
genes = []
with open("./your_gene_list", 'r') as fi:
for line in fi:
line = line.strip().split()
genes.append(line[0])
return genes
def extract_genes():
"""
Given a genbank file it extract the coordinates of genes:
name \t sequence \t start \t end
"""
genes = list_of_genes()
fo = open('./test.txt', 'w')
for rec in SeqIO.parse("./your_gbk_file.gbk", "genbank"):
if rec.features:
for feature in rec.features:
if feature.type == "gene":
if feature.qualifiers["locus_tag"][0] in genes:
seq = feature.extract(rec.seq)
fo.write(feature.qualifiers["locus_tag"][0]+'\t'str(seq)+'\t'+str(feature.location.start.position)+'\t'+str(feature.location.end.position)+'\n')
if __name__ == '__main__':
extract_genes()
There's a biomart python module, you might just use it since all of this is available from biomart.
N.B., I have never used that module and have no clue how well it works, thus why this is a comment rather than an answer.
I would suggest you to include the header of such a list if genes. This will facilitate the answer
Hi Antonio,
The headers for the gene list are just the Entrez Gene ID and gene symbol,
I have figured out how to get the sequence for the gene but now I need to get the start and end locations on the genome for these sequences