I have a list of genes, I would like to get the sequences of these genes as well as the start and end positions of the genes on the genome using python, I have tried using Biopython but could only get the sequences for the mRNA transcripts of the genes.
The fastest way I have found is extracting those features directly from a genbank file containing the genes of interest:
#!/usr/bin/env pythonfrom Bio import SeqIO
deflist_of_genes():"""
Return genes list from file
"""
genes =[]withopen("./your_gene_list",'r')as fi:for line in fi:
line = line.strip().split()
genes.append(line[0])return genes
defextract_genes():"""
Given a genbank file it extract the coordinates of genes:
name \t sequence \t start \t end
"""
genes = list_of_genes()
fo =open('./test.txt','w')for rec in SeqIO.parse("./your_gbk_file.gbk","genbank"):if rec.features:for feature in rec.features:if feature.type=="gene":if feature.qualifiers["locus_tag"][0]in genes:
seq = feature.extract(rec.seq)
fo.write(feature.qualifiers["locus_tag"][0]+'\t'str(seq)+'\t'+str(feature.location.start.position)+'\t'+str(feature.location.end.position)+'\n')if __name__ =='__main__':
extract_genes()
There's a biomart python module, you might just use it since all of this is available from biomart.
N.B., I have never used that module and have no clue how well it works, thus why this is a comment rather than an answer.
I would suggest you to include the header of such a list if genes. This will facilitate the answer
Hi Antonio,
The headers for the gene list are just the Entrez Gene ID and gene symbol,
I have figured out how to get the sequence for the gene but now I need to get the start and end locations on the genome for these sequences