Question

How can I retrieve the sequence of a gene and the start and end genomic position using Python

0

Entering edit mode

9.8 years ago

doncheng1 • 0

I have a list of genes, I would like to get the sequences of these genes as well as the start and end positions of the genes on the genome using python, I have tried using Biopython but could only get the sequences for the mRNA transcripts of the genes.

Thanks

Sequence Genome gene Position • 7.0k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.8 years ago by doncheng1 • 0

1

Entering edit mode

There's a biomart python module, you might just use it since all of this is available from biomart.

N.B., I have never used that module and have no clue how well it works, thus why this is a comment rather than an answer.

ADD REPLY • link 9.8 years ago by Devon Ryan 105k

0

Entering edit mode

I would suggest you to include the header of such a list if genes. This will facilitate the answer

ADD REPLY • link 9.8 years ago by Antonio R. Franco ★ 5.2k

0

Entering edit mode

Hi Antonio,

The headers for the gene list are just the Entrez Gene ID and gene symbol,

I have figured out how to get the sequence for the gene but now I need to get the start and end locations on the genome for these sequences

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.8 years ago by doncheng1 • 0

Ram · Answer 1 · 2015-09-14

The fastest way I have found is extracting those features directly from a genbank file containing the genes of interest:

#!/usr/bin/env python

from Bio import SeqIO

def list_of_genes():
    """
    Return genes list from file
    """

    genes = []
    with open("./your_gene_list", 'r') as fi:
         for line in fi:
             line = line.strip().split()
             genes.append(line[0])

    return genes

def extract_genes():
    """
    Given a genbank file it extract the coordinates of genes:
    name \t sequence \t start \t end
    """

    genes = list_of_genes()

    fo = open('./test.txt', 'w')

    for rec in SeqIO.parse("./your_gbk_file.gbk", "genbank"):
        if rec.features:
            for feature in rec.features:
                if feature.type == "gene":
                    if feature.qualifiers["locus_tag"][0] in genes:
                        seq = feature.extract(rec.seq)
                        fo.write(feature.qualifiers["locus_tag"][0]+'\t'str(seq)+'\t'+str(feature.location.start.position)+'\t'+str(feature.location.end.position)+'\n')

if __name__ == '__main__':
    extract_genes()