Question

Import All Protein Sequences In A Chromosome

13.3 years ago

User 3206 • 0

Hi guys, I have managed to write the Python code below, which reads a file of protein IDs and imports their sequences from GenBank. I now want to write a script that imports all the protein sequences on a given chromosome, since I don't have all their IDs. In other words, given only the chromosome number, it should import every protein sequence on that chromosome.

Any suggestions would be appreciated!

Thank you

import numpy as np
from Bio import Entrez, SeqIO

Entrez.email = 'me@uga.edu'

# Read the protein IDs from the first comma-separated column of the file.
z = np.genfromtxt(r'C:\Users\Mohammed\Desktop\ProteinIDs.txt', dtype=str, delimiter=',', usecols=[0])

for i in range(500):
    prot = z[i]
    print(prot)
    # Pass the variable itself, not the literal string "prot", as the ID.
    handle = Entrez.efetch(db="protein", id=prot, rettype="fasta", retmode="text")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    # Write each record to its own FASTA file, named after the protein ID.
    with open(r'C:\Users\Mohammed\Desktop\protein_seqs\%s.txt' % prot, 'w') as f:
        SeqIO.write(record, f, "fasta")
    print(record)

biopython • 3.0k views

updated 13.3 years ago by Alex ★ 1.5k • written 13.3 years ago by User 3206 • 0

Can you indent your code? Put 4 spaces in front of it (this will make it a code block), and make sure that the for loops are indented correctly. This makes it easier for us to evaluate your code.

13.3 years ago by Niek De Klein ★ 2.6k

score 2 · Answer 1 · 2012-03-12

Take a look at the following code, which sorts proteins by chromosome. You can get the list of all proteins from, for example, NCBI's FTP site.

from Bio import Entrez , SeqIO

protein_ids = [61677879, 61677880, 61677881, 60637879, 60637579]

Entrez.email = 'me@uga.edu'

chr_to_proteins = {"unknown":[]}

for pid in protein_ids:
    # Fetch the full GenBank record so that the feature annotations are available.
    handle = Entrez.efetch(db="protein", id=pid, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "gb")
    handle.close()
    # Use the first feature that carries a "chromosome" qualifier, if any.
    chrom = "unknown"
    for feature in record.features:
        if "chromosome" in feature.qualifiers:
            chrom = feature.qualifiers["chromosome"][0]
            break
    chr_to_proteins.setdefault(chrom, []).append(record)

# Report how many proteins were found on each chromosome.
for chrom, proteins in chr_to_proteins.items():
    print(chrom, len(proteins))
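
To get only the proteins on one chromosome, the grouping step can be pulled out of the download loop and checked offline. This is a rough sketch, not a tested pipeline: `chromosome_of`, `group_by_chromosome` and `fake_record` are hypothetical names, and the stand-in objects with made-up IDs only imitate the `.features` and `.qualifiers` attributes that the code above reads from a Biopython record.

```python
from collections import defaultdict
from types import SimpleNamespace


def chromosome_of(record, default="unknown"):
    """Return the first "chromosome" qualifier found on the record's features."""
    for feature in record.features:
        values = feature.qualifiers.get("chromosome")
        if values:
            return values[0]
    return default


def group_by_chromosome(records):
    """Map each chromosome name to the list of records annotated with it."""
    groups = defaultdict(list)
    for record in records:
        groups[chromosome_of(record)].append(record)
    return dict(groups)


def fake_record(name, chromosome=None):
    """Build a stand-in object (made-up data) with just the attributes used above."""
    qualifiers = {"chromosome": [chromosome]} if chromosome else {}
    return SimpleNamespace(id=name, features=[SimpleNamespace(qualifiers=qualifiers)])


records = [
    fake_record("p1", "7"),
    fake_record("p2", "X"),
    fake_record("p3"),          # no chromosome annotation
    fake_record("p4", "7"),
]
by_chrom = group_by_chromosome(records)

# Pick out the proteins on a single chromosome of interest.
on_chr7 = [r.id for r in by_chrom.get("7", [])]
print(on_chr7)  # ['p1', 'p4']
```

With real records read via `SeqIO.read(handle, "gb")`, the same function should apply unchanged, and something like `SeqIO.write(by_chrom["7"], "chr7.fasta", "fasta")` (the file name is only an example) would then save that chromosome's proteins to one FASTA file.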