Master dictionary of database file (2D dictionary) using Biopython, then specify key with string
0
0
Entering edit mode
7.0 years ago

My goal is to create a 2d dictionary, search for some sequence ids, and then writes the organism name with the amino acid sequence to a file

I have a working code that creates one dictionary, and looks for the id numbers that I want, and writes it to a file. I am unable to iterate through all my files, but it only returns the id number. I was looking to return the organism name as the key.

I have seen many examples on how to parse a single file as a dictionary to retrieve a dictionary as id:seq. Another example I seen seems to turn the header into a string, then split, but I am unsuccessful. Some of my headers have commas, and some do not. The examples I seen were splitting on ("|")

>EFE00375.1 S-adenosylmethionine-dependent methyltransferase, YraL family [Lactobacillus crispatus 214-1]

My command python master_lacto_dict.py L_214.txt P_1_Results.txt P_1_Clustal.txt

import sys
from Bio import SeqIO


aa_db_file = sys.argv[1]  # Amino Acid Database file ~ 17 files
accession_id_file = sys.argv[2] # Accession IDs  file ~ 18 accession id numbers
file_for_clustal = sys.argv[3] # Output fasta file

wanted = set()
with open(accession_id_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)


fasta_database = SeqIO.parse(open(aa_db_file),'fasta')
#fasta_database = Seq.IO.index("file_name", "fasta")  Also seen this in many examples



with open(file_for_clustal, "w") as f:
    for seq in fasta_database:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

#Desired output
#crispatus 214-1:seq
Python Fasta dictionary • 2.4k views
ADD COMMENT
1
Entering edit mode

Hi, could you please specify your concrete question here (it seems there are a couple of them)?

Besides, as your question is rather of a "programming" nature, I would recommend to search for your problem on Google and especially at Stackoverflow (kind of biostars for programmers). From my experience, there is no question that has not yet been asked there, so if you can specify your concrete problem you should find a solution there.

Best,

Cindy

ADD REPLY
1
Entering edit mode

Well, if the desired information is always hidden inside these brackets and you can make sure that these are the only occurrences of those brackets, why don't you search for the occurrences of "[" and "]" to retrieve start and end index, and then get the corresponding substring of rec.description?

ADD REPLY
1
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY
0
Entering edit mode

Hello,

I know have a for loop that seems to work

fasta_files = "L_214.txt","L_224-1.txt" # two files that are used as my database(s)
recs = [rec for f in fasta_files for rec in SeqIO.parse(f, 'fasta')]
print recs

Output:

id='EFB61523.1', name='EFB61523.1', description='EFB61523.1 cyclic nucleotide-binding domain protein, partial [Lactobacillus gasseri 224-1]', dbxrefs=[]), SeqRecord(seq=Seq('MLLVKILKNYYQAVFNKNADEVKVISDVLEKAGWKDFVSVLPHLNS', SingleLetterAlphabet())

How can I parse the recs? I want what is inside those brackets?

Lactobacillus gasseri 224-1:SeqRecord
ADD REPLY
1
Entering edit mode
import re

d = {}
fasta_files = "L_214.txt","L_224-1.txt"
for f in fasta_files:
    for rec in SeqIO.parse(f, 'fasta'):
        desc = rec.description
        if '[' in desc:
            organism = re.findall('\[([^\]]+)', desc)[0]
            d[organism] = rec
print(d)
ADD REPLY
0
Entering edit mode

the re module is very useful in this case. Now I am curious how to iterates through my results files, which is basically a handful of text files, with one column of data.

How could I iterate through these files and check my master dictionary??

EEQ68960.1
EEQ26037.1
EFE00375.1
EFB62114.1
EEQ24341.1
EEJ40899.1
EEW51706.1
ADD REPLY
0
Entering edit mode

I was thinking about the glob function, but maybe there is an easier method

path = '/Users/sueparks/' + 'P_*' + '_Results.txt'
for file in glob.glob(path):
    with open(file, 'r') as f:
        line = f.readlines()
        print line
ADD REPLY

Login before adding your answer.

Traffic: 2756 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6