Extact specific sequence from a fasta file into multiple files using ID and EC from a different file

0

Entering edit mode

8.4 years ago

trini1god • 0

I have two files: The first file is a .txt file that contains 3 columns but i am making use of column 2 and 3. The second file is a .fasta file that contain the sequences. Using python, I want to make each column 3 the file name and based on it compare the IDs in file1 and file2 and then use Biopython to write the sequences to the file I made(column 3)

file_1.txt:
009L_FRG3G  **Q6GZW6    3.6.4.-**
019R_FRG3G  **Q6GZV6    2.7.11.1**
044L_IIV3           Q197B6           2.7.11.-
055L_FRG3G  **Q6GZS1    3.6.4.-**
080R_IIV3       Q196Y0  3.6.1.-
088R_FRG3G   Q6GZN7         1.8.3.2
095L_IIV3   Q196W5           3.4.24.- ...

file_2.fasta

>sp|**Q6GZW6**|009L_FRG3G Putative helicase 009L OS=Frog virus 3
MDTSPYDFLKLYPWLSRGEADKGTLLDAFPGETFEQSLASDVAMRRAVQDDPAFGHQKLV
ETFLSEDTPYRELLLFHAPGTGKTCTVVSVAERAKEKGLTRGCIVLARGAALLRNFLHEL
>sp|    Q197B6|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
>sp|**Q6GZV6**|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3 
MEEVDGCAGPNSEAGALTAGALTAGAFAVTAGAGVAGAGVAGVGWCSWCSWCSWCWCSWC
SWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCLSKGWEDRGGLEGCKSCKGWCL
>sp|**Q6GZS1**|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3
MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKITLPMGPQNAGDVTNKAYVDQA
VMSAAVPVASSTTVGTIQMAGDLEGSSGTNPIIAANKITLNKLQKIGPKMVIGNPNSDWN
...

My expected Output: Is to have a multiple files with the sequences that have the same EC

3.6.4.-.fasta

>sp|**Q6GZW6**|009L_FRG3G Putative helicase 009L OS=Frog virus 3
MDTSPYDFLKLYPWLSRGEADKGTLLDAFPGETFEQSLASDVAMRRAVQDDPAFGHQKLV
ETFLSEDTPYRELLLFHAPGTGKTCTVVSVAERAKEKGLTRGCIVLARGAALLRNFLHEL
>sp|**Q6GZS1**|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3 
MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKITLPMGPQNAGDVTNKAYVDQA
VMSAAVPVASSTTVGTIQMAGDLEGSSGTNPIIAANKITLNKLQKIGPKMVIGNPNSDWN
**2.7.11.1.fasta**
>sp|Q6GZV6|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3 
MEEVDGCAGPNSEAGALTAGALTAGAFAVTAGAGVAGAGVAGVGWCSWCSWCSWCWCSWC
SWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCLSKGWEDRGGLEGCKSCKGWCL
ETC...

my problem is slightly complex to this solution: https://stackoverflow.com/questions/15352219/extract-sequences-from-a-fasta-file-based-on-entries-in-a-separate-file which only outputs in a single file.

My code so far:

`#!/usr/bin/python3
import os
from Bio import SeqIO

def get_accession(record):
        """given a seq_record, return the accession number as a         string $
        """
        parts = record.id.split("|")
        assert len(parts) == 3 and parts[0] == "sp"
        return parts[1]
records_dict = SeqIO.to_dict(SeqIO.parse("file_2", "fasta"), key_function=get_accession)

#intailize a dictionary
answer = {}

with open('file_1', 'r') as content:
#extracts AC1, EC from ID_AC.txt and makes it a dictionary
    for line in content:
        lines = line.split()
        answer[lines[1]] = lines[2]
#does the comparism and writing to the file here
records = SeqIO.parse("file_2.fasta", "fasta")
for seq_record in records:
    for key in records_dict:       #satisfies the condition that all key in file1 is in file2
        if key in answer: 
            EC = answer[key]
            eachEC = "".join(eachEC for eachEC in EC if eachEC.isalnum() or eachEC in ['','.', '-']).rstrip() + ".fasta"   #converts eachEC into a file name
            mode = 'a' if os.path.exists(eachEC) else 'w'
            if eachEC:
                with open(eachEC, mode) as fileinput:
                    fileinput.write(seq_record.format("fasta").strip())
                    fileinput.write(str(seq_record.seq) + "\n")`

Problem The problem with my script is that is creates the multiple files but copies the whole sequence in file_2 into them. Thanks. Am new to python

python biopython sequence • 4.1k views

ADD COMMENT • link updated 8.4 years ago by Matt Shirley 10k • written 8.4 years ago by trini1god • 0

0

Entering edit mode

You want to put identical col 2 ids sequences from your fasta to a new fasta named by column 3? Your file2 fasta is also missing '>' for headers. If you can better format your question, we can help.

Please edit your code with the 101010 button.

ADD REPLY • link 8.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

8.4 years ago

st.ph.n ★ 2.7k

Linearize your fasta from multi-line to single-line. I almost always do this to make parsing easier. This is my proposed solution, from what I understood from your question.

#!/usr/bin/env python
from collections import defaultdict

#open file1, using col3 as ids and col2 appending to defaultdict values
with open('file_1.txt', 'r') as f:
    ids = defaultdict(list)   
    for line in f:
        ids[line.strip().split('\t')[1]].append(line.strip().split('\t')[2])

# open file2 fasta, put sequences into dictionary with col2 ids as keys
with open('file_2.fasta', 'r') as fasta:
    seqs = {}
    for line in fasta:
        if line.startswith('>'):
            seqs[line.strip().split('|')[1]] = next(fasta).strip()

# for each id (used as new filename), print out matching col2 ids and sequences to new file 
for i in ids:
    with open(i + '.fasta', 'w') as out:
        for n in range(len(ids[i])):
            out.write('>' + ids[i][n] + '\n' + seqs[ids[i][n]])

ADD COMMENT • link 8.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thanks for your response, but your code doesn't work for me. it gives a KeyError message. can you please explain a little bit the last part of the code? When I printed out the "seqs" part, I found out that the sequences were not complete, is it something you can solve? Please can you use biopython to write to the file?

ADD REPLY • link 8.4 years ago by trini1god • 0

0

Entering edit mode

Can you post the error message?

Make sure file_1.txt is tab-delimited, and doesn't have spaces.

When I printed out the "seqs" part, I found out that the sequences were not complete

Did you linearize your fasta?

Is this homework? If so, and you were instructed to use Biopython, you need to look at the SeqIO module.

ADD REPLY • link 8.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

8.4 years ago

Matt Shirley 10k

If you really want to use Biopython feel free, but here's a solution using pyfaidx, which saves you the trouble of reading the entire file.

	from pyfaidx import Fasta

	with Fasta("file_2.fasta") as records:
	with open("file_1") as content:
	for line in content:
	_, ec, filename = line.rstrip().split()
	with open(filename, "w") as out_file:
	for record in records:
	if ec in record.name:
	out_file.write(repr(record))

view raw answer.py hosted with ❤ by GitHub

ADD COMMENT • link 8.4 years ago by Matt Shirley 10k

0

Entering edit mode

Thanks @Matt Shirley for modifying my question. And thank you for your answer but i really want to use biopython it was requested for the job.

ADD REPLY • link 8.4 years ago by trini1god • 0

Login before adding your answer.