I have two files. The first is a .txt file with three columns, of which I only use columns 2 and 3. The second is a .fasta file that contains the sequences. Using Python, I want to use each value in column 3 as a file name, match the IDs in column 2 of file 1 against the IDs in file 2, and then use Biopython to write the matching sequences to the file named after column 3.
file_1.txt:
009L_FRG3G **Q6GZW6 3.6.4.-**
019R_FRG3G **Q6GZV6 2.7.11.1**
044L_IIV3 Q197B6 2.7.11.-
055L_FRG3G **Q6GZS1 3.6.4.-**
080R_IIV3 Q196Y0 3.6.1.-
088R_FRG3G Q6GZN7 1.8.3.2
095L_IIV3 Q196W5 3.4.24.- ...
file_2.fasta
>sp|**Q6GZW6**|009L_FRG3G Putative helicase 009L OS=Frog virus 3
MDTSPYDFLKLYPWLSRGEADKGTLLDAFPGETFEQSLASDVAMRRAVQDDPAFGHQKLV
ETFLSEDTPYRELLLFHAPGTGKTCTVVSVAERAKEKGLTRGCIVLARGAALLRNFLHEL
>sp| Q197B6|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
>sp|**Q6GZV6**|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3
MEEVDGCAGPNSEAGALTAGALTAGAFAVTAGAGVAGAGVAGVGWCSWCSWCSWCWCSWC
SWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCLSKGWEDRGGLEGCKSCKGWCL
>sp|**Q6GZS1**|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3
MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKITLPMGPQNAGDVTNKAYVDQA
VMSAAVPVASSTTVGTIQMAGDLEGSSGTNPIIAANKITLNKLQKIGPKMVIGNPNSDWN
...
My expected output is multiple files, each containing the sequences that share the same EC number:
3.6.4.-.fasta
>sp|**Q6GZW6**|009L_FRG3G Putative helicase 009L OS=Frog virus 3
MDTSPYDFLKLYPWLSRGEADKGTLLDAFPGETFEQSLASDVAMRRAVQDDPAFGHQKLV
ETFLSEDTPYRELLLFHAPGTGKTCTVVSVAERAKEKGLTRGCIVLARGAALLRNFLHEL
>sp|**Q6GZS1**|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3
MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKITLPMGPQNAGDVTNKAYVDQA
VMSAAVPVASSTTVGTIQMAGDLEGSSGTNPIIAANKITLNKLQKIGPKMVIGNPNSDWN
**2.7.11.1.fasta**
>sp|Q6GZV6|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3
MEEVDGCAGPNSEAGALTAGALTAGAFAVTAGAGVAGAGVAGVGWCSWCSWCSWCWCSWC
SWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCLSKGWEDRGGLEGCKSCKGWCL
ETC...
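To make the expected grouping concrete: the rows of file_1.txt can be inverted into a mapping from each EC number (column 3) to the accessions (column 2) that carry it, and each key then corresponds to one output file. Below is a minimal sketch using only the standard library, with a few of the sample rows inlined. The helper name `group_accessions_by_ec` is made up for illustration, and stripping `*` assumes the `**` markers are only highlighting in this post and not part of the real file.

```python
from collections import defaultdict

# A few rows copied from the file_1.txt sample (highlighting removed).
SAMPLE_ROWS = """\
009L_FRG3G Q6GZW6 3.6.4.-
019R_FRG3G Q6GZV6 2.7.11.1
044L_IIV3 Q197B6 2.7.11.-
055L_FRG3G Q6GZS1 3.6.4.-
"""


def group_accessions_by_ec(lines):
    """Map each EC number (column 3) to the list of accessions (column 2)."""
    groups = defaultdict(list)
    for line in lines:
        # Assumption: '*' never occurs in real data, only as post highlighting.
        fields = line.replace("*", "").split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        groups[fields[2]].append(fields[1])
    return dict(groups)


ec_groups = group_accessions_by_ec(SAMPLE_ROWS.splitlines())
for ec, accessions in ec_groups.items():
    print(f"{ec}.fasta <- {', '.join(accessions)}")
```

With these four rows, `3.6.4.-` collects two accessions (Q6GZW6 and Q6GZS1), which matches the two records shown in `3.6.4.-.fasta` above.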
My problem is slightly more complex than the one solved here: https://stackoverflow.com/questions/15352219/extract-sequences-from-a-fasta-file-based-on-entries-in-a-separate-file, which only writes to a single output file.
My code so far:
```python
#!/usr/bin/python3
import os
from Bio import SeqIO


def get_accession(record):
    """Given a SeqRecord, return the accession number as a string."""
    parts = record.id.split("|")
    assert len(parts) == 3 and parts[0] == "sp"
    return parts[1]


records_dict = SeqIO.to_dict(SeqIO.parse("file_2.fasta", "fasta"), key_function=get_accession)

# initialize a dictionary
answer = {}
with open('file_1.txt', 'r') as content:
    # extract AC and EC from file_1.txt into a dictionary
    for line in content:
        lines = line.split()
        answer[lines[1]] = lines[2]

# do the comparison and write to the files here
records = SeqIO.parse("file_2.fasta", "fasta")
for seq_record in records:
    for key in records_dict:  # satisfies the condition that every key in file 1 is in file 2
        if key in answer:
            EC = answer[key]
            # convert the EC number into a file name
            eachEC = "".join(eachEC for eachEC in EC if eachEC.isalnum() or eachEC in ['', '.', '-']).rstrip() + ".fasta"
            mode = 'a' if os.path.exists(eachEC) else 'w'
            if eachEC:
                with open(eachEC, mode) as fileinput:
                    fileinput.write(seq_record.format("fasta").strip())
                    fileinput.write(str(seq_record.seq) + "\n")
```
Problem: my script creates the multiple files, but it copies every sequence in file_2 into each of them. Thanks. I am new to Python.
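A likely cause, judging from the script as posted: the inner `for key in records_dict` loop runs over every accession for each `seq_record`, so each record is written to every EC file rather than only to the file for its own accession. In addition, `seq_record.format("fasta")` already contains the sequence, so the extra `str(seq_record.seq)` write repeats it. The sketch below looks up only the record's own accession and writes each EC file once with `SeqIO.write`. It is an untested-on-your-data sketch, not a definitive fix: the helper names (`accession_of`, `ec_to_filename`, `group_records_by_ec`, `split_fasta_by_ec`) are invented here, and it assumes the `**` markers in the samples are post highlighting only.

```python
import os
import re
from collections import defaultdict


def accession_of(record_id):
    """Return the accession from an id like 'sp|Q6GZW6|009L_FRG3G', else None."""
    parts = record_id.split("|")
    return parts[1].strip() if len(parts) == 3 else None


def ec_to_filename(ec):
    """Keep letters, digits, '.', '_' and '-' so the EC number is a safe file name."""
    return re.sub(r"[^A-Za-z0-9._-]", "", ec) + ".fasta"


def group_records_by_ec(records, ec_by_accession):
    """Group records by the EC number of their own accession only."""
    groups = defaultdict(list)
    for record in records:
        ec = ec_by_accession.get(accession_of(record.id))
        if ec is not None:  # the accession is listed in file 1
            groups[ec].append(record)
    return dict(groups)


def split_fasta_by_ec(mapping_path, fasta_path, out_dir="."):
    """Write one FASTA file per EC number and return the paths written."""
    # Imported here so the helpers above can be used without Biopython.
    from Bio import SeqIO

    with open(mapping_path) as handle:
        rows = (line.split() for line in handle)
        # column 2 (accession) -> column 3 (EC number)
        ec_by_accession = {row[1]: row[2] for row in rows if len(row) >= 3}

    groups = group_records_by_ec(SeqIO.parse(fasta_path, "fasta"), ec_by_accession)
    written = []
    for ec, records in groups.items():
        path = os.path.join(out_dir, ec_to_filename(ec))
        SeqIO.write(records, path, "fasta")  # header and sequence, written once
        written.append(path)
    return written
```

Calling `split_fasta_by_ec("file_1.txt", "file_2.fasta")` would then be expected to produce files such as `3.6.4.-.fasta` and `2.7.11.1.fasta`. Because each file is written in a single call, there is no need to choose between append and write mode, and re-running the script overwrites the files instead of appending duplicate records.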
Do you want to write the FASTA sequences whose IDs match column 2 to a new FASTA file named after column 3? Your file 2 FASTA is also missing '>' for the headers. If you can format your question better, we can help.
Please edit your code with the 101010 button.