Question

Hetero-dimerization analysis with two FASTA files

0

Entering edit mode

3.7 years ago

Apex92 ▴ 300

I made a post previously (How to loop on two fasta files for hetero-dimerization analysis) but I think I could not explain my problem clearly.

Let's assume I have two fast files as below:

file1.fa

>seq20
NNNNNNN
>seq11
NNACCGGNN
>seq2
NNNNNNNN

file2.fa

>seq3
AAACCGGTAAA
>seq9
GGGGGGGGGGG

I want to make a loop using two files - for each sequence in file1 the loop should go through all the sequences in file2.

What I want to do is to start with the first sequence in file1 - make 5mers - save the 5mer's reverse complements in a list and look for any item in the list if exists in ANY sequence in file2. Then loop goes to the second sequence in file2 and makes a NEW list (not adding up to the previous list).

Based on the dummy sequences - one of the possible 5mers from the seq11 in the file1.fa is ACCGG which it's reverse complement becomes (CCGGT) - if you look at the seq3 in the file2 you could see that CCGGT exists in that sequence then these sequence ids should be reported.

Base on my examples only seq11 hetero-dimerizes with seq3 so the desired output should look like a table as below:

     file2     seq3     seq9
file1 
seq20          0         0
seq11          1         0
seq2           0         0

I have written this script but it does not give any output - also I have not been able to set the rest of the script to make the output in a table format. Any help is appreciated - Many thanks.

from Bio import SeqIO

with open('test1.fa', 'r') as fw, open('test2.fa', 'r') as rv:
    for fw_record in SeqIO.parse(fw, 'fasta'):
        lst = []
        for i in range(len(fw_record.seq)):
            kmer = str(fw_record.seq[i:i + 5])
            if len(kmer) == 5:
                C_kmer = Seq(kmer).complement()
                lst.append(C_kmer[::-1])
        print(lst)

        for rv_record in SeqIO.parse(rv, 'fasta'):
            if any(items in rv_record.seq for items in lst):
                print(fw_record.id)
                print(fw_record.seq)
                print(rv_record.id)
                print(rv_record.seq)

sequence FASTA python awk bash • 867 views

ADD COMMENT • link updated 3.6 years ago by Ram 44k • written 3.7 years ago by Apex92 ▴ 300

0

Entering edit mode

Please stop asking the same question over and over again.

ADD REPLY • link 3.6 years ago by Istvan Albert 102k

score 0 · Answer 1 · 2021-04-14

0

Entering edit mode

3.6 years ago

cpad0112 21k

This issue is addressed in How to get Python code to run correctly for hetero-dimerization analysis.

ADD COMMENT • link 3.6 years ago by cpad0112 21k