I am taking an approach but have not been able to get my expected output. I have two fasta files as below - what I do is I make reverse-complement of 5mers of each record in file1
and then I look for any record in file2
if has or does not have any of reverse-complement 5mers.
file1.fa
>seq1
ACCGGT
>seq2
AATTTC
file2.fa
>seq3
CCGGT
>seq9
GGGGGGCCC
So based on the dummy examples above - the reverse-complement of ACCGG
in seq1
(file1)
exists in the seq3
of file2.fa (the CCGGT)
- So then the seq1
should be marked against the seq3
in the output (like having a counter - I am not sure how to do it).
At the end I would like to have a table as below showing every seq
in file1 against every seq
in file2 with a mark representing if they hetero-dimerized or not. Based on my example only seq1 from file1
hetero-dimerizes with the seq3 in the file2
.
file2 seq3 seq9
file1
seq1 1 0
seq2 0 0
This is the code that I have tried so far - it does not give expected output - I also could not set the script in a way to write output in table format - any help to fix the code is appreciated.
Thanks
with open('file1.fa', 'r') as fw, open('file2.fa', 'r') as rv:
for fw_record in SeqIO.parse(fw, 'fasta'):
fcnt = 0
lst = []
for i in range(len(fw_record.seq)):
kmer = str(fw_record.seq[i:i + 5])
if len(kmer) == 5:
C_kmer = Seq(kmer).complement()
lst.append(C_kmer[::-1])
#print(lst)
for rv_record in SeqIO.parse(rv, 'fasta'):
rcnt = 0
if any(items in rv_record.seq for items in lst):
fcnt+=1
rcnt+=1
if fcnt == 0 and rcnt == 0:
print('>>' + fw_record.id)
print('>>' + fw_record.seq)
if fcnt > 0 and rcnt > 0:
print('>>>' + fw_record.id)
print('>>>' + fw_record.seq)
Does the order of sequences matter here? i.e. are you only looking for the reverse-complement of
File 1, Seq 1
inFile 2, Seq 1
(numbered sequentially, not by your IDs here).Or are you looking for the R.C. of
Seq1
wherever it occurs inFile 2
?Thank you for your reply: the order of sequences does not matter - I make a reverse complement of 5mers of each record in file1 and look for them in all records of file2. If any exist then both of the records (from file1 and file2) should be marked - it is not like seq1 from file1 should be checked only with the seq1 from file2. The reverse complement 5mers coming from every record in file1 should be checked against all records of file2