Question

Motif search in multi fasta file

2.0 years ago

andrea • 0

I am having trouble searching for motifs in a multi-FASTA file. I have tried two techniques: the first gives me the name of the sequence where the motif is found, but not the motif itself or its position; the second does not return anything. Please see the code below; any assistance will be highly appreciated. The first one is:

infile=open("sequence.fasta",'r')
out=open("Result.csv",'w')
pattern=re.compile(r"(P[A-Z]{2}P")
for line in infile:
   line = line.strip("\n")
   if line.startswith('>'):
   name=line
else:
     s=re.finditer(pattern,line)
     print('%s:%s' %(name,s))
 out.write('%s:\t%s\n' %(name,s))

The code above returns the output below. It gives me the name of the sequence, P1, but not (1) the motif found or (2) the position of the motif. The Result.csv file it generates is also blank.

#output
>P1:<callable_iterator object at 0x000001D0611B7AF0

I then tried a different technique:

infile=open("sequence.fasta",'r')
open=open("result.csv"",'w')
pattern=re.compile(r"(P[A-Z]{2}P")
for line in infile:
  line=line.strip("\n")
  if line.startswith('>'):
     name=line
else:
     s = re.finditer(pattern,line)
     for match_obj in s:
     print(match_obj)
     print(match_obj.group())
    out.write('%s:\t%s\t%s\n' %(name, match_obj,match_obj.group()))

The code above does not return anything at all, and the result.csv file is also blank. I am still new to Python, and some of the techniques I used here I found on Bioinformatics Stack Exchange.

The output I want gives (1) the name of the sequence, (2) the motif found, and (3) the position of the motif (where it starts and ends) for every sequence in the multi-FASTA file, as in the example below:

>P1
PACP
22:26
>P2
PDCP
34:38

Any format is fine as long as I get something similar to the format above.

motif • 1.2k views

ADD COMMENT • link updated 2.0 years ago by iraun 6.2k • written 2.0 years ago by andrea • 0

score 2 · Accepted Answer · 2022-11-14

2.0 years ago

iraun 6.2k

Hi!

First, I would strongly recommend using Biopython to work with FASTA files.

Then, would something like this work for you?

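  import re
  from Bio import SeqIO
  # Motif from the question: P, any two uppercase letters, then P
  pattern = re.compile(r"P[A-Z]{2}P")
  input_file = 'sequence.fasta'
  # SeqIO.parse accepts a filename and yields one record per FASTA entry
  for fasta in SeqIO.parse(input_file, 'fasta'):
     name, sequence = fasta.id, str(fasta.seq)
     for m in pattern.finditer(sequence):
        # m.group() is the matched motif; start is 0-based, end is exclusive
        print(name, m.group(), m.start(), m.end())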
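
If Biopython is not installed, roughly the same idea can be sketched with the standard library only. This is a minimal sketch, not a replacement for SeqIO: the helper names `read_fasta`, `find_motifs` and `write_hits` are made up for illustration, the parser is hand-written and only handles plain `>`-header FASTA, and the output layout simply copies the name / motif / start:end example from the question.

```python
import re

# Motif from the question: P, any two uppercase letters, then P.
PATTERN = re.compile(r"P[A-Z]{2}P")


def read_fasta(lines):
    """Yield (name, sequence) pairs, joining sequences wrapped over several lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if name is not None:
        yield name, "".join(chunks)


def find_motifs(lines, pattern=PATTERN):
    """Return (name, motif, start, end) for every non-overlapping match."""
    hits = []
    for name, seq in read_fasta(lines):
        for m in pattern.finditer(seq):
            # start is 0-based and end is exclusive, as in Python slicing.
            hits.append((name, m.group(), m.start(), m.end()))
    return hits


def write_hits(hits, out_path):
    """Write each hit as three lines: >name, motif, start:end."""
    # The with-block closes the file, so the written lines are flushed to disk.
    with open(out_path, "w") as out:
        for name, motif, start, end in hits:
            out.write(f">{name}\n{motif}\n{start}:{end}\n")
```

To use it, open `sequence.fasta`, pass the file handle to `find_motifs`, and pass the resulting list to `write_hits` together with an output path. Note that `finditer` only reports non-overlapping matches, so a motif that starts inside a previous match is skipped; positions are 0-based, so add 1 to the start if you need 1-based coordinates.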

ADD COMMENT • link 2.0 years ago by iraun 6.2k

Thank you, I will give this a try.

ADD REPLY • link 2.0 years ago by andrea • 0

It worked, thank you so much.

ADD REPLY • link 2.0 years ago by andrea • 0

Good :). Please consider marking the answer as "Accepted".

ADD REPLY • link 2.0 years ago by iraun 6.2k