Entering edit mode
7.4 years ago
David
▴
240
Hi, I would like to parse a fasta file and get all headers and seqs that match some strings (so called pattern below).
What happens is that all the headers match and the script returns all sequences so it´s not working as expected.
Can you help finding what´s wrong ??
# biopython
from Bio import SeqIO
# regex library
import re
# file with FASTA sequence
infile = "seq.fa"
# File looks like this
#>1238344 mouse
#aagctacgacatcagctaca
#>1238344 homo sapiens
#ttagcatcagcatcagctacta
# pattern to search for
#pattern = "sapiens|mouse"
pattern="Eukaryota|metagenome|[Homo Sapiens]|[Mus musculus]|[Rattus norvegicus]|Rhizobium|Gorilla|beringei|thaliana|[Oryza sativa]|Dictyostelium|mitochondria|Equus caballus|Plasmod\
ium falciparum|Drosophila melanogaster"
# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
matches = re.findall(pattern, seq_record.description, re.I)
if matches:
#print "Matches = ", len(matches),"\n",
print ">",seq_record.id," ",seq_record.description, "\n",
print seq_record.seq,"\n",
Thanks
Providing some examples would help. And your code snippet is sketchy at best.
Yes, there are definitely important parts missing from that code.
Sorry guys my copy and paste didn´t work as expected. Hope it´s better now.
if matches
in this case will always evaluate toTrue
, therefore it will print everything in the input file.If i add the following sequence which is not in my pattern list (sorry guys forgot to add it). It is also returned.
Running your above script, and adding a print line to print
matches
(before theif
statement), yields this:Don't use
re.findall
in this instance. Create a list, with each string you want to search for, then for each record in the input fasta, check if it is contained in the list.Thanks, Can you give a small example, don´t really know how to do that ?