Hi - I am using biopython (open to any other method however) to find sequences in a fasta file that match ID's from a .txt file. Just by looking at the two files I can see there are matches, but the following script will not find them:
#!/usr/bin/env python
import sys
from Bio import SeqIO
input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")
print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
here is a snippet of the fasta file (.fasta):
>gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel OS=Streptomyces lividans GN=kcsA PE=1 SV=1
MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI
TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE
RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR
>gnl|TC-DB|P06550 1.A.1.1.2 LctB protein - Bacillus stearothermophilus.
MDYAFLGVVTAVLLGSITSLWTASVQATHRLSLDSLWVLVQWYGTMLLGFAMIYMILQAN
GHHVFTPSPSSAAGNRLSMLEDSLYLSGMTLLSVGYGDVTPVGIGRWIAIAEALLGYIMP
AVIVTRTVFDSDRR
and the id file (.txt)
Q09666
Q7PMK5
P98161
A0CX44
Q23K98
P29993
Q9VLS3
Q14643
Q9H5I5
Q4E330
P29995
I don't think there is a problem with the python script, but maybe the fasta file. can anyone spot anything strange. I know for a fact all id's match.
Thanks very much,
Arron.
thanks chris, yes the website i downloaded the fast files from (TCDB) has included this in the id which is annoying. thanks for pointing this out.