Find Sequences In Fasta File From List Of Id'S
3
1
Entering edit mode
11.6 years ago
arronslacey ▴ 320

Hi - I am using biopython (open to any other method however) to find sequences in a fasta file that match ID's from a .txt file. Just by looking at the two files I can see there are matches, but the following script will not find them:

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

here is a snippet of the fasta file (.fasta):

>gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel OS=Streptomyces lividans GN=kcsA PE=1 SV=1
MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI
TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE
RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR
>gnl|TC-DB|P06550 1.A.1.1.2 LctB protein - Bacillus stearothermophilus.
MDYAFLGVVTAVLLGSITSLWTASVQATHRLSLDSLWVLVQWYGTMLLGFAMIYMILQAN
GHHVFTPSPSSAAGNRLSMLEDSLYLSGMTLLSVGYGDVTPVGIGRWIAIAEALLGYIMP
AVIVTRTVFDSDRR

and the id file (.txt)

Q09666 
Q7PMK5 
P98161 
A0CX44 
Q23K98 
P29993 
Q9VLS3 
Q14643 
Q9H5I5 
Q4E330 
P29995

I don't think there is a problem with the python script, but maybe the fasta file. can anyone spot anything strange. I know for a fact all id's match.

Thanks very much,

Arron.

biopython bioperl • 8.6k views
ADD COMMENT
3
Entering edit mode
11.6 years ago
Xtof ▴ 170

Hi Arron

I don't use Biopython but I think that for the first sequence of your fasta file for example r.id gives you "gnl|TC-DB|P0A334". But into the id file only the last part is written. So I think you have to split your r.id if you want it to match to the id file

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id.split('|')[2] in wanted)

Also I wonder why you split the id into the line

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))

this code should work as well

wanted = set(line.rstrip("\n") for line in open(id_file))

Hope that helps

Chris

ADD COMMENT
0
Entering edit mode

thanks chris, yes the website i downloaded the fast files from (TCDB) has included this in the id which is annoying. thanks for pointing this out.

ADD REPLY
2
Entering edit mode
11.6 years ago

r.id in the line `

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)

gives back something from the form 'gnl|TC-DB|P0A334'. So, just changing that line to

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id.split('|')[2] in wanted)

will work. I will assume that your snippet of the fasta file is badly chosen, because none of the ids is anywhere to be found in that snippet. But if you add P0A334 to your list of ids, the record will now be saved.

ADD COMMENT
0
Entering edit mode

thanks very much, worked like a charm.

ADD REPLY
0
Entering edit mode
6.8 years ago

I have a genome in fasta format and I have a sequence of length 5000bp I want to find if the sequence is present in that genome or not? If it present then what is the exact position of that sequence in genome or overlapping gene id? How to get it in python? please can anyone suggest me

ADD COMMENT

Login before adding your answer.

Traffic: 2119 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6