Hey, I have a problem. I have a list of protein names, for example lpg_200. How can I get the FASTA sequences for them?
Regards
X
Go to ftp://ftp.ncbi.nih.gov/refseq/ and download the corresponding database (human? mouse? etc.), then extract the sequences with a simple Python script:
from Bio import SeqIO
import sys

syntax = '''
------------------------------------------------------------------------------------
Syntax: python extract_sequence_by_name_list.py *file1.fasta **file2.txt
*Sequences in fasta format
**List of sequences to extract; must have the same name as in fasta file without '>'
------------------------------------------------------------------------------------
'''

# Show the usage message unless exactly two arguments are given
if len(sys.argv) != 3:
    print(syntax)
    sys.exit()

# Read the wanted names into a set for fast membership tests
with open(sys.argv[2]) as names_file:
    wanted = {line.strip() for line in names_file if line.strip()}

# Stream the FASTA file and write only the records whose ID is wanted
seqiter = SeqIO.parse(sys.argv[1], 'fasta')
SeqIO.write((seq for seq in seqiter if seq.id in wanted), sys.stdout, "fasta")
They are Legionella proteins.
I wrote this ;/ but something's wrong...
#!/bin/bash
# Download FASTA sequences given a file of UniProt IDs
names=$1
file_of_seqs=$2
list=$(cat "${names}")
mkdir -p "${file_of_seqs}"
cp "${names}" "${file_of_seqs}"
cd "${file_of_seqs}" || exit 1
for word in ${list}
do
    # The URL must be quoted; otherwise the shell treats each '&' as "run in background"
    wget -nv -O "${word}.fasta" "http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=${word}&fil=&limit=10&force=no&preview=true&format=fasta"
done
EXAMPLE INPUT:
Lpar_2881 Lpar_2978 Lpar_3608 lpg0403
I will try in Python now (I am trying to learn Perl at the moment).
XD, it's not complicated, you just have to do what I said:
1.- Go to the NCBI FTP page.
2.- Download your database.
3.- Copy and paste my script into any text editor and save it as extract_sequence_by_name_list.py (I use the Notepad++ text editor for that).
4.- Save your list of wanted proteins in a separate txt file, one name per line (make sure they have the same names as in the FASTA database).
5.- Run in bash as: python extract_sequence_by_name_list.py database.fasta wanted.txt > wanted_proteins.fasta
6.- Be happy :)
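The names in the list have to match the record IDs in the FASTA file exactly. Biopython takes the ID from the header text up to the first whitespace, so it can help to check which names would not be found before running the extraction. A minimal sketch of such a check, using only the standard library; the function names and the example records are made up for illustration:

```python
def fasta_ids(lines):
    """Yield the ID of each FASTA record: the header text up to the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


def missing_names(wanted, fasta_lines):
    """Return the wanted names that never occur as a record ID, in input order."""
    found = set(fasta_ids(fasta_lines))
    return [name for name in wanted if name not in found]


# Hypothetical data for illustration only; the sequences are made up.
example_fasta = [
    ">lpg0403 hypothetical protein",
    "MKTAYIAKQR",
    ">Lpar_2881",
    "MSLLTEVETP",
]
example_wanted = ["lpg0403", "Lpar_2881", "Lpar_9999"]

print(missing_names(example_wanted, example_fasta))
```

With real files, pass open file handles instead of the example lists. If most names come back as missing, the database probably uses a different kind of identifier than your list.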
You can use the UniProt ID mapping service at http://www.uniprot.org/uploadlists. Upload your list of identifiers and select to map from Gene names to UniProtKB ACs. The results can be downloaded in tab-separated format.
Alternatively, use URLs like http://www.uniprot.org/uniprot/?query=gene%3ALpar_2881&format=fasta in your program.
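A rough Python sketch of how such URLs could be built in a program. The base address and the query and format parameters are taken from the URL above; the function names are made up for illustration, and the site may since have changed its address or API, so treat this as an assumption and check before relying on it:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address as given in the post; assumed here, the current service may differ.
BASE_URL = "http://www.uniprot.org/uniprot/"


def fasta_url(gene_name):
    """Build the FASTA query URL for one gene name."""
    params = {"query": "gene:" + gene_name, "format": "fasta"}
    return BASE_URL + "?" + urlencode(params)


def fetch_fasta(gene_name):
    """Download the FASTA text for one gene name (needs network access)."""
    with urlopen(fasta_url(gene_name)) as response:
        return response.read().decode("utf-8")


print(fasta_url("Lpar_2881"))
```

Letting urlencode do the escaping avoids having to type %3A by hand. For a list of names, call fetch_fasta once per name and append the results to one output file.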
If you are using UniProtKB, you can of course add additional search criteria to avoid duplication, e.g. the taxonomy identifier:
gene:Lpar_2881 and organism:45071
For this particular organism, there are only unreviewed entries, but in other cases there may be reviewed and unreviewed ones, in which case it can be useful to also add reviewed:yes in case of redundancy/duplication.
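If you build these queries in a script, the extra criteria can be added as optional parts. A small sketch, assuming the gene:, organism: and reviewed: fields work as described above; the function name and its parameters are made up for illustration:

```python
def build_query(gene_name, taxon_id=None, reviewed_only=False):
    """Compose a query string from a gene name plus optional filters."""
    parts = ["gene:" + gene_name]
    if taxon_id is not None:
        # Restrict to one organism by its taxonomy identifier
        parts.append("organism:" + str(taxon_id))
    if reviewed_only:
        # Keep only reviewed entries when duplicates are a concern
        parts.append("reviewed:yes")
    return " and ".join(parts)


print(build_query("Lpar_2881", taxon_id=45071))
```

The resulting string still has to be URL-encoded before it is put into a request.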
An alternative approach may be to generate a list of all Legionella parisiensis entries with their ORF names, and then look up your identifiers locally in this list:
http://www.uniprot.org/uniprot/?query=organism:45071
Customize your display, remove all irrelevant columns and add one for 'Gene name (ORFname)' as described in http://www.uniprot.org/help/customize.
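Once that list is downloaded in tab-separated format, the local lookup could look roughly like this in Python. The column headers "Entry" and "ORF names" and the accessions in the example are assumptions made up for illustration; use whatever headers your downloaded file actually has:

```python
import csv
import io


def load_orf_table(handle, entry_col="Entry", orf_col="ORF names"):
    """Map each ORF name to the accession of the entry that lists it."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # One entry may list several ORF names separated by spaces
        for orf in (row.get(orf_col) or "").split():
            table[orf] = row[entry_col]
    return table


# Hypothetical table for illustration; the accessions are not real.
example = io.StringIO(
    "Entry\tORF names\n"
    "X00001\tLpar_2881\n"
    "X00002\tLpar_2978 Lpar_2979\n"
)
table = load_orf_table(example)

wanted = ["Lpar_2881", "Lpar_2978", "Lpar_3608"]
hits = {name: table[name] for name in wanted if name in table}
print(hits)
```

Names that are absent from hits were not found in the list and need a closer look.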
Elaborate more please.
I have <1000 protein names (kds_0989, xyz_3999, etc.) and I need to get a file of FASTA sequences for them. I tried querying UniProt and NCBI with them, but the query is too long.
Please post a few real examples of IDs. Database identifiers can differ from db to db, and depending on what kind you have, the answer may be different.
Also use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.