Hi All,
I have a batch of human gene symbol name and I want to obtain the protein sequence for them. Is there any script can do that?
Thanks.
Hi All,
I have a batch of human gene symbol name and I want to obtain the protein sequence for them. Is there any script can do that?
Thanks.
You can use XML as query for Ensembl. The following example is for two human gene symbols (MT-TF and TNRC6A).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName = "default" formatter = "FASTA" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
<Filter name = "external_gene_name" value = "MT-TF,TNRC6A"/>
<Attribute name = "peptide" />
<Attribute name = "ensembl_gene_id" />
<Attribute name = "ensembl_gene_id_version" />
<Attribute name = "ensembl_transcript_id" />
<Attribute name = "ensembl_transcript_id_version" />
<Attribute name = "external_gene_name" />
</Dataset>
</Query>
Format the above XML to one line and download the protein sequences using wget
.
wget -O results.fasta 'http://www.ensembl.org/biomart/martservice?query=<Query virtualSchemaName="default" formatter="FASTA" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6"><Dataset name="hsapiens_gene_ensembl" interface="default"><Filter name="external_gene_name" value="MT-TF,TNRC6A"/><Attribute name="peptide"/><Attribute name="ensembl_gene_id"/><Attribute name="ensembl_gene_id_version"/><Attribute name="ensembl_transcript_id"/><Attribute name="ensembl_transcript_id_version"/><Attribute name="external_gene_name"/></Dataset></Query>'
The FASTA sequences will be in the results.fasta
file.
>ENSG00000090905|ENSG00000090905.19|ENST00000568750|ENST00000568750.1|TNRC6A
XQDGIVADESQNMQFMSSQSMKLPPSNSALPNQALGSIAGLGMQNLNSVRQNGNPSMFGV
GNTAAQPRGMQQPPAQPLSSSQPNLRAQVPPPLLSPQVAMLNQLSQLNQLSQISQLQRLL
AQQQRAQSQRSVPSGNRPQQDQQGRPLSVQQQMMQQSRQLDPNLLVKQQTPPSQQQPLHQ
PAMKSFLDNVMPHTTPELQKGPS
>ENSG00000090905|ENSG00000090905.19|ENST00000450465|ENST00000450465.6|TNRC6A
QQDIVGSWGIPPATGKPPGTGWLGGPIPAPAKEEEPTGWEEPSPESIRRKMEIDDGTSAW
GDPSKYNYKNVNMWNKNVPNGNSRSDQQAQVHQLLTPASAISNKEASSGSGPKSMQDGWC
GDDMPLPGNRPTGWEEEEDVEIGMWNSNSSQELNSSLNWPPYTKKMSSKGLSGKKRRRER
GMMKGGNKQEEAWINPFVKQFSNISFSRDSPEENVQSNKMDLSGGMLQDKRMEIDKHSLN
IGDYNRTVGKGPGSRPQISKESSMERNPYFDKNGNPSMFGVGNTAAQPRGMQQPPAQPLS
SSQPNLRAQVPPPLLSPQVPVSLLKYAPNNGGLNPLFGPQQVAMLNQLSQLNQLSQISQL
QRLLAQQQRAQSQRSVPSGNRPQQDQQGRPLSVQQQMMQQSRQLDPNLLVKQQTPPSQQQ
PLHQPAMKSFLDNVMPHTTPELQKGPSPINAFSNFPIGLNSNLNVNMDMNSIKEPQSRLR
KWTTVDSISVNTSLDQNSSKHGAISSGFRLEESPFVPYDFMNSSTSPASPPGSIGDGWPR
AKSPNGSSSVNWPPEFRPGEPWKGYPNIDPETDPYVTPGSVINNLSINTVREVDHLRDRN
SGSSSSLNTTLPSTSAWSSIRASNYNVPLSSTAQSTSARNSDSKLTWSPGSVTNTSLAHE
LWKVPLPPKNITAPSRPPPGLTGQKPPLSTWDNSPLRIGGGWGNSDARYTPGSSWGESSS
GRITNWLVLKNLTPQIDGSTLRTLCMQHGPLITFHLNLPHGNALVRYSSKEEVVKAQKSL
HMCVLGNTTILAEFASEEEISRFFAQSQSLTPSPGWQSLGSSQSRLGSLDCSHSFSSRTD
LNHWNGAGLSGTNCGDLHGTSLWGTPHYSTSLWGPPSSSDPRGISSPSPINAFLSVDHLG
GGGESM*
...
You can also use UniProt's IDmapping tool at https://www.uniprot.org/uploadlists to map from gene names to UniProtKB, and then download the results in FASTA format.
Based on the comment by genomax. Uses biopython to parse the Fasta file.
#!/bin/env python3
import gzip
import io
import re
import datetime
from pathlib import Path
from Bio import SeqIO
url = "https://www.uniprot.org/uniprot/?include=false&format=fasta&compress=yes&force=true&query=proteome:UP000005640"
local = Path.home() / "Downloads/uniprot-proteome_UP000005640.fasta.gz"
# Download if not exists
if not local.exists():
import requests
with requests.get(url) as r:
with local.open(mode='wb') as fd:
for data in r.iter_content(chunk_size=(2 ** 20)):
fd.write(data)
with Path(str(local) + ".meta").open('w') as fd:
print(F"Downloaded on {datetime.datetime.utcnow()} (UTC) from", file=fd)
print(url, file=fd)
def rec_by_gene(filename):
with gzip.open(filename, mode='rb') as fd:
for rec in SeqIO.parse(io.TextIOWrapper(fd), format='fasta'):
(_, protid, desc) = rec.description.split("|")
try:
desc = re.match(r"(?P<Entry>[\w]+) (?P<Protein>.+) OS=(?P<OS>[\w\s]+) OX=(?P<OX>[\w]+) GN=(?P<GN>[\w\-]+) PE=(?P<PE>[\w]+) SV=(?P<SV>[\w]+)", desc).groupdict()
yield (desc['GN'], rec)
except AttributeError:
pass
proteins = dict(rec_by_gene(local))
genes = list(proteins.keys())
print("Got genes:", *genes[0:4], "etc.")
gene = genes[0]
print("Example: Gene", gene, "has protein sequence", proteins[gene].seq)
Output:
Got genes: A1CF A2ML1 ACOX3 SCG5 etc.
Example: Gene A1CF has protein sequence MESNHKSGDGLSGTQKEAALRALVQRTGYSLVQENGQRKYGGPPPGWDAAPPERGCEIFIGKLPRDLFEDELIPLCEKIGKIYEMRMMMDFNGNNRGYAFVTFSNKVEAKNAIKQLNNYEIRNGRLLGVCASVDNCRLFVGGIPKTKKREEILSEMKKVTEGVVDVIVYPSAADKTKNRGFAFVEYESHRAAAMARRKLLP
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Why not download human proteome from UniProt and parse as needed. If you need just one protein per gene then this file from same source.