Script to obtain human protein sequences by the gene symbols
3
0
Entering edit mode
4.5 years ago
Shicheng Guo ★ 9.5k

Hi All,

I have a batch of human gene symbol name and I want to obtain the protein sequence for them. Is there any script can do that?

Thanks.

script gene protein • 2.2k views
ADD COMMENT
1
Entering edit mode

Why not download human proteome from UniProt and parse as needed. If you need just one protein per gene then this file from same source.

ADD REPLY
3
Entering edit mode
4.5 years ago

You can use XML as query for Ensembl. The following example is for two human gene symbols (MT-TF and TNRC6A).

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "FASTA" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >       
    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
        <Filter name = "external_gene_name" value = "MT-TF,TNRC6A"/>
        <Attribute name = "peptide" />
        <Attribute name = "ensembl_gene_id" />
        <Attribute name = "ensembl_gene_id_version" />
        <Attribute name = "ensembl_transcript_id" />
        <Attribute name = "ensembl_transcript_id_version" />
        <Attribute name = "external_gene_name" />
    </Dataset>
</Query>

Format the above XML to one line and download the protein sequences using wget.

wget -O results.fasta 'http://www.ensembl.org/biomart/martservice?query=<Query virtualSchemaName="default" formatter="FASTA" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6"><Dataset name="hsapiens_gene_ensembl" interface="default"><Filter name="external_gene_name" value="MT-TF,TNRC6A"/><Attribute name="peptide"/><Attribute name="ensembl_gene_id"/><Attribute name="ensembl_gene_id_version"/><Attribute name="ensembl_transcript_id"/><Attribute name="ensembl_transcript_id_version"/><Attribute name="external_gene_name"/></Dataset></Query>'

The FASTA sequences will be in the results.fasta file.

>ENSG00000090905|ENSG00000090905.19|ENST00000568750|ENST00000568750.1|TNRC6A
XQDGIVADESQNMQFMSSQSMKLPPSNSALPNQALGSIAGLGMQNLNSVRQNGNPSMFGV
GNTAAQPRGMQQPPAQPLSSSQPNLRAQVPPPLLSPQVAMLNQLSQLNQLSQISQLQRLL
AQQQRAQSQRSVPSGNRPQQDQQGRPLSVQQQMMQQSRQLDPNLLVKQQTPPSQQQPLHQ
PAMKSFLDNVMPHTTPELQKGPS
>ENSG00000090905|ENSG00000090905.19|ENST00000450465|ENST00000450465.6|TNRC6A
QQDIVGSWGIPPATGKPPGTGWLGGPIPAPAKEEEPTGWEEPSPESIRRKMEIDDGTSAW
GDPSKYNYKNVNMWNKNVPNGNSRSDQQAQVHQLLTPASAISNKEASSGSGPKSMQDGWC
GDDMPLPGNRPTGWEEEEDVEIGMWNSNSSQELNSSLNWPPYTKKMSSKGLSGKKRRRER
GMMKGGNKQEEAWINPFVKQFSNISFSRDSPEENVQSNKMDLSGGMLQDKRMEIDKHSLN
IGDYNRTVGKGPGSRPQISKESSMERNPYFDKNGNPSMFGVGNTAAQPRGMQQPPAQPLS
SSQPNLRAQVPPPLLSPQVPVSLLKYAPNNGGLNPLFGPQQVAMLNQLSQLNQLSQISQL
QRLLAQQQRAQSQRSVPSGNRPQQDQQGRPLSVQQQMMQQSRQLDPNLLVKQQTPPSQQQ
PLHQPAMKSFLDNVMPHTTPELQKGPSPINAFSNFPIGLNSNLNVNMDMNSIKEPQSRLR
KWTTVDSISVNTSLDQNSSKHGAISSGFRLEESPFVPYDFMNSSTSPASPPGSIGDGWPR
AKSPNGSSSVNWPPEFRPGEPWKGYPNIDPETDPYVTPGSVINNLSINTVREVDHLRDRN
SGSSSSLNTTLPSTSAWSSIRASNYNVPLSSTAQSTSARNSDSKLTWSPGSVTNTSLAHE
LWKVPLPPKNITAPSRPPPGLTGQKPPLSTWDNSPLRIGGGWGNSDARYTPGSSWGESSS
GRITNWLVLKNLTPQIDGSTLRTLCMQHGPLITFHLNLPHGNALVRYSSKEEVVKAQKSL
HMCVLGNTTILAEFASEEEISRFFAQSQSLTPSPGWQSLGSSQSRLGSLDCSHSFSSRTD
LNHWNGAGLSGTNCGDLHGTSLWGTPHYSTSLWGPPSSSDPRGISSPSPINAFLSVDHLG
GGGESM*
...
ADD COMMENT
1
Entering edit mode
4.4 years ago

You can also use UniProt's IDmapping tool at https://www.uniprot.org/uploadlists to map from gene names to UniProtKB, and then download the results in FASTA format.

ADD COMMENT
1
Entering edit mode
4.4 years ago

Based on the comment by genomax. Uses biopython to parse the Fasta file.

#!/bin/env python3

import gzip
import io
import re
import datetime
from pathlib import Path
from Bio import SeqIO

url = "https://www.uniprot.org/uniprot/?include=false&format=fasta&compress=yes&force=true&query=proteome:UP000005640"

local = Path.home() / "Downloads/uniprot-proteome_UP000005640.fasta.gz"

# Download if not exists
if not local.exists():
    import requests
    with requests.get(url) as r:
        with local.open(mode='wb') as fd:
            for data in r.iter_content(chunk_size=(2 ** 20)):
                fd.write(data)
        with Path(str(local) + ".meta").open('w') as fd:
            print(F"Downloaded on {datetime.datetime.utcnow()} (UTC) from", file=fd)
            print(url, file=fd)

def rec_by_gene(filename):
    with gzip.open(filename, mode='rb') as fd:
        for rec in SeqIO.parse(io.TextIOWrapper(fd), format='fasta'):
            (_, protid, desc) = rec.description.split("|")
            try:
                desc = re.match(r"(?P<Entry>[\w]+) (?P<Protein>.+) OS=(?P<OS>[\w\s]+) OX=(?P<OX>[\w]+) GN=(?P<GN>[\w\-]+) PE=(?P<PE>[\w]+) SV=(?P<SV>[\w]+)", desc).groupdict()
                yield (desc['GN'], rec)
            except AttributeError:
                pass

proteins = dict(rec_by_gene(local))
genes = list(proteins.keys())

print("Got genes:", *genes[0:4], "etc.")

gene = genes[0]
print("Example: Gene", gene, "has protein sequence", proteins[gene].seq)

Output:

Got genes: A1CF A2ML1 ACOX3 SCG5 etc.
Example: Gene A1CF has protein sequence MESNHKSGDGLSGTQKEAALRALVQRTGYSLVQENGQRKYGGPPPGWDAAPPERGCEIFIGKLPRDLFEDELIPLCEKIGKIYEMRMMMDFNGNNRGYAFVTFSNKVEAKNAIKQLNNYEIRNGRLLGVCASVDNCRLFVGGIPKTKKREEILSEMKKVTEGVVDVIVYPSAADKTKNRGFAFVEYESHRAAAMARRKLLP
ADD COMMENT

Login before adding your answer.

Traffic: 2697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6