Question

Script to obtain human protein sequences by the gene symbols

0

Entering edit mode

4.5 years ago

Shicheng Guo ★ 9.6k

Hi All,

I have a batch of human gene symbol name and I want to obtain the protein sequence for them. Is there any script can do that?

Thanks.

script gene protein • 2.3k views

ADD COMMENT • link updated 4.4 years ago by user_without_id ▴ 150 • written 4.5 years ago by Shicheng Guo ★ 9.6k

1

Entering edit mode

Why not download human proteome from UniProt and parse as needed. If you need just one protein per gene then this file from same source.

ADD REPLY • link 4.5 years ago by GenoMax 148k

score 3 · Answer 1 · 2020-06-04

You can use XML as query for Ensembl. The following example is for two human gene symbols (MT-TF and TNRC6A).

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "FASTA" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >       
    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
        <Filter name = "external_gene_name" value = "MT-TF,TNRC6A"/>
        <Attribute name = "peptide" />
        <Attribute name = "ensembl_gene_id" />
        <Attribute name = "ensembl_gene_id_version" />
        <Attribute name = "ensembl_transcript_id" />
        <Attribute name = "ensembl_transcript_id_version" />
        <Attribute name = "external_gene_name" />
    </Dataset>
</Query>

Format the above XML to one line and download the protein sequences using wget.

wget -O results.fasta 'http://www.ensembl.org/biomart/martservice?query=<Query virtualSchemaName="default" formatter="FASTA" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6"><Dataset name="hsapiens_gene_ensembl" interface="default"><Filter name="external_gene_name" value="MT-TF,TNRC6A"/><Attribute name="peptide"/><Attribute name="ensembl_gene_id"/><Attribute name="ensembl_gene_id_version"/><Attribute name="ensembl_transcript_id"/><Attribute name="ensembl_transcript_id_version"/><Attribute name="external_gene_name"/></Dataset></Query>'

The FASTA sequences will be in the results.fasta file.

>ENSG00000090905|ENSG00000090905.19|ENST00000568750|ENST00000568750.1|TNRC6A
XQDGIVADESQNMQFMSSQSMKLPPSNSALPNQALGSIAGLGMQNLNSVRQNGNPSMFGV
GNTAAQPRGMQQPPAQPLSSSQPNLRAQVPPPLLSPQVAMLNQLSQLNQLSQISQLQRLL
AQQQRAQSQRSVPSGNRPQQDQQGRPLSVQQQMMQQSRQLDPNLLVKQQTPPSQQQPLHQ
PAMKSFLDNVMPHTTPELQKGPS
>ENSG00000090905|ENSG00000090905.19|ENST00000450465|ENST00000450465.6|TNRC6A
QQDIVGSWGIPPATGKPPGTGWLGGPIPAPAKEEEPTGWEEPSPESIRRKMEIDDGTSAW
GDPSKYNYKNVNMWNKNVPNGNSRSDQQAQVHQLLTPASAISNKEASSGSGPKSMQDGWC
GDDMPLPGNRPTGWEEEEDVEIGMWNSNSSQELNSSLNWPPYTKKMSSKGLSGKKRRRER
GMMKGGNKQEEAWINPFVKQFSNISFSRDSPEENVQSNKMDLSGGMLQDKRMEIDKHSLN
IGDYNRTVGKGPGSRPQISKESSMERNPYFDKNGNPSMFGVGNTAAQPRGMQQPPAQPLS
SSQPNLRAQVPPPLLSPQVPVSLLKYAPNNGGLNPLFGPQQVAMLNQLSQLNQLSQISQL
QRLLAQQQRAQSQRSVPSGNRPQQDQQGRPLSVQQQMMQQSRQLDPNLLVKQQTPPSQQQ
PLHQPAMKSFLDNVMPHTTPELQKGPSPINAFSNFPIGLNSNLNVNMDMNSIKEPQSRLR
KWTTVDSISVNTSLDQNSSKHGAISSGFRLEESPFVPYDFMNSSTSPASPPGSIGDGWPR
AKSPNGSSSVNWPPEFRPGEPWKGYPNIDPETDPYVTPGSVINNLSINTVREVDHLRDRN
SGSSSSLNTTLPSTSAWSSIRASNYNVPLSSTAQSTSARNSDSKLTWSPGSVTNTSLAHE
LWKVPLPPKNITAPSRPPPGLTGQKPPLSTWDNSPLRIGGGWGNSDARYTPGSSWGESSS
GRITNWLVLKNLTPQIDGSTLRTLCMQHGPLITFHLNLPHGNALVRYSSKEEVVKAQKSL
HMCVLGNTTILAEFASEEEISRFFAQSQSLTPSPGWQSLGSSQSRLGSLDCSHSFSSRTD
LNHWNGAGLSGTNCGDLHGTSLWGTPHYSTSLWGPPSSSDPRGISSPSPINAFLSVDHLG
GGGESM*
...

score 1 · Answer 2 · 2020-07-23

1

Entering edit mode

4.4 years ago

Elisabeth Gasteiger ★ 2.4k

You can also use UniProt's IDmapping tool at https://www.uniprot.org/uploadlists to map from gene names to UniProtKB, and then download the results in FASTA format.

ADD COMMENT • link 4.4 years ago by Elisabeth Gasteiger ★ 2.4k

score 1 · Answer 3 · 2020-07-23

Based on the comment by genomax. Uses biopython to parse the Fasta file.

#!/bin/env python3

import gzip
import io
import re
import datetime
from pathlib import Path
from Bio import SeqIO

url = "https://www.uniprot.org/uniprot/?include=false&format=fasta&compress=yes&force=true&query=proteome:UP000005640"

local = Path.home() / "Downloads/uniprot-proteome_UP000005640.fasta.gz"

# Download if not exists
if not local.exists():
    import requests
    with requests.get(url) as r:
        with local.open(mode='wb') as fd:
            for data in r.iter_content(chunk_size=(2 ** 20)):
                fd.write(data)
        with Path(str(local) + ".meta").open('w') as fd:
            print(F"Downloaded on {datetime.datetime.utcnow()} (UTC) from", file=fd)
            print(url, file=fd)

def rec_by_gene(filename):
    with gzip.open(filename, mode='rb') as fd:
        for rec in SeqIO.parse(io.TextIOWrapper(fd), format='fasta'):
            (_, protid, desc) = rec.description.split("|")
            try:
                desc = re.match(r"(?P<Entry>[\w]+) (?P<Protein>.+) OS=(?P<OS>[\w\s]+) OX=(?P<OX>[\w]+) GN=(?P<GN>[\w\-]+) PE=(?P<PE>[\w]+) SV=(?P<SV>[\w]+)", desc).groupdict()
                yield (desc['GN'], rec)
            except AttributeError:
                pass

proteins = dict(rec_by_gene(local))
genes = list(proteins.keys())

print("Got genes:", *genes[0:4], "etc.")

gene = genes[0]
print("Example: Gene", gene, "has protein sequence", proteins[gene].seq)

Output:

Got genes: A1CF A2ML1 ACOX3 SCG5 etc.
Example: Gene A1CF has protein sequence MESNHKSGDGLSGTQKEAALRALVQRTGYSLVQENGQRKYGGPPPGWDAAPPERGCEIFIGKLPRDLFEDELIPLCEKIGKIYEMRMMMDFNGNNRGYAFVTFSNKVEAKNAIKQLNNYEIRNGRLLGVCASVDNCRLFVGGIPKTKKREEILSEMKKVTEGVVDVIVYPSAADKTKNRGFAFVEYESHRAAAMARRKLLP