How to get the gene ID
2
0
Entering edit mode
19 months ago

Good day!

We are trying to extract protein sequences from the fasta file found in NCBI. However, the IDs we have are the transcripts IDs instead of gene IDs. Is there an easy way to get the geneIDs from the fasta file using the transcript IDs that we have?

For example I have this protein sequence in fasta file format: enter image description here

then I have the transcript ID: ENSGALG00000042750.1 ENSGALG00000032142.1

However, we need to get the geneID from the fasta file using the transcript ID.

example: for ENSGALG00000042750.1 >> ENSGALP00000056694.1 for ENSGALG00000032142.1 >> ENSGALP00000046506.1

So far, we are manually putting the geneID in a txt file using the transcript ID, however, the data is almost 4,000 which means we need to manually encode 4,000 geneIDs from the transcript ID.

transcriptID geneID • 1.2k views
ADD COMMENT
1
Entering edit mode
19 months ago
Shred ★ 1.6k

Here you are. It's a modified version of this

Run it using

python3 script.py --organism gallus_gallus

It'll produce a tsv under the current directory with transcript_id / gene_id

#!/usr/bin/python3 

import sys
import os
import tempfile
import gzip
import argparse
from ftplib import FTP


def parse_attributes(kvps):
    ''' Parse the 9th column of GTF into a dictionary of attributes.'''
    f_dict = {}
    kvp = kvps.split(';')[:-1]
    for i in kvp:
        ri = i.replace(' "', '="')
        try:
            k, v = ri.split('=')
            rk = k.replace(' ', '')
            f_dict[rk] = v
        except ValueError:
            print('Malformed file.\nThis line isn\'t good. %s' % kvp)
            exit()
    return f_dict
def read_gtf_as_dict(gtf_file):
    # function reads gtf file and returns a dict with scheme
    # {ensembl_gene_id: HGNC symbol}
    output_dict = {}
    temp_dict = {}
    for line in gtf_file:
        if line.startswith('#'):
            continue
        else:
            if line.split('\t')[2] == 'transcript':
                if "gene_id" in line.split("\t")[8].rstrip():
                    temp_dict=parse_attributes(line.split('\t')[8].rstrip())
                    try:
                        transcript_id = temp_dict['transcript_id']
                        transcript_id_r = transcript_id.replace('"','')
                        ensgid = temp_dict['gene_id']
                        ensgid_r = ensgid.replace('"','')
                        output_dict[transcript_id_r] = ensgid_r
                    except KeyError:
                        print('Malformed GTF file.')
                        exit()
                else:
                    continue            
            else:
                continue
    return output_dict
def avail_organism(version):
    '''
    Function returns available organism for provided version
    '''
    ftp = FTP("ftp.ensembl.org")
    ftp.login()
    ftp.cwd('/pub/release-'+str(version)+'/gtf')
    list_organism = ftp.nlst()
    return list_organism
def get_gtf(organism, version, organism_list):
    '''
    This function downloades GTF from the provided organism.
    First it looks for correspondance than downloads the file
    '''
    if organism not in organism_list:
        print('Provided organism is not in list of Ensembl annotated organism.')
        print('Please check the available list with argument: --list Y')
        sys.exit()
    else:
        print('-------------------------------')
        print('Downloading GTF for '+organism)
        print('-------------------------------')
        ftp = FTP("ftp.ensembl.org")
        ftp.login()
        ftp.cwd('/pub/release-'+str(version)+'/gtf/'+organism)
        list_dir = ftp.nlst()
        to_download = [x for x in list_dir if x.endswith(f'{version}.gtf.gz')==True]
        tmp_file = tempfile.TemporaryFile()
        with open("tmp_file.gz", 'wb') as tmp_file:
            ftp.retrbinary(str('RETR '+to_download[0]), tmp_file.write)

parser = argparse.ArgumentParser(description="Retrieve gene ID conversion table using gene annotation from Ensembl")
parser.add_argument("--organism", help="Organism name")
parser.add_argument("--version", type=int, default=106)
parser.add_argument("-l", "--list", "--list-available", dest='list', help="Append any char to show available organisms.")
args=parser.parse_args()

if __name__ == '__main__':
    available_organism = avail_organism(args.version)
    if args.list is not None:
        for x in available_organism:
            print(x)
        exit()
    elif args.organism == None:
        print('No organism provided. Exiting..')
        exit()
    else:
        get_gtf(organism=args.organism, version=args.version, organism_list=available_organism)
        gtf = gzip.open('tmp_file.gz', 'rt')
        genes_dict = read_gtf_as_dict(gtf)
        output_filename = args.organism+'_gene_annotation.tsv'
        with open(output_filename, 'w') as oput:
            oput.write("Transcript_id"+'\t'+"Gene_id"+'\n')     
            for transcriptID in genes_dict.keys():
                oput.write(transcriptID+'\t'+genes_dict[transcriptID]+'\n')
        os.remove("tmp_file.gz")
        print('Gene list file saved under this directory at: --> '+output_filename)
        print('-------------------------------')
        print('Report any error on Gist. Enjoy')
ADD COMMENT
0
Entering edit mode

Thank you very much! It worked! May I also ask if you have also a script extracting protein IDs using transcript IDs? Thank you!

ADD REPLY
0
Entering edit mode

Or the sequenceI ID rather. We already have a list of the transcript IDs of the gene we wanted, however, in order to extract the protein sequence, we need the seqID or proteinID (the ID after ">").

ADD REPLY
1
Entering edit mode

If you find the answer useful then accept it, as it would be suggested to other users. The script works by grabbing the fields from a GTF file. It has no informations on protein IDs, so you couldn't. Use BioMart instead https://www.ensembl.org/biomart/martview/

ADD REPLY
0
Entering edit mode

Thank you!

ADD REPLY
1
Entering edit mode
19 months ago
Arunabh ▴ 10

Hi

There is a brute force method for this. You could upload the fasta sequence in tblastn and keep a filter of 100% sequence cover and blast. Usually the first hit should give you the Genbank/Refsec ID for your protein sequence.

The next one will require some scripting, but if you can run this in a loop, it will be faster than the previous method. Here, you can search for the protein ID in NCBI protein, you will get the following:

LOCUS QACA_STAAU 514 aa linear BCT 03-MAY-2023 DEFINITION RecName: Full=Antiseptic resistance protein. ACCESSION P0A0J9 VERSION P0A0J9.1 DBSOURCE UniProtKB: locus QACA_STAAU, accession P0A0J9; class: standard. extra accessions:P23215 plasmid:VRSAp,pSK1 created: Mar 1, 2005. sequence updated: Mar 1, 2005. annotation updated: May 3, 2023. xrefs: X56628.1, CAA39963.1, S12394, WP_000622776.1, YP_003813123.1, YP_536864.1 xrefs (non-sequence databases): AlphaFoldDB:P0A0J9, SMR:P0A0J9, TCDB:2.A.1.3.4, GeneID:66840931, OMA:CFQTIFY, GO:0005886, GO:0022857, GO:0046677, CDD:cd17321, Gene3D:1.20.1250.20, InterPro:IPR011701, InterPro:IPR020846, InterPro:IPR036259, InterPro:IPR005829, PANTHER:PTHR42718:SF42, PANTHER:PTHR42718, Pfam:PF07690, SUPFAM:SSF103473, PROSITE:PS50850, PROSITE:PS00216 KEYWORDS Antibiotic resistance; Cell membrane; Membrane; Plasmid; Transmembrane; Transmembrane helix; Transport.

You can see the GeneID there, 66840931. If this number is searched in NCBI gene, you will get an output that looks like this:

Snapshot of NCBI gene output

You can donwload the fasta from here. Hope this helps.

Arunabh

ADD COMMENT

Login before adding your answer.

Traffic: 2055 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6