Question

Converting RefSeq protein accession IDs into Entrez IDs

2.3 years ago

Pegasus ▴ 130

Hi, I have a list of genes with RefSeq accession IDs that I want to convert to Entrez IDs, so that they can be used in Gene Ontology enrichment and pathway analysis tools such as DAVID and g:Profiler (these IDs belong to a bacterial species that is not supported by Ensembl or g:Profiler).

I followed this post:

Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession

but I am still not able to convert these IDs, because they come from a different organism/species. These RefSeq IDs were extracted from the reference.genome.gtf file (downloaded from NCBI).

Examples of these RefSeq protein accessions are shown below:

WP_007431075.1 WP_010344636.1 WP_017427837.1 WP_014278738.1 WP_010344656.1 WP_019688556.1 WP_016819793.1 WP_007724645.1 WP_016821111.1 NA WP_010347944.1 WP_016819622.1 NA

Could you please suggest a website, tool, or R package?

Thank you

RNA-SEQ • 1.2k views

2.3 years ago by Pegasus ▴ 130

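WP_* accession numbers are non-redundant protein records, so a single accession can refer to the same protein in multiple genomes. See: https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/

The best you could do is to get the IPG (Identical Protein Groups) IDs.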

$ efetch -db protein -id WP_017427837 -format ipg
Id      Source  Nucleotide Accession    Start   Stop    Strand  Protein Protein Name    Organism        Strain  Assembly
38029250        RefSeq  NZ_AMQU01000019.1       79641   80531   -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus sp. ICGEB2008     ICGEB2008       GCF_000307675.1
38029250        RefSeq  NZ_CP023711.1   1197135 1198025 -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus polymyxa  C12     GCF_022649565.1
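
To make the "multiple genomes" point concrete, here is a minimal sketch that lists every RefSeq genome carrying the protein in a report like the one above. It assumes the report is tab-delimited with the columns in the order shown in the header; the function name `refseq_genomes` is made up for illustration.

```shell
# Read an IPG report on stdin and print one line per RefSeq genome that
# carries the protein: IPG ID, nucleotide accession, organism, assembly.
# Assumption: tab-delimited columns in the order Id, Source, Nucleotide
# Accession, Start, Stop, Strand, Protein, Protein Name, Organism, Strain, Assembly.
refseq_genomes() {
  awk -F'\t' 'NR > 1 && $2 == "RefSeq" { print $1 "\t" $3 "\t" $9 "\t" $11 }'
}

# Example (needs Entrez Direct installed):
#   efetch -db protein -id WP_017427837 -format ipg | refseq_genomes
```

Each output line is one genome, so a single WP_ accession producing several lines is exactly the one-to-many relationship described above.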

2.3 years ago by GenoMax 153k

Thank you, GenoMax. Since the efetch command is not available on the HPC I am working on, I replaced it with the command below:

 curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=WP_017427837&rettype=ipg&retmode=text"

It worked well, as shown below, so I have a few questions:

Which number represents the IPG ID?
Can it be modified to run automatically over a list of 4000 IDs in a CSV file and produce a list of the corresponding IPG IDs as output.csv?
Should I then convert the IPG IDs into Entrez IDs so that I can proceed to gene ontology/pathway analysis? If yes, what tool do you recommend?

 Id      Source  Nucleotide Accession    Start   Stop    Strand  Protein Protein Name    Organism        Strain  Assembly
    38029250        RefSeq  NZ_AMQU01000019.1       79641   80531   -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus sp. ICGEB2008     ICGEB2008  GCF_000307675.1
    38029250        RefSeq  NZ_CP023711.1   1197135 1198025 -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus polymyxa  C12     GCF_022649565.1
    38029250        RefSeq  NZ_JWJJ01000001.1       862877  863767  -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus polymyxa A18      A18     GCF_000809185.2
    38029250        INSDC   AMQU01000019.1  79641   80531   -       KKD53569.1      protein lplB    Paenibacillus sp. ICGEB2008     ICGEB2008       GCA_000307675.1
    38029250        INSDC   CP023711.1      1197135 1198025 -       UNL92992.1      sugar ABC transporter permease  Paenibacillus polymyxa  C12     GCA_022649565.1
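
One possible way to run the curl call above over a whole list is sketched here. It rests on several assumptions that are not confirmed in this thread: the report is tab-delimited, its first column (`Id`, 38029250 in every row above) is the IPG ID, and the input file has one accession per line with no header. The function names and the file name `ids.txt` are made up for illustration.

```shell
# Sketch: map RefSeq WP_ accessions to IPG IDs with the same E-utilities URL as above.
# Assumptions: tab-delimited report, first column ("Id") is the IPG ID,
# input file has one accession per line and no header row.

EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Download the IPG report for one protein accession.
fetch_ipg() {
  curl -s "${EUTILS}?db=protein&id=$1&rettype=ipg&retmode=text"
}

# Read an IPG report on stdin and print the Id value of the first data row.
first_ipg_id() {
  awk -F'\t' 'NR > 1 {
    id = $1
    gsub(/[ \t]/, "", id)                  # tolerate leading spaces
    if (id ~ /^[0-9]+$/) { print id; exit }
  }'
}

# Read accessions from file $1 and print "accession,ipg_id" rows (CSV) to stdout.
map_accessions() {
  echo "accession,ipg_id"
  while IFS= read -r acc || [ -n "$acc" ]; do
    acc=$(printf '%s' "$acc" | tr -d '\r ')   # drop stray CR and spaces
    if [ -z "$acc" ] || [ "$acc" = "NA" ]; then
      continue                                 # skip blank lines and missing IDs
    fi
    ipg=$(fetch_ipg "$acc" | first_ipg_id)
    echo "${acc},${ipg:-NA}"                   # NA when no ID could be parsed
    sleep 1                                    # pause between requests to NCBI
  done < "$1"
}
```

With the accessions saved one per line in `ids.txt`, the call would be `map_accessions ids.txt > output.csv`. Only the first data row is read per accession, because every row of a report shares the same `Id` value.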

2.3 years ago by Pegasus ▴ 130