Getting Intron Positions from Amino Acid Sequences
19 months ago
fafad046 • 0

I have a list of NCBI protein accession numbers from multiple organisms.

  1. What is the best way to get the intron positions corresponding to each amino acid sequence?

My current idea is to download the DNA sequences from NCBI and compare them with the corresponding mRNA sequences. However, I had trouble finding the DNA sequence for a given protein accession number, which leads to my follow-up questions:

  2. How do I get the DNA accession numbers that correspond to my protein accession numbers on the NCBI website?
  3. Alternatively, are there any databases where I can get both the protein sequences and the intron positions more cleanly?

Any input is greatly appreciated, thank you!

19 months ago
GenoMax 147k

Using Entrez Direct (output truncated to save space):

$ esearch -db protein -query NP_001243728.1 | elink -target gene | efetch -format gene_table
GAPDH glyceraldehyde-3-phosphate dehydrogenase[Homo sapiens]
Gene ID: 2597, updated on 29-Mar-2023


Reference GRCh38.p14 Primary Assembly NC_000012.12  from: 6534517 to: 6538371
RNA transcript variant 6 NR_152150.2, 7 exons,  total annotated spliced exon length: 763

Exon table for  RNA  NR_152150.2
Genomic Interval Exon   Gene Interval Exon    Exon Length    Intron Length
--------------------------------------------------------------------------
6534517-6534569         1-53                  53             240
6534810-6534861         294-345               52             1632
6536494-6536593         1978-2077             100            90
6536684-6536790         2168-2274             107            129
6536920-6537010         2404-2494             91             90
6537101-6537189         2585-2673             89             911
6538101-6538371         3585-3855             271

mRNA transcript variant 4 NM_001289746.2, 8 exons,  total annotated spliced exon length: 1525
protein isoform 1 NP_001276675.1 (CCDS8549.1), 8 coding  exons,  annotated AA length: 335
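
To turn an exon table like this into intron positions on the protein, one rough approach is to sum the coding lengths of the exons that precede each intron. The running total divided by 3 gives the number of complete codons before the intron, and the remainder gives the intron phase. Below is a minimal sketch, assuming the per-exon coding lengths (in nucleotides, in transcript order) have already been pulled out of the gene table, one number per line. The function name `intron_aa_positions` and the example lengths are made up for illustration and are not part of Entrez Direct.

```shell
# intron_aa_positions: hypothetical helper, not part of Entrez Direct.
# stdin:  coding length (nt) of each coding exon, one per line, in transcript order.
# stdout: one line per intron with its index, the coding nt before it,
#         the number of complete codons before it, and its phase (0, 1 or 2).
intron_aa_positions() {
  awk '
    NF > 0 { len[++n] = $1 + 0 }
    END {
      cum = 0
      for (i = 1; i < n; i++) {      # the last exon has no intron after it
        cum += len[i]
        printf "intron %d\tcds_nt=%d\taa_before=%d\tphase=%d\n", i, cum, int(cum / 3), cum % 3
      }
    }'
}

# Example with made-up coding lengths of 100, 50 and 50 nt:
printf '100\n50\n50\n' | intron_aa_positions
```

Phase 0 means the intron sits between two codons. Phase 1 or 2 means it splits the codon for residue `aa_before + 1` after its first or second base.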
Thank you so much, it worked! You are an absolute lifesaver :D

Do you happen to know why my Entrez Direct command sometimes gives no output at all?

I installed Entrez Direct using the commands listed on the website:

sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

export PATH=${PATH}:${HOME}/edirect

The command below sometimes gives the output I want (all the gene tables corresponding to that protein ID), but now it gives no output at all:

esearch -db protein -query NP_001013036.1 | elink -target gene | efetch -format gene_table > whole_table.txt

I'm really hoping to get it working, as it has been such a convenient tool. Thank you so much in advance.

Did you sign up for an NCBI API key (set as NCBI_API_KEY)? This is a public resource, so if you are running a large number of queries, put a pause between query blocks. Even with an NCBI API key you are only allowed a certain number of queries per unit time.

The query above does work.

$ esearch -db protein -query NP_001013036.1 | elink -target gene | efetch -format gene_table
APP amyloid beta precursor protein[Pan troglodytes]
Gene ID: 473931, updated on 31-Mar-2023


Reference NHGRI_mPanTro3-v1.1-hic.freeze_pri NC_072419.1  (minus strand) from: 24584412 to: 24299610
mRNA transcript variant X4 XM_009452766.4, 16 exons,  total annotated spliced exon length: 3430
protein isoform X4 XP_009451041.1, 16 coding  exons,  annotated AA length: 695

Exon table for  mRNA  XM_009452766.4 and protein XP_009451041.1
Genomic Interval Exon    Genomic Interval Coding  Gene Interval Exon  Gene Interval Coding  Exon Length  Coding Length  Intron Length
-------------------------------------------------------------------------------------------------------------------------------------
24584412-24584130        24584186-24584130        1-283               227-283               283          57             58502
24525627-24525460        24525627-24525460        58786-58953         58786-58953           168          168            21867
24503592-24503463        24503592-24503463
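
The advice about pausing between query blocks could be sketched as a small loop that fetches one accession at a time, sleeps between requests, and retries when a result comes back empty. This is only a sketch under assumptions: `fetch_one`, `fetch_gene_tables`, `OUT_DIR`, `PAUSE_SECONDS` and `FETCH_CMD` are hypothetical names, and the one-second default pause is a guess rather than a documented requirement. As far as I know, NCBI's published limits are 3 requests per second without an API key and 10 with one, and each pipeline below issues several requests.

```shell
# Hypothetical helpers for this sketch; only esearch, elink and efetch come from Entrez Direct.

# fetch_one ACCESSION: the pipeline from this thread for a single protein accession.
fetch_one() {
  esearch -db protein -query "$1" | elink -target gene | efetch -format gene_table
}

# fetch_gene_tables: read accessions on stdin (one per line) and write ACCESSION.txt
# files into $OUT_DIR, pausing $PAUSE_SECONDS between requests and retrying empty results.
fetch_gene_tables() {
  fgt_dir="${OUT_DIR:-gene_tables}"
  fgt_pause="${PAUSE_SECONDS:-1}"
  fgt_cmd="${FETCH_CMD:-fetch_one}"
  mkdir -p "$fgt_dir" || return 1
  while IFS= read -r fgt_acc; do
    [ -n "$fgt_acc" ] || continue
    fgt_out="$fgt_dir/$fgt_acc.txt"
    [ -s "$fgt_out" ] && continue          # already fetched on an earlier run
    for fgt_try in 1 2 3; do
      # </dev/null keeps the fetch command from swallowing the accession list on stdin
      "$fgt_cmd" "$fgt_acc" < /dev/null > "$fgt_out"
      [ -s "$fgt_out" ] && break
      sleep "$fgt_pause"
    done
    [ -s "$fgt_out" ] || echo "no output for $fgt_acc after 3 tries" >&2
    sleep "$fgt_pause"
  done
}

# Usage (not run here): fetch_gene_tables < accessions.txt
```

Because accessions that already have a non-empty output file are skipped, the loop can simply be re-run after a restriction to pick up where it left off. Entrez Direct reads the key from the `NCBI_API_KEY` environment variable, so exporting it before the run is enough.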

Hi GenoMax, do you happen to know whether there is a site where I could bulk download all the gene tables? I tried looking on the NCBI FTP site but couldn't find them. The E-utilities have a very strict rate limit and I kept getting restricted. Thank you so much!
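
One alternative that avoids E-utilities for bulk work is the GFF3 annotation file that NCBI publishes for each annotated genome assembly, where the CDS rows carry a `protein_id` attribute. A rough sketch follows, assuming an uncompressed GFF3 on stdin and that all CDS rows for the protein lie on one sequence and one strand. `cds_introns` is a made-up name, and the file name in the usage comment is a placeholder.

```shell
# cds_introns PROTEIN_ID: hypothetical helper that reads GFF3 on stdin and prints,
# for each intron between CDS segments of that protein (in transcript order):
#   intron index, genomic start, genomic end, complete codons before it, phase
cds_introns() {
  awk -F '\t' -v pid="$1" '
    $3 == "CDS" && index($9 ";", "protein_id=" pid ";") {
      n++; s[n] = $4 + 0; e[n] = $5 + 0; strand = $7
    }
    END {
      # insertion sort of the CDS segments by genomic start, ascending
      for (i = 2; i <= n; i++)
        for (j = i; j > 1 && s[j-1] > s[j]; j--) {
          t = s[j]; s[j] = s[j-1]; s[j-1] = t
          t = e[j]; e[j] = e[j-1]; e[j-1] = t
        }
      cum = 0
      for (k = 1; k < n; k++) {
        # a = segment before the intron, b = segment after it, in transcript order
        if (strand == "-") { a = n - k + 1; b = a - 1 } else { a = k; b = k + 1 }
        cum += e[a] - s[a] + 1
        if (strand == "-") { lo = e[b] + 1; hi = s[a] - 1 } else { lo = e[a] + 1; hi = s[b] - 1 }
        printf "%d\t%d\t%d\t%d\t%d\n", k, lo, hi, int(cum / 3), cum % 3
      }
    }'
}

# Usage (placeholder file name): cds_introns NP_001013036.1 < annotation.gff
```

The last two columns follow the usual convention: the number of complete codons before the intron, and the phase (0 between codons, 1 or 2 inside the next codon). For a whole list of accessions, one pass per protein over a large GFF3 is slow, so filtering the CDS rows into a smaller file first may be worthwhile.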
