Question

Extracting the coding sequences for a set of protein IDs - NCBI.

0

Entering edit mode

8.3 years ago

Tom ▴ 50

I have a set of protein IDs. In reality there are thousands, but for this example, there are these three NCBI IDs:

XP_016775379.1
XP_008018068.1
XP_007991648.1

I want to extract, automatically (as there are in reality thousands of sequences), the coding sequence that encodes each of these proteins (i.e. the coding sequence should be 3 times as long as the protein sequence).

My problem is that, for example, the last protein "XP_007991648.1", is encoded by the mRNA: XM_007993457.1. The full mRNA sequence is here. However, the actual coding sequence that encodes the specific protein that I want (i.e. it is a subset of the full mRNA sequence) is here. So I do not want to extract the full mRNA sequence, only the section of the mRNA that encodes the particular protein.

The ultimate aim is to extract the longest canonical transcript for each gene, an issue I've been trying to solve here.You can see from this post, I have tried lots of different ways to get this to work, but so far no luck.

If anyone could tell me specifically how to do this, possibly with a working command that works for these examples so that I can see how it works, I would appreciate it.

ncbi coding sequence protein parsing eutils • 2.5k views

ADD COMMENT • link updated 8.3 years ago by Sej Modha 5.3k • written 8.3 years ago by Tom ▴ 50

score 0 · Answer 1 · 2016-08-02

0

Entering edit mode

8.3 years ago

Sej Modha 5.3k

Have you tried NCBI e-utils?

esearch -query "XP_016775379.1" -db protein |elink -target nuccore|efetch -format fasta_cds_na

ADD COMMENT • link 8.3 years ago by Sej Modha 5.3k

0

Entering edit mode

Thank you for the reply.

So for example, if I run the command:

esearch -query "XP_007991648.1" -db protein | elink -target nuccore | efetch -format fasta_cds_na >> out_test3

According to NCBI, there is one coding sequence/one protein with the name "XP_007991648.1" here. So I want my output file to have one sequence, the exact coding sequence that encodes this protein (this protein is the longest transcript for this gene, the ultimate aim is to have the longest transcript for each gene).

When I run that exact command, there are 9 genes in the output. I ran the exact same command two seconds later (everything was the same) and there was 8 genes in the output, and then I ran the exact same command a third time, and there was 10 sequences in the output.

Having said that, I can see the right answer is in the file, the sequence with the gene name: lcl|XM_007993457.1_cds_XP_007991648.1_1 [gene=SCGB1C1] [protein=secretoglobin family 1C member 1]

The problem is that there is other genes in the same file e.g.

lcl|NC_023642.1_cds_XP_007991648.1_1 [gene=SCGB1C1] [protein=secretoglobin family 1C member 1] [protein_id=XP_007991648.1]

lcl|NC_023642.1_cds_XP_007982587.1_7 [gene=RIC8A] [protein=synembryn-A isoform X2] [protein_id=XP_007982587.1] 

lcl|NC_023642.1_cds_XP_007992831.1_9 [gene=RIC8A] [protein=synembryn-A isoform X5] [protein_id=XP_007992831.1] [location=join(19091..19174,19508..19555,19644..20234,20803..20894,21483..21633,22710..2

If I put in one protein sequence, and ask for the exact coding sequence that encodes that protein, I would expect one sequence in the output file?

In addition, I ran this command: elink -db assembly -target nuccore -id 733711 -name assembly_nuccore_refseq |efetch -db nuccore -format fasta_cds_na

I ran this command because I emailed the NCBI to ask them how to do this and they replied "Once installed and setup, this chained command set (described just above this sentence) will get you the CDS sequences...which is test verified here on linux". But this command gives me multiple transcripts of the same gene, and also, twice, it cut off after returning ~500 sequences; which I think may be an issue with a connection timing out.

ADD REPLY • link 8.3 years ago by Tom ▴ 50

0

Entering edit mode

If I put in one protein sequence, and ask for the exact coding sequence that encodes that protein, I would expect one sequence in the output file?

You would not as there are more codons than there are amino acids, hence you could have thousands if not millions of nucleotide sequences that code for the same amino acid sequence.

ADD REPLY • link 8.3 years ago by Jenez ▴ 540