Question

How to extract protein sequences from a published paper

3.3 years ago

Nelo ▴ 20

Hi

Is there any way to extract the multiple protein sequences given in a published paper, using either its PMID, its DOI or its supplementary files?

Thanks

DOI supplementary extract protein • 1.4k views

updated 3.3 years ago by GenoMax 148k • written 3.3 years ago by Nelo ▴ 20

It's unlikely you will be able to go directly from a paper DOI to a genetic sequence. If the paper lists the databases they uploaded the data to, with accession numbers etc, then it might be possible, but we'd need more information about what the paper says exactly.
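For the case where the paper or its supplement does list accessions, a rough sketch of how that could go is below. It assumes the text has been saved as a plain-text file, that the accessions are RefSeq-style protein IDs (NP_, XP_, YP_ or WP_ prefixes), and that NCBI EDirect is installed. The function names and file names are made up for illustration, and the pattern will miss other accession formats (UniProt, GenBank protein IDs and so on).

```shell
# Hypothetical helper: print the unique RefSeq-style protein accessions
# (NP_, XP_, YP_, WP_ prefixes) found in a plain-text file.
extract_refseq_protein_ids() {
    grep -oE '[NXYW]P_[0-9]+(\.[0-9]+)?' "$1" | sort -u
}

# Hypothetical helper: fetch FASTA records for the accessions listed
# one per line in file $1 and write them to file $2.
# Needs EDirect (efetch) and network access. The IDs are joined into a
# comma-separated list for -id, so this only suits short lists.
fetch_proteins_by_accession() {
    ids=$(paste -s -d, "$1")
    efetch -db protein -id "$ids" -format fasta > "$2"
}
```

Usage would be something like `extract_refseq_protein_ids supplement.txt > ids.txt` followed by `fetch_proteins_by_accession ids.txt seq.fa`, where supplement.txt is whatever text you exported from the paper.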

3.3 years ago by Joe 21k

Yes, some papers mention accession numbers, but others give no protein accession numbers at all, only the number of proteins they found in genome-wide studies of a specific plant species. That's why I am looking for a program that can download the sequences using the title, PMID or DOI.

3.3 years ago by Nelo ▴ 20

Caveat: This is likely not going to work for most papers. But if you have the right PMID then you could do the following.

$ esearch -db pubmed -query 22753475 | elink -target nuccore | elink -target protein | efetch -format fasta | grep ">" | head -10
>NP_001292578.1 uncharacterized protein LOC103503105 [Cucumis melo]
>NP_001284396.1 uncharacterized LOC103502119 [Cucumis melo]
>NP_001284656.1 Transcription factor HY5-like [Cucumis melo]
>NP_001284432.1 ABSCISIC ACID-INSENSITIVE 5-like protein 2-like [Cucumis melo]
>NP_001284448.1 Sodium/hydrogen exchanger 2-like [Cucumis melo]
>NP_001284444.1 TMV resistance protein N-like [Cucumis melo]
>NP_001284453.1 ethylene receptor 1 [Cucumis melo]
>NP_001284384.1 alpha-farnesene synthase [Cucumis melo]
>NP_001284474.1 profilin [Cucumis melo]
>NP_001284461.1 translationally-controlled tumor protein homolog [Cucumis melo]

3.3 years ago by GenoMax 148k

First of all, thank you so much for replying again.

So the number '22753475' is the PMID, I guess, but what is the last part, 'grep ">" | head -10', for? Are we limiting the number of results, since you got exactly 10 results here?

Also, it has been 10 minutes since I ran this command and it is still running.

3.3 years ago by Nelo ▴ 20

22753475 is the PMID. I added the part from grep onwards only to demonstrate that this works: it shows just the first 10 header lines. You will need to take that part out to save the sequences. Simply redirect the output to a file: esearch .. blah > seq.fa.
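Spelled out, that could look roughly like the sketch below. It assumes NCBI EDirect (esearch, elink, efetch) is installed and that you have network access. The two function names are invented here for illustration, and the same caveat applies: many PMIDs will have no linked sequences at all.

```shell
# Hypothetical helper: save every protein linked to PMID $1 (via nuccore)
# as FASTA in file $2. Same pipeline as above, minus the grep/head preview.
pmid_to_protein_fasta() {
    esearch -db pubmed -query "$1" \
        | elink -target nuccore \
        | elink -target protein \
        | efetch -format fasta > "$2"
}

# Hypothetical helper: count the records in a FASTA file.
# Every record starts with one header line beginning with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `pmid_to_protein_fasta 22753475 seq.fa` and then `count_fasta_records seq.fa` to see how many sequences came back. To peek at the first few headers of the saved file, `grep ">" seq.fa | head -10` does the same job as the preview in the original command.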

3.3 years ago by GenoMax 148k