Use esearch in Python to get the nucleotide sequence from a protein sequence ID
5.1 years ago
Chvatil ▴ 130

Starting from a protein FASTA file, I would like to get the corresponding NCBI nucleotide sequences and write them to a new FASTA file.

I know that esearch can do this in Python, but I do not know how to handle the FASTA files and so on.

I guess it should be something like:

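from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address

with open("Nucleotide_fasta_file", "w") as nuc_file:
    for record in SeqIO.parse("Protein_fasta_file", "fasta"):
        # Look up the protein accession at NCBI (this is the step I am unsure about)
        handle = Entrez.esearch(db="protein", term=record.id)
        # Then, somehow, write the matching nucleotide record:
        # print(">" + nucleotide_id, file=nuc_file)
        # print(nucleotide_sequence, file=nuc_file)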

Here is an example. If I have a protein FASTA file such as:

>YP_001883411.1
MDEMPQHSHNLPSPPTDTLRPSSSHNGPRKENGDKDLQPLQPPSSTMGNQHNISVKIGMI
MDNFNDVIERTLNRMHIYVMTEKINFICMAQTQYHNLVFERELFDILCKNKYAIEHEAAD
NNPSDDSLHLANNDYPFSLIICSKEAKRSSILLSNVGLVLVKFILYLRMLYRNVARPDSP
NCFLAINTRILPFGRVCIDIDYKIPLLGEGDGDYDEFVRKSFELVAQYTTMGNIIMTRNC
YTPNTRSFHLITEQQFDATTRHIIFMRIAEGIRALNDNVKIDQVHVWMLPFGRGHVPVRK
YDRRLNEFMDLVYPYTEVDFELMMPFDVSQGMDNLYTLYSLNTDEVAVANGDDFNYDVLH
ECLGRDIMDSYYIGDTSDIAKNLQLLSRVLSSRYEFAFSSQFRQRYDANYLSRVMGNKFN
NAFIIMSNKKLQLETSWNIPKPKAKPRPSERYEIYRFIEHRLLGEMRTLEDLFKVMPSDL
VKRNVNIGNVDELQNEELAKRKLASEINVDSVYTGPVHNLALDIAENQCIDFVWDNIVED
TSSHPWPYVYIGRSQIVQTAMSHMRNVHEYYTRTIVDNVYTYEKHDGIMRCFENIQSMFD
GQLLADTMNALCQNIFCSDFQRVQSYLFTMYEFFCRVHYIDACVTRSEFEEVVEAYLLET
VRPMFDEADLKTYRERHVQMTPGPKAYAKFPSPCSPLARVWDSIRPANKILLHIIYLMIV
EHNYSSVFFHLHTITRNKDNSQILASLFLHIIDNSYMTTPEGDEDSGDDTMNVVRSYASK
EFLNFIYMMFINAGVNYECDLRGKNVVFSSVDSKDYINDMKELIVTSPLWFFLCNYQYVD
EQMSYSNRFDLFAAIFRQDNSAGGRQTSSPPSADQRTSNGPSPPKQRRGVSGGGSGGSGG
NNGAANFNTERVLNANNISDRYHADLLRIFFRYILAYMKTETGVYIYDGVRMMSLPFKEP
SNIPQLEVKDPVSFLGMYRHQYGIYNTWTMQMERNINVLNGQINISNDELGNYPHLFNPY
NDDIYRLLVNRFLKSITFTRVINYQKNLALFLAPIYDPNVENNLKVLNYNIDSIQINIHD
LASSEFNIPQEMFVDILDVGKKPKNKLYEMFKWLYCIVCHYSENYSCVITTPSTFIPKCM
LPECGEDKENIFSMVKGNGAMDDDDNNNNGYGGGDNFLQRLHSVLESKNREQKEIITDEL
QKLSQFELSTLVNLFKNAYMFEEDGEDAENGILEGIEGMDNNMEVDHDISAQSQTPGSPS
PQSSSANATTEEIFKFNLSVHRNMAGGGGGNGTDIMDFFEGDEFVENSKIKLLLNLFNRK
LSAEIKQMSPEAFKQIIEDNHSHHITRFVLLTLSWLIRTLHTHIFADTRFFRELQQYRQL
LYDDLSDLVFRHNGYFMYNNRITDVAQIFSHYCRHVELVVDPVFEMSMSVDRDLYLQHDA
ELEKRVSPEIVRDIEDACVSAIYQGQFIEDTNVDLSRLWARVTVPRNKHRISPLFTLHTA
TGKSEYLTERCRRHFNNKYFNNVLDPSSLAQTDHRGTDMARELNTNLIVCIEEFNSLTAK
FKQVCGYTSVAYKPLFADTKVSFQNNSTVILSTNNDPKCNEEAIVARLHVYPRRIQYANV
NKYLKFQRSSMLASTSLLKINNIMSVQMIMEKMPRVLAENYRGNFMMTWLLKRFFLFNII
DHVTVHTSETLQNHINNFYTMINAQEFVLQRLDMTTTSTMTLVQFRRLVNRICEENRSLF
NTKIDTYNVYKILCDRLKALINNDQQTIRISEKNDNALRQ

>YP_009345696.1
MDRYFINEMQLFERKLTFDSNNDYHHLEIISSHENQHIKKTKITLFTYNLLLDCLHYFYM
KCVDSNLFYDSGLTLVLHKEKKIFLNQLIFDVDFKSANISAIISKEKINDYNDFLNERNN
IIKNMIYIIFKYLKIDFTVENIQKYCSVTSRPLKLSFHLHIFYHVDYFTERILRFKMHNN
WKTFVTSMDTNFILDEPILHSLPFSRQHRPNKIHCQQTESEDVNAICINVDLLFLKDSLD
ISISNETWKLFSIFNIEKVLLKNISLLTKKKSDFTILIKEDYGNEINIFNGKKIGFFSFE
DINNFISTSMEIENVKNKEIGIIDEDEYNIHIPSVYINHIKRQHKIILYDIKDSIVHESI
DSVLFFDLLKYASFLSRYNKTDNFSINLSTKNYNENENIDEFYETVNLFSLKYKKDFFSI
FDDASNDEKEQNNIENSCEFDENDVYRRGEKNKEEINIICSDEKMKKFQHFSDSSFFPFT
EKQIHFLESIINKYNEDILSETDNAVCSKVLEYLESLNSLYPPLLFLLGSSFFNYSNDED
LQLISMYINNVDVNLPKIQTNKKRKLTTSKVDKNDIQRILKMDYKWIKEILNYIINCGCV
TSSVYLLSRIGVFQSLFDNIYYILSSMSLTLHAQYILSKWIQIVPEPITFFSSSFYNSDI
AYMCLILFDFNEKSREHKEKKNFLSGDNLNIKFIYSLFIKILSKFEIEYNSELEVFFITQ
ILMCRSLGYRAIYFNGHGHVISQDTSFLKDFCEKKSTLMLPKYKATDDVLNSFVYIENVG
IFNTLFNVYEFPSPSLNSLVTHKLPSVLTYCDNNINYFSHTTYPLLQNFIFELYGKMFHF
SKKCKENVVSLILFSPNVRISKKCPEILMELDILNFYDSSFIFLLEDYAKILNNHSNIEE
YWNDDFISIITTTNDKYSFFLKRLLFIFYIIFQENKILNFSNIILFIQTLFGQSKYEQLG
LRKQTFFKNINANNNKNNNNNCTTTDDASNGENKDIEMIEDDKLYHKTISYSAKHFTDKM
EDNTCFFNDIRKRINSVLYEQHELKKKIINETILTTKIIKPSLLSKCEEQNNYMNFCLSD
YYENVDDIILPFDKNDKSLFNTKIKNHFLNVLLNIDNSENNDYIELISSIDLTKQFKKNL
YFLVLLLNWFLKMGNIHAYSNTSFFKDIQKYRQEIYNLMTNHVLKSFGPLLKFSSSFHLA
NIIQKFNENTELDFDFVLEKFSHLKHPFYEKNNNKKLLNNENIKKKYFSISDRDIIADNE
MLRSKNHIYHAMAVLLMFSEFNFDTLTDVIKFMIYILYKGNMLRICLYLFGVTESLKSRF
TEILANLTQTDQTQSFNNANISRSAVQDFDSIVISGASNTFIFFDEVEKVCITRFKTIVN
SIAMSSRDIKNSEAINLKLACTPLMSSNSPFVVDNASCARLRPIRKKMQFCETFSGETFD
RFSIDSLKEYISTPNIGGIFLINRIPMWHDNSASIIGFLLIQRYLYPYFFKSYTSPISKK
MSKTMKNELRIYMSNNHPVAFFLNTVKISVDHNFISEQDFYKLIDKWWLAYKDRFKNSDF
DSKSLVKEISQYLTQYKSTLNNVKGYNIKIVLE
esearch python entrez • 1.3k views
5.1 years ago
GenoMax 147k

You should only need the accession number, not the actual protein sequence, to get the nucleotide sequence (output truncated for brevity):

$ efetch -db protein -id "YP_001883411.1" -format fasta_cds_na
>lcl|NC_010671.1_cds_YP_001883411.1_1 [locus_tag=MdSGHV083] [db_xref=GeneID:6295473] [protein=hypothetical protein] [protein_id=YP_001883411.1] [location=complement(85755..91097)] [gbkey=CDS]
ATGGACGAAATGCCCCAGCACAGCCACAATCTCCCCTCGCCGCCAACGGACACACTCCGACCATCATCGT
CCCACAACGGACCGAGGAAGGAGAACGGGGACAAGGATCTGCAACCACTGCAACCGCCGTCATCAACAAT
GGGCAATCAGCATAACATCAGCGTCAAGATCGGTATGATAATGGATAATTTTAATGATGTAATCGAGCGT
ACACTGAATCGTATGCATATCTATGTAATGACCGAGAAAATAAACTTTATCTGCATGGCGCAGACACAGT
ATCACAATTTGGTCTTTGAGCGAGAACTTTTCGACATTCTGTGCAAGAACAAGTATGCCATCGAACATGA
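
The header line returned above packs its annotations into bracketed key=value pairs. As a minimal sketch, assuming every header from `fasta_cds_na` follows the same pattern as the one shown, those fields could be pulled apart in Python like this (`parse_cds_defline` is a hypothetical helper name, not part of any library):

```python
import re

def parse_cds_defline(defline):
    """Split a fasta_cds_na header into its sequence ID and [key=value] fields.

    Assumes the layout seen in the efetch output above.
    """
    # The ID is the first whitespace-delimited token after the '>'
    seq_id = defline.lstrip(">").split()[0]
    # Each annotation looks like [key=value]; values may contain parentheses
    fields = dict(re.findall(r"\[([^=\]]+)=([^\]]*)\]", defline))
    return seq_id, fields

header = (
    ">lcl|NC_010671.1_cds_YP_001883411.1_1 [locus_tag=MdSGHV083] "
    "[db_xref=GeneID:6295473] [protein=hypothetical protein] "
    "[protein_id=YP_001883411.1] [location=complement(85755..91097)] [gbkey=CDS]"
)
seq_id, fields = parse_cds_defline(header)
print(seq_id)
print(fields["protein_id"], fields["location"])
```

This can be handy for writing the original protein accession back as the FASTA header instead of the longer `lcl|...` identifier.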

OK, thank you. I tried it in Python:

handle = Entrez.efetch(db="sequences", id="YP_001883411.1", rettype="fasta_cds_na", retmode="text")
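
For a whole protein FASTA file, the same call can be wrapped in a loop. The following is only a rough sketch under a few assumptions: the file names are the placeholders from the question, the email address is a dummy value, `db="protein"` is used as in the command-line answer, and the batch size of 100 is an arbitrary choice. `read_protein_ids` and `fetch_cds_to_file` are hypothetical helper names.

```python
def read_protein_ids(fasta_path):
    """Collect the accession (first word of each '>' line) from a FASTA file."""
    ids = []
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:  # skip a bare '>' with no identifier
                    ids.append(parts[0])
    return ids

def fetch_cds_to_file(protein_ids, out_path, email, batch_size=100):
    """Fetch the coding nucleotide sequence for each protein accession."""
    # Biopython is imported here so read_protein_ids can be used without it
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    with open(out_path, "w") as out:
        # Request several accessions per call to limit the number of queries
        for start in range(0, len(protein_ids), batch_size):
            batch = protein_ids[start:start + batch_size]
            handle = Entrez.efetch(
                db="protein",
                id=",".join(batch),
                rettype="fasta_cds_na",
                retmode="text",
            )
            out.write(handle.read())
            handle.close()
```

It would be called as `fetch_cds_to_file(read_protein_ids("Protein_fasta_file"), "Nucleotide_fasta_file", email="you@example.org")`. Accessions without an annotated CDS may simply return nothing, so it is worth comparing the number of records in the output against the input.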

So you were able to make it work? I can move my comment to an answer in that case.


Yes, it works now, thank you :)
