Dear All, The question is quite self explanatory - what is the easiest way (using python) to get the corresponding DNA sequence from which the translated protein sequence is available in the RefSeq NCBI database.
For example, I have a fasta file after a blast search -
>WP_010594839.1 MULTISPECIES: NAD(P)-dependent alcohol dehydrogenase [Rhodococcus]
MKALQYTEIGSEPVVVDVPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYIYGLPLTLGHEGVGTVAELGAGVTGFET
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA
TAVFDFVGAQSTIDTAQQVVAIDGHISVVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMDVVDLARAGRLDIHTETF
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_024100401.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus pyridinivorans]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTVAELGDGVTGFET
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA
TVVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_016693432.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus rhodochrous]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTIAELGAGVTGFEK
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAAPLTDAGLTPYH
AISRVLPLLGPGSMAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA
TAVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF
TLDEGPTAYRRLREGSIRGRGVVVPG
I am using Biopython currently to try and map the refseq ID (e.g WP_010594839.1) and the Entrez.fetch to somehow obtain the corresponding DNA sequence. But, I am finding there is no clear way even non-programmatically to go and do something like this. The closest I can get is getting the whole genome of the file (by clicking the "nucleotide" or "genome" links) on the right side of the webpage of WP_010594839.1 (i.e. https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1) - (something I wouldnt know know how to do programmatically even if the whole genome was what I am after).
Can anyone please give me some advice on whether there is some super obvious way that I have totally missed, to get the corresponding coding DNA sequence of a protein given its reference number? Programmatically is preferred, but even manually is a good start for now!
Thank you
Thanks genomax - I cant believe my vigilant googling didnt get me to that page explaining where all the transcripts of a reference seq is kept! Need to improve my searching skills:) I am assuming now using the Entrez etuils wrapper in Biopython can get me the necessary transcripts. I will take a look at doing it myself!
That may not be directly possible since
WP
accessions are not linked on backend with a single DNA sequence. I did try to see ifefetch
worked briefly (it did not).Is there no way to use the Entrez utils wrapper in biopython to access the "identical proteins" link ? I.e for https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1 ; this would lead to "https://www.ncbi.nlm.nih.gov/ipg/?term=WP_010594839.1" ; where any of the entries there gets me to a page which has a DNA transcript which I am happy to use. How do I get to that particular entry without web scraping or using any other tool that python/biopython?
A lot of times I find
eUtils
to be in "oh so close but no cigar" category (Disc: I may be doing something wrong as well).Something odd seems to be happening with querying
ipg
database and the retrieval format. You can test this and let me know if you see the same thing.Yup, i see the same thing. You are right, I am find the etuils hit and miss. I am resorting to webscraping the ipg page now. No other choice.