Retrieving all sequences of a specific gene from an organism
6.2 years ago
marcoooo • 0

Hi,

I know that similar questions have been asked here, but I still haven't found a fitting answer.

I need to download all the nucleotide sequences of a specific gene of a virus from GenBank. Not only is it difficult to find the published sequences of the gene itself, but I would also like to find the copies within whole genomes (annotated ones, of course).

For instance, let's say I need all nucleotide sequences of the "EBNA-1" gene of Human herpesvirus 4. Is there a way to download a FASTA file of all published EBNA-1 sequences, including the ones annotated in complete genomes? There are far too many sequences to do this manually, but all the searches I tried returned sequences from almost random organisms. I have mainly used the NCBI website to test the searches, and eUtils (esearch and efetch) for the downloads.

Thanks a lot in advance.

Best, Marco

gene genome database virus • 1.5k views

If you have already tried eUtils, can you tell us how you ran the search? Did that method not work?


Mainly I tried this:

> esearch -db nucleotide -query "search_terms" | efetch -format fasta

As search terms I tried different combinations, such as "EBNA-1", "EBNA-1 AND human herpesvirus 4", etc. The results are usually a few of the published gene sequences plus whole genomes.
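Free-text terms like the ones above match any indexed field, which may explain hits from unrelated organisms. A minimal sketch of a field-restricted query follows, assuming the Entrez nucleotide fields `[Organism]` and `[Gene]` fit this case. The helper names `build_query` and `esearch_url` are hypothetical, and records whose gene is not annotated under one of the listed names would still be missed.

```python
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_query(organism, gene_names):
    """Restrict the search to one organism and to a set of gene-name synonyms."""
    genes = " OR ".join(f'"{name}"[Gene]' for name in gene_names)
    return f'"{organism}"[Organism] AND ({genes})'


def esearch_url(query, db="nucleotide", retmax=10000):
    """Build (but do not send) an esearch request that stores hits on the history server."""
    params = {"db": db, "term": query, "usehistory": "y", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# The synonym list is an assumption; extend it with whatever names the annotations use.
query = build_query("Human gammaherpesvirus 4", ["EBNA-1", "EBNA1"])
print(query)
print(esearch_url(query))
```

The same query string can be passed to `esearch -db nucleotide -query` on the command line. Comparing the hit count with and without the `[Gene]` clause gives a rough idea of how many records lack the gene annotation.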


Hi,

I think you can do this directly through NCBI. There are several steps:

1. Search for your gene in NCBI and fetch all published articles related to it. (I quickly searched for your gene name and found only ~2,500 articles in total.)
2. Use the PubMed batch download to get these articles first.
3. Confirm the gene IDs and other information.
4. Download the FASTA sequences directly from NCBI.

In my view it's easy :)

Enjoy


Hi,

I apologize, but I don't think I understand the process exactly. How do I get from the articles to the IDs of the sequences to download from NCBI (without doing it one by one, I mean, as there are thousands, as you say)? I can download all the sequences published with a paper, but then I'll get many whole genomes and sequences I'm not interested in, since a paper usually does not publish only the sequence of a single gene.

Thanks!
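
One way to keep only the gene of interest out of whole-genome records is to fetch coding sequences rather than full records and filter them by annotation. The sketch below assumes the records were downloaded with `efetch -format fasta_cds_na`, whose headers carry bracketed tags such as `[gene=...]` and `[protein=...]`. The function names are hypothetical, and the three records in `example` are invented toy data, not real GenBank entries.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalise(name):
    """Lower-case and drop punctuation so 'EBNA-1' and 'EBNA1' compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def select_gene(fasta_text, gene_names):
    """Keep records whose gene tag equals, or protein tag contains, a wanted name."""
    wanted = {normalise(n) for n in gene_names if n}
    kept = []
    for header, seq in parse_fasta(fasta_text):
        tags = dict(re.findall(r"\[(\w+)=([^\]]*)\]", header))
        gene = normalise(tags.get("gene", ""))
        protein = normalise(tags.get("protein", ""))
        if gene in wanted or any(w in protein for w in wanted):
            kept.append((header, seq))
    return kept


# Invented toy records that only mimic the header layout.
example = """\
>lcl|TOY0001.1_cds_1 [gene=EBNA-1] [protein=nuclear antigen EBNA-1] [location=1..12]
ATGTCTGACGAG
>lcl|TOY0001.1_cds_2 [gene=LMP-1] [protein=latent membrane protein 1] [location=20..31]
ATGGAACACGAC
>lcl|TOY0002.1_cds_1 [protein=EBNA1] [location=5..16]
ATGTCTGAC
GAG
"""

for header, seq in select_gene(example, ["EBNA-1"]):
    print(">" + header)
    print(seq)
```

Matching on the protein tag as a substring is deliberately loose, so the kept records are worth a quick manual check. Genomes where the gene is annotated under a different name would need that name added to the list.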


You should be able to modify my script, here, such that it returns nucleotide sequences instead of protein sequences: A: How to download all sequences of a list of proteins for a particular organism

I tested it for your gene already and it works:

/usr/bin/python2.7 NucFASTASearchByFASTATitle.py -e myemail@email.ie -t "EBNA-1"
>YP_001129471.1 EBNA-1 [Human herpesvirus 4 type 2]
MSDEGPGTGPGNGLGQKEDTSGPDGSSGSGPQRRGGDNHGRGRGRGRGRGGGRPGAPGGSGSGPRHRDGV
RRPQKRPSCIGCKGAHGGTGAGGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGA
GAGGGAGAGGAGAGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGA
GAGGAGAGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGA
GAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGGRGRGGSGGRGRGGSGGRGRGGS
GGRRGRGRERARGGSRERARGRGRGRGEKRPRSPSSQSSSSGSPPRRPPPGRRPFFHPVAEADYFEYHQE
GGPDGEPDMPPGAIEQGPADDPGEGPSTGPRGQGDGGRRKKGGWYGKHRGEGGSSQKFENIAEGLRLLLA
RCHVERTTEDGNWVAGVFVYGGSKTSLYNLRRGIGLAIPQCRLTPLSRLPFGMAPGPGPQPGPLRESIVC
YFIVFLQTHIFAEGLKDAIKDLVLPKPAPTCNIKVTVCSFDDGVDLPPWFPPMVEGAAAEGDDGDDGDEG
GDGDEGEEGQE

>YP_401677.1 nuclear antigen EBNA-1 [Human gammaherpesvirus 4]
MSDEGPGTGPGNGLGEKGDTSGPEGSGGSGPQRRGGDNHGRGRGRGRGRGGGRPGAPGGSGSGPRHRDGV
RRPQKRPSCIGCKGTHGGTGAGAGAGGAGAGGAGAGGGAGAGGGAGGAGGAGGAGAGGGAGAGGGAGGAG
GAGAGGGAGAGGGAGGAGAGGGAGGAGGAGAGGGAGAGGGAGGAGAGGGAGGAGGAGAGGGAGAGGAGGA
GGAGAGGAGAGGGAGGAGGAGAGGAGAGGAGAGGAGAGGAGGAGAGGAGGAGAGGAGGAGAGGGAGGAGA
GGGAGGAGAGGAGGAGAGGAGGAGAGGAGGAGAGGGAGAGGAGAGGGGRGRGGSGGRGRGGSGGRGRGGS
GGRRGRGRERARGGSRERARGRGRGRGEKRPRSPSSQSSSSGSPPRRPPPGRRPFFHPVGEADYFEYHQE
GGPDGEPDVPPGAIEQGPADDPGEGPSTGPRGQGDGGRRKKGGWFGKHRGQGGSNPKFENIAEGLRALLA
RSHVERTTDEGTWVAGVFVYGGSKTSLYNLRRGTALAIPQCRLTPLSRLPFGMAPGPGPQPGPLRESIVC
YFMVFLQTHIFAEVLKDAIKDLVMTKPAPTCNIRVTVCSFDDGVDLPPWFPPMVEGAAAEGDDGDDGDEG
GDGDEGEEGQE

Thanks for the suggestion! I tried your script, but the number of sequences I get (even if I search for the protein sequences instead of the nucleotide ones) is really low. For instance, if I follow your example and search for "EBNA-1", I download 8 sequences (of the hundreds published for the gene). Am I missing something?


The other sequences may be published, but have they been submitted to Entrez?
