Hi,
I would like to bulk retrieve the following information for a list of virus species from NCBI.
- Representative genome (e.g. Machupo mammarenavirus)
- Link to the genome (https://www.ncbi.nlm.nih.gov/genome/?term=Machupo+mammarenavirus)
- Replicon info:

      Type  Name  RefSeq       INSDC       Size (Kb)  GC%   Protein  Gene
      Chr   S     NC_005078.1  AY129248.1  3.44       43.4  2        2
      Chr   L     NC_005079.1  AY358021.2  7.2        41.0  2        2

- RefSeq IDs for each chr (S, M, L, etc.) (NC_005078.1, NC_005079.1)
- Gene and protein IDs found in each segment/chr (Machupo virus segment S - GeneID:2943093 /locus_tag="MACVsSgp1" /db_xref="GeneID:2943093" /protein_id="NP_899212.1")
Is there a way to bulk retrieve this info? I have used efetch and esearch to retrieve sequences before, but I am having a hard time figuring out how to get the above information. Hope someone can help me. Thank you in advance.
Download the assembly summary file for GenBank genomes, then parse out the top-level assembly information for viruses from that file. Take a look at the column headers to decide what you need.
Similar information is also available for RefSeq genomes.
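A minimal sketch of the parsing step, assuming the summary file is tab-separated and has a `#`-prefixed header line naming the columns `assembly_accession`, `organism_name` and `ftp_path` (check these against the headers in your copy). The function name `select_assemblies` and the file names in the usage line are hypothetical:

```shell
# Print accession, organism name and FTP path for every assembly whose
# organism_name exactly matches a line in a names file.
#   $1: assembly summary file (tab-separated, header line starts with '#')
#   $2: text file with one organism name per line
select_assemblies() {
  awk -F'\t' '
    # First file on the command line: remember the wanted organism names.
    NR == FNR { if ($0 != "") wanted[$0] = 1; next }

    # Comment/header lines: map column names to positions (assumed layout).
    /^#/ {
      line = $0
      sub(/^#+[ ]*/, "", line)
      n = split(line, h, "\t")
      for (i = 1; i <= n; i++) col[h[i]] = i
      next
    }

    # Data rows: keep only the requested organisms.
    ($(col["organism_name"]) in wanted) {
      print $(col["assembly_accession"]) "\t" $(col["organism_name"]) "\t" $(col["ftp_path"])
    }
  ' "$2" "$1"
}

# Usage (hypothetical file names):
#   select_assemblies assembly_summary.txt virus.txt > selected.tsv
```

Looking columns up by header name rather than by fixed position keeps the sketch working if columns are added or reordered. Matching is exact, so the names in the list must be spelled as they appear in the summary file.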
Thank you, very helpful! Now I can retrieve assembly IDs using the method you suggested.
How can I use the assembly ID to obtain all gene IDs, protein IDs, and locus tags included under each assembly ID? I will be using 200 viral genome assemblies as a query.
If you already have the assembly IDs, you can download the feature_table.txt file associated with that assembly. It should have the information you are looking for. For example, the feature_table.txt file for the Machupo assembly is here. You should be able to automate this by first using esearch and esummary to get the FTP paths and then using the wget or curl commands to download the feature_table.txt files.
This is what @vkkodali is referring to:
This will get you the feature_table file:
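A rough sketch of that workflow, not necessarily the exact commands posted in this thread. It assumes that the assembly DocumentSummary exposes the FTP directory as `FtpPath_RefSeq`, that the feature table is named after the last component of that directory, and that the table has a header line with the columns `genomic_accession`, `GeneID`, `locus_tag` and `product_accession`. All three function names are hypothetical:

```shell
# Build the feature-table URL from an assembly FTP directory path.
# Assumption: the file name repeats the last path component.
feature_table_url() {
  local ftp_path="${1%/}"                     # drop a trailing slash, if any
  printf '%s/%s_feature_table.txt.gz\n' "$ftp_path" "${ftp_path##*/}"
}

# Look up the RefSeq FTP path for one assembly accession with EDirect and
# download its feature table. Needs esearch, esummary, xtract and wget.
fetch_feature_table() {
  local acc="$1" ftp_path
  # < /dev/null keeps esearch from consuming the caller's standard input.
  ftp_path=$(esearch -db assembly -query "$acc" < /dev/null |
               esummary |
               xtract -pattern DocumentSummary -element FtpPath_RefSeq)
  [ -n "$ftp_path" ] || { echo "no RefSeq FTP path for $acc" >&2; return 1; }
  wget -q "$(feature_table_url "$ftp_path")"
}

# Print segment accession, GeneID, locus tag and protein accession for the
# CDS rows of a gzipped feature table; columns are found by header name.
feature_ids() {
  gzip -dc "$1" | awk -F'\t' '
    NR == 1 { sub(/^# */, ""); for (i = 1; i <= NF; i++) col[$i] = i; next }
    $1 == "CDS" {
      print $(col["genomic_accession"]) "\t" $(col["GeneID"]) "\t" \
            $(col["locus_tag"]) "\t" $(col["product_accession"])
    }
  '
}
```

With an accession in hand, `fetch_feature_table "$acc"` followed by `feature_ids` on the downloaded `.gz` file would list one line per protein. If a query returns more than one assembly, `ftp_path` will hold several lines and the sketch would need a loop over them.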
Thank you @genomax and @vkkodali
Where I have the following viruses inside virus.txt
When I use it I get the following error
So I removed the term 'mammarenavirus' and reran the command. Now I am getting unrelated results in addition to search terms. What am I doing wrong? :(
Try
Appreciate it! Thank you! I went ahead and ran the following command on the extracted assemblies.
It retrieved only one feature file, even though assembly.txt contained five assembly IDs. The following is the log after running the command.
Take a look at the 'While Loop' subsection here. Essentially, you need to add < /dev/null to the esearch command as follows:
Sweet! Worked like a charm! Thank you @vkkodali
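The effect of `< /dev/null` can be reproduced without network access. In the sketch below, the hypothetical function `reads_stdin` stands in for esearch: like esearch, it reads whatever is on standard input, which inside a `while read` loop is the rest of the ID file. The loop function names and the file name in the usage lines are also made up for illustration:

```shell
# Stand-in for esearch: drains standard input, then reports its argument.
reads_stdin() {
  cat > /dev/null
  echo "processed $1"
}

# Problem case: the command inside the loop swallows the remaining lines,
# so the loop body runs only once.
loop_without_redirect() {
  while read -r id; do
    reads_stdin "$id"
  done < "$1"
}

# Fixed: the command gets an empty standard input, so the ID file is left
# for 'read' and every line is processed.
loop_with_redirect() {
  while read -r id; do
    reads_stdin "$id" < /dev/null
  done < "$1"
}

# Usage (hypothetical file with one assembly ID per line):
#   loop_without_redirect assembly.txt   # handles only the first ID
#   loop_with_redirect assembly.txt      # handles all IDs
```

This matches the symptom described above, where only one feature file was retrieved from a list of five assembly IDs.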