Hi all,
Background: I need to recognize every kind of sequence identifier present in the preformatted NCBI nucleotide databases. I've implemented a regular expression following https://www.ncbi.nlm.nih.gov/Sequin/acc.html, but it is not enough: other accessions (e.g. PDB) are also present. So I would like examples of all the formats I can encounter, but I was not able to find any list describing what can actually be inside those databases.
One possible solution, I thought, would be to use Entrez to retrieve the accessions for me. There is a blastdbinfo database which lists the available databases, but I am not able to get elink to link anywhere.
Let's focus on refseq_genomes as an example.
The database can be found with the following command:
esearch -db blastdbinfo -query "refseq_genomes[DB]"
So, given that I want the nucleotide sequence accessions present in that database, what should the elink statement be?
esearch -db blastdbinfo -query "refseq_genomes[DB]" | ... SOME ELINK .... | efetch --format acc
For the Entrez experts here: how do I tell which database links where?
I know I can download the databases and use blastdbcmd to obtain the accessions, but it should be possible to obtain the accessions in some better way.
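For reference, the download-based fallback I have in mind looks roughly like the sketch below. It assumes BLAST+ is installed and the database has already been downloaded locally. The function names `dump_accessions` and `accession_shapes` are made up for illustration; the second one collapses each accession into a pattern (letters become `A`, digits become `9`) so that the distinct identifier formats are easy to spot.

```shell
# Dump every accession from a local BLAST database (requires BLAST+).
# %a is the blastdbcmd output-format code for the accession.
dump_accessions() {
    blastdbcmd -db "$1" -entry all -outfmt "%a"
}

# Hypothetical helper: reduce each accession to its "shape" and count
# how often each shape occurs, most frequent first.
accession_shapes() {
    sed -e 's/[A-Za-z]/A/g' -e 's/[0-9]/9/g' | sort | uniq -c | sort -rn
}

# Usage (not run here):
#   dump_accessions refseq_genomes | accession_shapes
```

Each distinct shape in the output would then correspond to one identifier format that the regular expression has to cover.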
Thank you
For a given db, you can list all available link names along with a brief description of each. The Entrez Link Descriptions webpage also lists this information, but I am not sure how up to date that is. It looks like blastdbcmd may be the best solution for you.
Thank you for the link. According to that, it looks like there is no direct link between blastdbinfo and e.g. nuccore.
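One way to check the available links from the command line is EDirect's einfo with its -links option. The sketch below is a minimal illustration, not necessarily the command the answer originally showed. It assumes EDirect is installed, that network access is available, and that each output line starts with the link name. The function names `list_links` and `links_to` are hypothetical, and `links_to` further assumes that link names follow a `<source>_<target>` naming convention.

```shell
# List the link names defined for one Entrez database.
# Requires EDirect and network access.
list_links() {
    einfo -db "$1" -links
}

# Hypothetical filter: keep links whose name targets a given database.
# Assumes the link name is the first whitespace-delimited field and is
# built as <source>_<target>[_qualifier].
links_to() {
    awk -v target="$1" '{
        n = split($1, parts, "_")
        if (n >= 2 && parts[2] == target) print $1
    }'
}

# Usage (not run here):
#   list_links blastdbinfo | links_to nuccore
```

If the filter prints nothing for blastdbinfo, that would be consistent with the conclusion above that no direct link to nuccore exists.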