I have the following problem: I want to select only unspliced sequences from a list of BLAST results. I use a local copy of the nt database. The only way I see at the moment is to use Entrez to fetch the GenBank file for each accession and then check the LOCUS line for the molecule type.
This comes with a few drawbacks. I can call Entrez with a list of accessions, in which case I get one continuous stream of all the GenBank files in the list. That can take quite some time to parse if the list contains accessions for full chromosomes. The other way would be to make one Entrez request per accession. That produces a lot of requests, so I need to add some delay between them or I get an HTTP 429 error, which again prolongs the process.
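A middle ground between one huge request and one request per accession is to send medium-sized batches with a short pause in between. The sketch below only shows the batching and pacing logic. `fetch` is a hypothetical callable standing in for whatever does the actual Entrez call, and the batch size of 200 is an arbitrary assumption. The default delay assumes NCBI's documented limit of 3 requests per second without an API key.

```python
import time


def batched(accessions, batch_size):
    """Yield consecutive slices of at most batch_size accessions."""
    for start in range(0, len(accessions), batch_size):
        yield accessions[start:start + batch_size]


def fetch_in_batches(accessions, fetch, batch_size=200, delay=0.34):
    """Call fetch(batch) for every batch, sleeping between calls.

    fetch is a hypothetical callable that takes a list of accessions
    and returns a list of results; it is not part of any library.
    """
    results = []
    for index, batch in enumerate(batched(accessions, batch_size)):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay)
        results.extend(fetch(batch))
    return results
```

With an API key the delay can probably be lowered, since the documented limit then rises to 10 requests per second.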
As this whole thing is meant to run as a web service, I do not know in advance which sequences will be submitted, so I potentially need to know the molecule type for every sequence in the nt database.
So the best solution for me would be a local database that gives the molecule type for every NCBI accession. This would speed up the process tremendously.
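Such a local lookup could be as simple as a single SQLite table keyed by accession. This is a minimal sketch, not a tuned design: the table name `moltype`, the column names, and both function names are made up for illustration.

```python
import sqlite3


def build_lookup(db_path, rows):
    """Create (or extend) the lookup table from (accession, biomol) pairs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS moltype ("
        "accession TEXT PRIMARY KEY, biomol TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT OR REPLACE INTO moltype VALUES (?, ?)", rows)
    return conn


def lookup(conn, accessions):
    """Return {accession: biomol} for the accessions found in the table."""
    accessions = list(accessions)
    if not accessions:
        return {}
    placeholders = ",".join("?" * len(accessions))
    query = (
        "SELECT accession, biomol FROM moltype "
        f"WHERE accession IN ({placeholders})"
    )
    return dict(conn.execute(query, accessions))
```

SQLite limits the number of `?` parameters per statement, so a very long BLAST hit list would need to be looked up in chunks.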
So the best way I see at the moment is to download all https://ftp.ncbi.nlm.nih.gov/genbank/gbbct*.seq.gz files and parse them locally. Or is there a better way?
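If the flat files are parsed locally, there is no need to build full record objects: only the LOCUS and VERSION lines matter. The sketch below streams a gzipped GenBank flat file line by line and yields one pair per record. It assumes the usual LOCUS layout, in which the molecule type is the token right after `bp`, and the function name is made up for illustration.

```python
import gzip


def iter_molecule_types(path):
    """Yield (accession.version, molecule type) for each record in a
    gzipped GenBank flat file, reading only LOCUS and VERSION lines."""
    moltype = None
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("LOCUS"):
                fields = line.split()
                # Assumed layout: LOCUS name length bp moltype topology ...
                if "bp" in fields[:-1]:
                    moltype = fields[fields.index("bp") + 1]
                else:
                    moltype = None
            elif line.startswith("VERSION") and moltype is not None:
                parts = line.split()
                if len(parts) > 1:
                    yield parts[1], moltype
                moltype = None
```

The pairs it yields can be written straight into whatever local lookup is used, so the large sequence sections are never held in memory.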
Okay, interesting: while trying to clarify myself for that comment, I figured out that your approach is way better. Until now I used the Biopython Entrez module, which scaled really badly, especially when parsing a lot of big GenBank files. The bash pipeline does not seem to have this downside. So I would say thanks.
Be sure to sign up for an NCBI API key if you are planning to do a lot of look-ups.
Okay, I figured out another way to do this, and it is even quicker.
$ esummary -db nuccore -id NG_011749.1,NM_000240,F12345,AF223456 | xtract -pattern DocumentSummary -element AccessionVersion Biomol
This is much quicker than your version when the GenBank files become bigger.
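To use that output from Python, the two columns only need to be split apart. The sketch below assumes xtract's default tab-separated output, one line per DocumentSummary. Both function names are made up, and which Biomol values count as unspliced is left to the caller because it depends on the use case.

```python
def parse_xtract_table(text):
    """Turn 'AccessionVersion<TAB>Biomol' lines into a dict."""
    table = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            # Skip blank lines and records that have no Biomol value.
            continue
        table[fields[0]] = fields[1]
    return table


def select_by_biomol(table, wanted):
    """Return the accessions whose Biomol value is in wanted, in input order."""
    wanted = set(wanted)
    return [accession for accession, biomol in table.items() if biomol in wanted]
```

The text could come from a file written by the pipeline, or from the captured standard output of a subprocess that runs it.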