I would like to download all bacterial proteins from GenBank, but as a non-redundant set, meaning that entries which are exactly (!) identical get removed.
What is the most efficient way to do this?
I could use the Taxonomy Browser and then get the GIs of all bacterial products, but that set is redundant. The other problem is that I cannot download the file with the GIs, since it seems to be too big. If I retrieve NCBI Protein Clusters instead, entries that are not exactly identical also get removed, right?
Is it possible to get the faa sequences of all bacterial entries in the nr BLAST db as a FASTA file via blastdb_aliastool? Or do I have to use eutils?
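To make "exactly identical" concrete, here is a minimal sketch of the filtering I have in mind, assuming the sequences are already available as FASTA text. The function names `read_fasta` and `dedupe_exact` are hypothetical helpers written for illustration, not part of any NCBI tool, and the sketch keeps every unique sequence in memory, which may not scale to all bacterial proteins.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_exact(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep only the first record for each exactly identical sequence."""
    seen = set()
    for header, seq in records:
        key = seq.upper()  # assumption: letter case is not a meaningful difference
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


# Hypothetical toy input: records "a" and "b" carry the same sequence.
example = """>a
MKT
AYI
>b
MKTAYI
>c
MKTAYV
""".splitlines()

unique = list(dedupe_exact(read_fasta(example)))
for header, seq in unique:
    print(f">{header}\n{seq}")
```

Only sequences that match character for character are collapsed here; a single differing residue (as in record "c") keeps the entry, which is the behaviour I am after.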