Hey all,
I already have a head start on this question (following this tutorial). However, that method is taking a _really_ long time, since I have a list of ~0.5 billion sequences to get. Additionally, some of my threads are throwing errors during sequence filtering, and I'm afraid this method might not work.
So! I'm asking whether you have a better idea for how to get every bacterial protein sequence from NCBI. I don't think EDirect will work (I'll be blocked). One idea I had was to use esearch and efetch on a local copy of all protein records (nr.fa). However, EDirect doesn't support local queries out of the box (at least to my knowledge).
Any advice on how to wrangle EDirect into doing local queries, or any other ideas, would be much appreciated.
You can also download .faa.gz files for every bacterium in RefSeq; check another tutorial.
That requirement, if absolute, will not be satisfied by these two things.
Yes, I know. I guess the proteins of bacteria in RefSeq are enough for his/her purpose, though we don't yet know what he/she will use the data for.
Anyway, one can try
"all protein" sequences is a moving target, anyway...
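On the "local query" idea from the question: if you already have a FASTA file on disk (nr.fa, or the concatenated .faa.gz files) and a plain-text list of accessions, you may not need EDirect at all. Below is a minimal sketch in Python that streams the FASTA once and keeps only the records whose accession is in your list. The function names and file layout are my own assumptions, not anything from NCBI's tools. It also assumes the accession is the first whitespace-delimited token of each header, so for nr (where several entries can share one header line) it only checks the first accession. With ~0.5 billion IDs an in-memory set will need a lot of RAM, so you would probably have to split the list into chunks.

```python
import gzip
from typing import Iterable, Iterator, Set


def open_text(path: str, mode: str = "rt"):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def filter_fasta(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield header and sequence lines of records whose accession is in `wanted`.

    Assumption: the accession is the first whitespace-delimited token
    after '>' on the header line.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            accession = tokens[0] if tokens else ""
            keep = accession in wanted
        if keep:
            yield line


def filter_fasta_file(fasta_path: str, ids_path: str, out_path: str) -> int:
    """Write matching records to `out_path`; return the number of records kept.

    `ids_path` is a hypothetical text file with one accession per line.
    """
    with open_text(ids_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}

    kept = 0
    with open_text(fasta_path) as fasta, open_text(out_path, "wt") as out:
        for line in filter_fasta(fasta, wanted):
            if line.startswith(">"):
                kept += 1
            out.write(line)
    return kept
```

You would call it as something like `filter_fasta_file("nr.fa", "bacterial_ids.txt", "bacteria.faa")`, where both the ID file and the output name are placeholders. Since it is a single sequential pass, the run time should be dominated by reading the FASTA rather than by the number of IDs.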