How can I quickly download a billion FASTA sequences using a Python script, a shell script, or an awk script?
If this is possible, please provide the script.
Here is a way to get half a billion sequences quite quickly actually:
Get the blast NR database:
time update_blastdb.pl --decompress --source aws --num_threads 10 nr
It downloads 384 GB of data in about an hour (I sure have fast internet here!) and prints:
Connected to AWS
real 65m29.036s
user 9m16.289s
sys 65m22.974s
If you want to turn that into FASTA (don't do it though!) you could then do:
blastdbcmd -db nr -entry all > halfbillion.fa
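If the goal is only to process the sequences rather than keep a giant FASTA file, one option is to pipe the blastdbcmd output straight into a script that handles one record at a time. The following is a minimal sketch of that idea, not a tested pipeline for nr: the script name fasta_stream.py and the function names iter_fasta and summarize are made up for illustration, and it assumes plain FASTA text arrives on standard input.

```python
import sys
from typing import Iterable, Iterator, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA lines, one record at a time."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            # Sequence lines are joined; blank lines are skipped.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines: Iterable[str]) -> Tuple[int, int]:
    """Count records and residues while holding only one record in memory."""
    n_records = 0
    n_residues = 0
    for _header, seq in iter_fasta(lines):
        n_records += 1
        n_residues += len(seq)
    return n_records, n_residues


if __name__ == "__main__" and not sys.stdin.isatty():
    # Hypothetical usage (file name is an assumption):
    #   blastdbcmd -db nr -entry all | python fasta_stream.py
    records, residues = summarize(sys.stdin)
    print(f"{records} sequences, {residues} residues")
```

Because iter_fasta is a generator, memory use stays roughly constant no matter how many records flow through the pipe, and nothing extra is written to disk.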
I was baffled by this answer, and even more so now that it has 3 votes.
One could argue that the OP asked for any billion FASTA sequences, since the question is not worded with enough detail, but that is unlikely to be the case. I don't think it is best practice to give an answer that could potentially tie up the network for hours and require 384 GB of disk space without being clear that this is what the OP wants. When we add that most people outside of educational and government institutions can't download 384 GB in one hour, or that update_blastdb.pl doesn't come standard on most systems, it doesn't seem at all like an answer to this question, regardless of the fact that the question was not asked with enough detail.
I should have posted with a "tongue in cheek" symbol, making it clear I was somewhat joking.
As I read the original post, I got curious about how one would realistically get a billion FASTA sequences from the web, and I also happened to need to download nr for work, hence the "answer".
In a nutshell, I really don't see how a regular person could make a billion requests over the web to download FASTA files, or how they would even organize those files on their system without breaking basic commands.
But as it turns out, a BLAST database is a perfectly usable way to both download and maintain that information. I posted the sizes and download speeds primarily because I hoped they would be educational. If anyone needs to store or distribute a very large number of FASTA files, storing them as a BLAST database might be a good way to go about it.
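To make the "store it as a BLAST database" idea concrete, here is a rough sketch that builds the two BLAST+ command lines involved: makeblastdb to pack a FASTA file into a database, and blastdbcmd to pull a single record back out by ID. The file name seqs.fa, the database name mydb, the accession in the usage comment, and the Python helper names are all placeholders of mine; the sketch assumes BLAST+ is installed and on the PATH, and that the FASTA headers carry IDs that -parse_seqids can index.

```python
import subprocess
from typing import List


def makeblastdb_cmd(fasta_path: str, db_name: str, dbtype: str = "prot") -> List[str]:
    """Build the makeblastdb command that packs a FASTA file into a BLAST database."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", dbtype,
        "-parse_seqids",  # index sequence IDs so single entries can be fetched later
        "-out", db_name,
    ]


def fetch_entry_cmd(db_name: str, accession: str) -> List[str]:
    """Build the blastdbcmd command that prints one record as FASTA."""
    return ["blastdbcmd", "-db", db_name, "-entry", accession]


def run(cmd: List[str]) -> str:
    """Run a command and return its standard output; raises if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Hypothetical usage, assuming BLAST+ is installed and seqs.fa exists:
#   run(makeblastdb_cmd("seqs.fa", "mydb"))
#   print(run(fetch_entry_cmd("mydb", "SOME_ACCESSION")))
```

The point of the design is that a handful of database volumes replace what would otherwise be millions of small files, while individual records stay retrievable by ID.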
Why do you need a billion fasta sequences?
Download what? From where? How are you going to store all this data? Is it one file or a billion?
If you expect to be able to work on a billion files, you are going to run into LOTS of additional problems.
This question is unanswerable as it stands.