We are planning to download a very large number of metagenome datasets from MG-RAST (over 2000 metagenomes, raw data). I estimate I need around 5 TB of storage for the sequence data alone, since 80 metagenomes have already consumed 200 GB of space (gzipped, downloaded via the API). I expect I will also need a few terabytes for analysis and results. We have already run out of disk space. Given our current storage situation on the server, one of my colleagues suggested: "Store only the FTP addresses etc. of all the files, so you can go back for them should you need them in the future. Storing all the data is problematic. That is, we never store any DNA sequence data on our server, only the addresses where we got it from."
Until now I thought FTP sites were just for downloading data, so how does "storing FTP addresses" work in practice, instead of downloading the sequence data for downstream analysis such as running BLAST against the NCBI nr database? Any suggestions?
How do you guys store large datasets on the server?
I agree with the comments above. One other alternative is to stream files from their FTP locations directly into your code, if your code/programs can work with streaming data.
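As a rough illustration of that idea (a minimal sketch, not MG-RAST-specific), you could keep a plain-text manifest of the download URLs you recorded and stream each gzipped file on demand instead of keeping local copies. The manifest filename and the URL format are assumptions here; substitute the API/FTP links you actually have.

```python
# Sketch: keep only a manifest of URLs, stream each gzipped file on demand,
# and process it without ever writing the sequence data to disk.
import gzip
import urllib.request

MANIFEST = "metagenome_urls.txt"  # hypothetical file, one download URL per line

def stream_fasta_headers(url):
    """Yield FASTA header lines from a remote gzipped file, streamed over HTTP/FTP."""
    with urllib.request.urlopen(url) as response:
        # gzip.open accepts a file-like object, so we decompress the stream directly
        with gzip.open(response, mode="rt") as handle:
            for line in handle:
                if line.startswith(">"):
                    yield line.rstrip()

if __name__ == "__main__":
    with open(MANIFEST) as manifest:
        for url in (line.strip() for line in manifest if line.strip()):
            count = sum(1 for _ in stream_fasta_headers(url))
            print(f"{url}\t{count} sequences")
```

The same pattern works on the command line, e.g. piping `curl -s <url> | gunzip -c` into a tool that reads from stdin, as long as the downstream program does not need random access to the file.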
Cloud storage seems like the most scalable option.
Irene