We are planning to download a very large number of metagenome datasets from MG-RAST (over 2000 metagenomes, raw data). I estimate I need around 5 TB of storage for the sequence data alone, since 80 metagenomes have already consumed 200 GB of space (gzipped, downloaded via the API). I expect I will also need a few terabytes for analysis and results. We have already run out of disk space. Given our current storage situation on the server, one of my colleagues suggested: "Store only the FTP addresses etc. of all the files, so you can go back for them should you need them in the future. Storing all the data is problematic. That is, we never store any DNA sequence data on our server, only the addresses where we got it from."
Until now I thought FTP sites were just for downloading data, so how does "storing FTP addresses" work in practice, instead of downloading the sequence data for downstream analysis such as running BLAST against the NCBI nr database? Any suggestions?
How do you guys store large datasets on the server?
I agree with the comments above. One other alternative is to stream files from their FTP locations directly into your code, if your code/programs can work with streaming data.
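As a rough illustration of that idea (a minimal sketch, not MG-RAST-specific), you could keep a plain-text manifest of the download URLs you recorded and stream each gzipped file on demand instead of keeping local copies. The manifest filename and the URL format are assumptions here; substitute the API/FTP links you actually have.

```python
# Sketch: keep only a manifest of URLs, stream each gzipped file on demand,
# and process it without ever writing the sequence data to disk.
import gzip
import urllib.request

MANIFEST = "metagenome_urls.txt"  # hypothetical file, one download URL per line

def stream_fasta_headers(url):
    """Yield FASTA header lines from a remote gzipped file, streamed over HTTP/FTP."""
    with urllib.request.urlopen(url) as response:
        # gzip.open accepts a file-like object, so we decompress the stream directly
        with gzip.open(response, mode="rt") as handle:
            for line in handle:
                if line.startswith(">"):
                    yield line.rstrip()

if __name__ == "__main__":
    with open(MANIFEST) as manifest:
        for url in (line.strip() for line in manifest if line.strip()):
            count = sum(1 for _ in stream_fasta_headers(url))
            print(f"{url}\t{count} sequences")
```

The same pattern works on the command line, e.g. piping `curl -s <url> | gunzip -c` into a tool that reads from stdin, as long as the downstream program does not need random access to the file.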
Cloud storage seems like the most scalable option.
Irene