The bulk of the fastq.gz files is stored on a server, but I need to download them to my work folder/laptop. However, I only have an 800 GB external hard disk and a laptop with a 500 GB HDD. I have never handled such big data before. According to my colleagues, the previous user ran this crazy 2.6 TB of data on her laptop, which sounds impossible to me... Does anyone have any suggestions?
This isn't really a bioinformatics research question and may be closed for that reason.
2.6 TB of raw data is conceivable for a large consortium project but sounds a bit suspicious for an individual project. Perhaps you are misinformed about the real size of the data. Have you actually checked the size of the data?
If the data is already on a server somewhere, then your best bet is to do all the computing right there. It would be a fool's errand to try to move that data to a local laptop.
You are right, thank you. Someone used du -sh to tell me the file size; I wonder if it could be wrong.
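For what it's worth, `du -sh` reports actual on-disk usage and is usually trustworthy. A quick way to sanity-check it on throwaway data (the `/tmp` path below is just for illustration):

```shell
# Create a throwaway directory with a 1 MiB dummy file and measure it.
mkdir -p /tmp/du_demo
head -c 1048576 /dev/zero > /tmp/du_demo/sample.fastq.gz
du -sh /tmp/du_demo             # human-readable total, e.g. "1.0M"
du -sb /tmp/du_demo | cut -f1   # exact apparent size in bytes (GNU du)
```

If the result of `du -sh` on the real directory really says 2.6T, that number is the on-disk total, not an estimate.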
What is the data? TCGA (cancer data)? I'm currently downloading 19 TB of data, but I have gone through the correct procedure of contacting the head of IT to let him/her know and to get advice. I am certainly not going to analyse it on a laptop.
I had the fun of analyzing that many TBs of data recently, and even considering doing it on a laptop is insane (sorry, no offence to OP). It is not only the amount of raw data, but also the amount of intermediate files generated before you have your final output. Also, aligning these data with the 2, 4 or 8 threads of a laptop and maybe 8 or 16 GB of RAM will take ages. As a crude benchmark: on our HPC cluster, trimming and aligning a paired-end WGS sample (human, 500 million paired-end reads, 2x100 bp, each fastq.gz about 80 GB at compression level 5) with 32 cores @ 2.4 GHz and 120 GB RAM, piped directly into
sambamba sort
and a duplicate-marking tool, takes about 8 hours, if I remember correctly. On top of that, you have to add the time to process the data according to your scientific question. I am also not sure it is healthy for your laptop to run at full load for weeks, because that is probably how long it will take. Get in contact with your IT department and ask for advice. You need a proper server node for this, and an adequate storage solution.

Hi, thank you so much for your reply. I was asked to submit a grant proposal covering CPU, RAM and HDD, and probably a desktop, within two days. As the desktop is going to be placed at my personal workbench, with a budget of 7000 dollars, would it be possible to get some advice from you on choosing this hardware?
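For context, the benchmark above corresponds to a streamed pipeline roughly like the one sketched below. The original post only names sambamba sort; the trimmer and aligner here (fastp, bwa) are my assumptions, so substitute whatever tools your group actually uses. The function is only defined, not run, since it needs a reference genome and real FASTQ input:

```shell
# Sketch of a trim/align/sort stream: no uncompressed intermediate
# FASTQ or SAM file ever touches the disk, which matters at this scale.
# Tool choice (fastp, bwa) is an assumption; only sambamba sort comes
# from the post above.
align_sample() {
    local ref=$1 r1=$2 r2=$3 out=$4 threads=${5:-32}
    fastp --in1 "$r1" --in2 "$r2" --stdout --thread 4 2> fastp.log \
      | bwa mem -t "$threads" -p "$ref" - \
      | sambamba view -S -f bam /dev/stdin \
      | sambamba sort -t "$threads" -o "$out" /dev/stdin
}
# Example call (paths are illustrative):
# align_sample hg38.fa sample_R1.fastq.gz sample_R2.fastq.gz sample.sorted.bam
```

The point of piping everything is exactly the intermediate-file problem mentioned above: a 2.6 TB input can easily triple in footprint if each step writes its output to disk.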
Does your institution own a high-performance cluster or do you really have to buy a desktop?
They don't own an HPC, but they are interested in modifying the Lenovo desktop that I am currently using. The 7k dollars would be invested in the modification; however, I'll request more funds if buying a new desktop is necessary. They are expecting to see some fruitful output from the 2.6 TB before I can ask for more next year, and there is another 4 TB of data coming soon... Thank you so much for the prompt reply.
You should at least invest in a decent dual-socket Xeon workstation with plenty of RAM. $7K may get you something, but these workstations can become pricey and can cost a LOT more than $7K. Do you ever hope to work on all of that data at the same time? Because if you do, upgrading an existing desktop would not be a great use of $7K.
You should really be looking at cloud providers or other infrastructure to do this analysis properly.
Thanks genomax!! I would like to work on all the data at once, but it seems I will have to separate it into batches... I have started searching for dual-socket workstations like the HP Z820 Tower Workstation with a 16-core E5-2670 (16 cores, 15 TB hard disk storage, 32 GB RAM), and it looks like I'll have to add more RAM to it. The ITS people at my institution couldn't give me advice on this, but they said they will help me set it up. Also, did you mean contacting cloud providers for advice on this purchase, or for the data analysis?
Thanks for the reply, will contact ITS asap!
That is a good move. Even if you have to spend a couple of days bargaining with these guys, it is time well invested, and it will save you a lot of time and nerves in the end!
Thank you, it is population disease data
Obviously you cannot download the whole dataset until you have a larger hard disk.
As always, the question is: what do you want to do with this data? Depending on that, and on what tools are needed for the task, it might be possible to download only a subset of the data, or to have remote access to the data.
fin swimmer
I would like to start with FastQC to check the quality, so a subset sounds like a plan... Remote access to the data wouldn't work, because they don't want people getting too close to the server; some weird regulation.
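One possible way to do the subset idea: copy a few files off the server, optionally subsample the reads, and run FastQC only on those. The host name and paths below are made up for illustration, and seqtk/FastQC are my suggested tools, not something prescribed above:

```shell
# Pull a few files instead of all 2.6 TB, subsample, and QC them.
# Remote host and paths are hypothetical placeholders.
qc_subset() {
    local remote=$1 outdir=$2
    mkdir -p "$outdir"
    # Copy just the first three files for a first look.
    scp "$remote":/data/run1/sample_{1,2,3}.fastq.gz "$outdir"/
    # Subsample 100k reads per file (fixed seed -s42 for reproducibility)
    # so FastQC finishes in minutes rather than hours.
    for f in "$outdir"/*.fastq.gz; do
        seqtk sample -s42 "$f" 100000 | gzip > "${f%.fastq.gz}.sub.fastq.gz"
    done
    fastqc -t 4 -o "$outdir" "$outdir"/*.sub.fastq.gz
}
```

Note that FastQC on a subsample only approximates per-base quality and adapter content; if you see anything odd, re-run it on the full file for that sample.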
An external hard drive is cheap. One can easily get 4TB of storage for under $100 and 8TB for $150.
While that is true on paper, the question is whether it makes sense to download 2.6 TB of data to an external drive and then try to do the analysis with it... on a laptop.
greyman: You don't need to close a post if you have received satisfactory answers. Closing posts is an action used by mods for posts that may be off-topic for this forum.

Thank you, still learning...