Question

How do I deal with this big data?

1

5.6 years ago

zizigolu ★ 4.3k

Hi,

I have 40 whole-genome sequencing .bam files, each about 1 TB. Our lab manager downloaded them from Cambridge University to her hard drive. My office computer runs Windows, so I have to use Linux on our university compute cluster to analyze the data. To do that I have to drag and drop each .bam file to my scratch space on the cluster. Each file takes about 2 days to transfer, and when the transfer finishes the .bam file turns out to be corrupted. I was given this data about a month ago and I am still struggling to transfer it. If you were in my place, what would you do?

I thought of asking the lab manager to download these .bam files directly to the compute cluster (but she has not done so yet). I also thought of asking IT services to install Linux on my computer, although I guess I would still need the cluster for the computing :(

I really don't know what to do

Any advice, please?

WGS transfer • 2.2k views

ADD COMMENT • link updated 5.6 years ago by ATpoint 85k • written 5.6 years ago by zizigolu ★ 4.3k

1

Why not ask your IT service how they advise uploading to their cluster?

ADD REPLY • link 5.6 years ago by Joe 21k

0

I asked. They say this is a common problem and that I should use MobaXterm to connect to the HPC and drag and drop (I did, and it failed). They also say we can mount our private filestore on the HPC and copy files from there to scratch (the filestore is mounted, but I need permission).

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k

1

If you really have 40 TB, don't drag/drop. Use rsync or this is gonna be a nightmare. And don't let this manager convince you otherwise. If this is really his/her advice, he/she has little experience with large data handling.

ADD REPLY • link 5.6 years ago by ATpoint 85k

0

See if they're willing to just do it for you for a given fee. You have better things to spend your time on than this and they're moving data around all the time anyway.

ADD REPLY • link 5.6 years ago by Devon Ryan 104k

1

Is it not possible to transfer directly from Cambridge University to your local cluster via ssh (rsync would be my choice)? Otherwise, go somewhere with a fast connection (at least 1 Gb/s) to your cluster and launch an rsync to transfer your data.

ADD REPLY • link 5.6 years ago by Nicolas Rosewick 11k

1

If your computer is on wireless and you are trying to transfer terabytes of data, then this is a fool's errand. At least find a computer with wired ethernet. Most campuses should have gigabit ethernet to the desktop (at least on some ports, if not all).

As has been said already, you should sftp/wget/curl this data directly to the server. Ask the lab manager for credentials if they are needed to download the data.

ADD REPLY • link 5.6 years ago by GenoMax 146k

1

With 40 TB, and guessing each hard drive holds 2 TB, that is 20 hard drives. We could use WinSCP to transfer from the Windows machine to Linux scratch one at a time; depending on the speed, that will take about 20 days. Best to consult with IT: either download directly to scratch from Cambridge or let IT copy the drives.

Also, do you have access to 40TB space on scratch?

ADD REPLY • link 5.6 years ago by zx8754 12k

0

Sorry, next to each .bam file it says, for example, 105,905,561 KB, and I have 40 .bam files. Sorry if I was stupid in calculating how big they are.

In scratch I have 4 TB space

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k

0

105,905,561 kb is ~0.1 terabyte so you have a total of 4 terabytes of data.
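
The arithmetic above can be checked with a few lines of shell. This is only a rough sketch: it assumes the "KB" that Windows shows means 1024 bytes, and it assumes a sustained 1 Gb/s (125 MB/s) link, which is a best case and not something measured in this thread.

```shell
#!/bin/sh
# Size shown by Windows for one .bam file (from the thread), and the file count.
size_kb=105905561
n_files=40

# Assumption: 1 KB as displayed by Windows = 1024 bytes.
bytes_per_file=$(( size_kb * 1024 ))
total_bytes=$(( bytes_per_file * n_files ))

# Convert to decimal terabytes (10^12 bytes); awk handles the floating point.
per_file_tb=$(awk -v b="$bytes_per_file" 'BEGIN { printf "%.2f", b / 1e12 }')
total_tb=$(awk -v b="$total_bytes" 'BEGIN { printf "%.2f", b / 1e12 }')

# Best-case transfer time at an assumed sustained 1 Gb/s = 125e6 bytes/s.
hours=$(awk -v b="$total_bytes" 'BEGIN { printf "%.1f", b / 125e6 / 3600 }')

echo "per file: ${per_file_tb} TB, total: ${total_tb} TB, at least ${hours} h at 1 Gb/s"
# prints: per file: 0.11 TB, total: 4.34 TB, at least 9.6 h at 1 Gb/s
```

So the whole data set is a matter of hours on a wired gigabit link, not the 2 days per file seen with drag and drop.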

ADD REPLY • link 5.6 years ago by GenoMax 146k

0

Sorry sometimes I am getting too stupid

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k

0

No worries. Now the problem has been identified, solutions provided. You need to find/talk with the right people and execute a solution that works.

ADD REPLY • link 5.6 years ago by GenoMax 146k

0

As I am on a Windows system, I am using MobaXterm to connect to the computing cluster, so I cannot use rsync. Otherwise I could ask an office mate to connect my hard drive to his Linux machine so that I can run rsync.

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k

0

In any case don't try the transfer unless you have a wired ethernet connection to prevent timeouts etc.

ADD REPLY • link 5.6 years ago by GenoMax 146k

0

What makes you think you can't use rsync in Moba?

ADD REPLY • link 5.6 years ago by Joe 21k

0

Because I cannot figure out how to point to my hard drive in this command. How does the command know where my files are? My .bam files are on a big external hard drive next to me.

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k

2

You can access the locally mounted external drive under /mnt in MobaXterm. Use the same drive letter you see under Windows, e.g. /mnt/g/your_bam_files.

ADD REPLY • link 5.6 years ago by GenoMax 146k

0

As I am rubbish at the command line, I installed FileZilla and WinSCP. I am trying them and waiting until tomorrow to see whether any file gets transferred :(

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k

2

It is no more difficult than opening a local terminal in MobaXterm and typing

rsync -axv --numeric-ids --progress -e "ssh -T -o Compression=no -x" /mnt/g/*.bam your_user_name@your_server_name:/folder_name_where_you_want_to_copy

Copy one file to begin with instead of *.bam, until you become comfortable.
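
The "one file at a time" approach could be wrapped in a small loop like the sketch below. It is not part of the command above: the function name copy_one_by_one and the RSYNC variable are made up for illustration, the source and destination are placeholders, and --partial is an addition so that an interrupted file is kept on the destination and a rerun does not have to start that file from scratch.

```shell
#!/bin/sh
# Hypothetical helper: send each .bam in a directory with its own rsync call
# and stop at the first failure, so it is obvious which file had a problem.
#   $1 = source directory (e.g. the drive path suggested in this thread)
#   $2 = destination in user@server:/path form (placeholder)
copy_one_by_one() {
  src_dir=$1
  dest=$2
  for f in "$src_dir"/*.bam; do
    [ -e "$f" ] || continue          # the glob matched nothing
    echo "Transferring: $f"
    # RSYNC may be overridden for testing; it defaults to the real rsync.
    ${RSYNC:-rsync} -axv --partial --numeric-ids --progress \
      -e "ssh -T -o Compression=no -x" "$f" "$dest" || {
        echo "Transfer failed for $f; rerun to continue" >&2
        return 1
      }
  done
}

# Usage (placeholders, not real names):
# copy_one_by_one /mnt/g your_user_name@your_server_name:/folder_name_where_you_want_to_copy
```

Files that already arrived intact are skipped quickly on a rerun, because rsync compares size and modification time before sending anything.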

ADD REPLY • link 5.6 years ago by GenoMax 146k

score 3 · Accepted Answer · 2019-03-20

3

5.6 years ago

ATpoint 85k

That sounds like a large amount of data, so it will take time. I do not think there is any way around transferring the files to the HPC scratch, as a desktop computer is simply not powerful enough to handle this amount of data. For transferring data from A to B I use rsync, which keeps the file being transferred hidden under a temporary name until the transfer has finished successfully. I prefer the following command:

rsync -axv --numeric-ids --progress -e "ssh -T -o Compression=no -x" *.bam user@path_to_hpc(...):/scratch/your_username/folder...

This will take time, but --progress will give you a rough estimate for each file. Check whether it would be faster to download directly to the HPC. Talk to the people involved. Depending on how fast the drive is where the data are currently stored, a new download directly to scratch might save you quite some time.
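
Since the files in the question arrived corrupted after drag and drop, it may be worth confirming the copies once rsync is done. The sketch below is one possible way, not something from the command above: it assumes sha256sum (GNU coreutils) is available on both machines, and the function names and the checksums.sha256 file name are made up for illustration.

```shell
#!/bin/sh
# Step 1, on the machine with the external drive: record one checksum per file.
#   $1 = directory holding the .bam files; writes checksums.sha256 into it
make_checksums() {
  ( cd "$1" && sha256sum ./*.bam > checksums.sha256 )
}

# Step 2, on the cluster, after the .bam files and checksums.sha256 are copied:
#   $1 = destination directory; exit status is non-zero if any file differs
verify_checksums() {
  ( cd "$1" && sha256sum -c checksums.sha256 )
}

# Usage (paths are placeholders):
# make_checksums /mnt/g
# ... rsync the .bam files and checksums.sha256 to scratch ...
# verify_checksums /scratch/your_username/folder
```

Reading every file once on each side takes a while at these sizes, but a file that fails the check can then be sent again on its own.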

ADD COMMENT • link 5.6 years ago by ATpoint 85k

0

Sorry for being stupid but if .bam files are in a hard drive can I still use rsync ?

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k

0

Yes. For the end-user rsync is a much more elaborate version of cp as explained here.

ADD REPLY • link 5.6 years ago by ATpoint 85k

0

Thank you so much, gentlemen, for helping me. I am finally transferring the files from the command line and the speed seems reasonable. I am looking forward to the next generation of my posts here, in the data analysis step! :)

ADD REPLY • link 5.6 years ago by zizigolu ★ 4.3k