Hi,
I have 40 whole-genome sequencing .bam files, each about 1 TB. The lab manager downloaded them from Cambridge University to her hard drive. My office computer runs Windows, so I have to use Linux on our university Compute Cluster to analyze the data. To do that I have to drag and drop each .bam file to my scratch space on the Compute Cluster. Each file takes about 2 days to transfer, and when it finishes the .bam file turns out to be corrupted. I was given this data about a month ago and I am still struggling to transfer it. If you were in my place, what would you do?
I thought of asking the lab manager to download the .bam files directly to the Compute Cluster (but she has not done so yet). I also thought of asking IT services to install Linux on my computer, although I guess I would still need the cluster for the computing.
I really don't know what to do
Any advice, please?
Why not ask your IT service how they advise uploading to their cluster?
I asked. They say this is a common problem and that I should use MobaXterm to connect to the HPC and drag and drop (I did, and it failed). They also say we can mount our private filestore on the HPC and copy files from there to scratch (the filestore is mounted, but I need permission).
If you really have 40 TB, don't drag/drop. Use rsync or this is gonna be a nightmare. And don't let this manager convince you otherwise. If this is really his/her advice, he/she has little experience with large data handling.
See if they're willing to just do it for you for a given fee. You have better things to spend your time on than this, and they're moving data around all the time anyway.
Is it not possible to transfer directly from Cambridge University to your local cluster via ssh (rsync would be my choice)?
Otherwise, go to a place where you have a fast connection (at least 1 Gb/s) to your cluster and launch an rsync to transfer your data.
If your computer is on wireless and you are trying to transfer terabytes of data, then this is a fool's errand. At least find a computer with wired ethernet. Most campuses should have gigabit ethernet to the desktop (at least on some ports, if not all).
As has been said already, you should sftp/wget/curl this data directly to the server. Ask the lab manager for credentials if they are needed to download the data.
40 TB: guessing each hard drive holds 2 TB, that is 20 hard drives. We could use WinSCP to transfer from the Windows machine to the Linux scratch one at a time; depending on the speed, it will take about 20 days. Best to consult with IT: either download directly to scratch from Cambridge or let IT copy the drives.
Also, do you have access to 40 TB of space on scratch?
Sorry, next to each .bam file it says, for example, 105,905,561 KB, and I have 40 .bam files. Sorry if I am stupid at calculating how big they are.
In scratch I have 4 TB of space.
105,905,561 KB is ~0.1 terabyte, so you have a total of about 4 terabytes of data.
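For anyone who wants to check the arithmetic, here is a rough sketch. It assumes all 40 files are the same size as the one quoted and uses decimal units (1 TB = 10^9 KB); if the reported KB are binary units the result shifts by a few percent.

```shell
# Sizes as reported in the thread: 40 files of 105,905,561 KB each.
per_file_kb=105905561
n_files=40

# Integer arithmetic in the shell for the total in KB.
total_kb=$(( per_file_kb * n_files ))

# Convert to TB with awk, assuming decimal units (1 TB = 10^9 KB).
per_file_tb=$(awk -v kb="$per_file_kb" 'BEGIN { printf "%.2f", kb / 1e9 }')
total_tb=$(awk -v kb="$total_kb" 'BEGIN { printf "%.2f", kb / 1e9 }')

echo "per file: ${per_file_tb} TB, total: ${total_tb} TB"
```

The total lands slightly above 4 TB under these assumptions, so it is worth checking the real file sizes against the 4 TB of scratch space before starting.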
Sorry, sometimes I am just too stupid
No worries. Now the problem has been identified, solutions provided. You need to find/talk with the right people and execute a solution that works.
As I am on a Windows system, I am using MobaXterm to connect to the computing cluster, so I cannot use rsync. Otherwise I could ask an office mate to connect my hard drive to his Linux machine so that I can run rsync.
In any case don't try the transfer unless you have a wired ethernet connection to prevent timeouts etc.
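If the connection does drop anyway, one option is to rerun rsync until it succeeds. The following is only a sketch: `retry_rsync` is a made-up helper name, and the default of 10 tries with a 60 second pause is an arbitrary assumption. It relies on rsync's `--partial` option, which keeps a partly transferred file so that the next attempt can reuse what was already sent.

```shell
# Hypothetical helper: rerun rsync until it exits successfully.
# Usage: retry_rsync SOURCE DEST [MAX_TRIES] [PAUSE_SECONDS]
retry_rsync() {
    src=$1
    dest=$2
    max_tries=${3:-10}   # assumed default, adjust as needed
    pause=${4:-60}       # seconds to wait between attempts (assumed default)
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # -a preserves timestamps/permissions, -v lists files,
        # --partial keeps partly transferred files for the next attempt.
        if rsync -av --partial "$src" "$dest"; then
            return 0
        fi
        echo "rsync attempt $try failed" >&2
        try=$(( try + 1 ))
        # Pause only if another attempt is coming.
        if [ "$try" -le "$max_tries" ]; then
            sleep "$pause"
        fi
    done
    return 1
}
```

The function returns 0 as soon as one attempt succeeds and 1 if every attempt fails, so it can be chained with other commands.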
What makes you think you can't use rsync in MobaXterm?
Because I am not able to figure out how to point to my hard drive in this command. How does it know where my files are? My .bam files are on a big external hard drive next to me.
You can access the locally mounted external drive under /mnt in MobaXterm. Use the same drive letter you see under Windows, e.g. /mnt/g/your_bam_files.
As I am rubbish at the command line, I installed FileZilla and WinSCP. I am trying them and waiting until tomorrow to see whether any file gets transferred :(
It is no more difficult than opening a local terminal in MobaXterm and typing a single rsync command. Copy one file to begin with instead of *.bam, until you become comfortable.
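A minimal sketch of what such a command could look like, wrapped in a small function so the options are explained once. The function name `copy_bam`, the file name `sample1.bam`, the user name, the host `cluster.example.ac.uk` and the scratch path are all placeholders invented for illustration; only `/mnt/g/your_bam_files` comes from the post above.

```shell
# Hypothetical helper: copy one file (or a glob of files) with rsync.
# Usage: copy_bam SOURCE DEST
copy_bam() {
    # -a          preserve timestamps and permissions
    # -v          list each file as it is handled
    # --partial   keep a partly transferred file if the copy is interrupted
    # --progress  show how far each file has got
    rsync -av --partial --progress "$1" "$2"
}

# Example call (placeholder user, host and paths; uncomment and adjust):
# copy_bam /mnt/g/your_bam_files/sample1.bam username@cluster.example.ac.uk:/scratch/username/
```

Once a single file has arrived, running `md5sum` on it on both sides (if the tool is available in each place) and comparing the two values is a quick way to confirm the copy is not corrupted before moving on to the rest.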