Running an R script on a VCF file in Ubuntu
23 months ago
Eliza ▴ 40

Hi, I have R code that looks like this (it was tested on a small dataset and works):

library("R.utils")

library("vcfR")

library("stringr")

library("tidyverse")

library("dplyr")
gunzip("gnomad.exomes.r2.1.1.sites.21.vcf.bgz", "gnomad.exomes.r2.1.1.sites.21.vcf")
vc=read.vcfR("gnomad.exomes.r2.1.1.sites.21.vcf")
df=vc@fix
data=as.data.frame(df)
data_snp=data %>%
  filter(str_length(ALT)==1 & str_length(REF)==1)#filtering for SNPs
data_snp$snp_id <- str_c("chr21","-", data_snp$POS)
data_snp$AF_total=str_extract(data_snp$INFO, "(?<=AF=)[^;]+")
data_snp$AF_latin=str_extract(data_snp$INFO, "(?<=AF_amr=)[^;]+")
data_snp$INFO=NULL
data_snp$key=str_c(data_snp$snp_id,"-",data_snp$ALT,"-",data_snp$REF)
df_for_filter=read.csv("merged_df.csv")
df_x=subset(df_for_filter,df_for_filter$CHROM=="chr21")
df_x$snp_id=str_c("chr21","-",df_x$POS)
df_x$key=str_c(df_x$snp_id,"-",df_x$ALT,"-",df_x$REF)
filter_df=data_snp %>% semi_join(df_x, by = "key")
results=read.csv("result_armitage_test.csv")
results$chrom=substr(results$snp_id,1,5)
results_y=subset(results,results$chrom=="chr21")
resuts_21=merge(x = results_y, y =filter_df, by = "snp_id", all.x = TRUE)

write.csv(resuts_21,"result_chr21.csv")

I tried to submit a job on my university cluster, as suggested in the comments:

#!/bin/bash

#SBATCH --time=01:00:00
#SBATCH --ntasks=2
#SBATCH --mem=2G

module load tensorflow/2.5.0
R4  /sci/home/xxxx/chr_21.R

The end result of the code should be a CSV file, as written in the R code: write.csv(results_21, "result_chr21.csv")

In my PWD I got this file: slurm-4706808.out. As I understand it, the job finished. I want to transfer the output to my local PC using:

rsync -ave "ssh -p 12345" user.name@localhost:/sci/home/user.name/xxx /Users/lee/Downloads/xxx

and then open the file in RStudio on my PC. What I don't understand is how to do this if the output on the cluster is in the form of slurm-4706808.out and I need it to be a CSV file. I'm new to Ubuntu :( and would appreciate any help.

I tried to open it on my PC as a text file and got: /var/spool/slurmd/job4707601/slurm_script: line 8: R4: command not found
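
Edit: if I read that error right, the problem is the R4 line in my batch script. Here is a rough sketch of what I think the script should look like instead (the R module name and the memory value are guesses on my part; module avail on the cluster should list the real module):

#!/bin/bash

#SBATCH --time=01:00:00
#SBATCH --ntasks=1             # the R script is single-threaded
#SBATCH --mem=8G               # guess: 2G is probably too little for a whole gnomAD chr21 VCF

module load R                  # placeholder; the actual module name/version depends on the cluster
Rscript /sci/home/xxxx/chr_21.R   # use Rscript; "R4" is not a command
# write.csv() then creates result_chr21.csv in the job's working directory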

ubuntu vcf gnomad R

How did you run that on the cluster? Did you use the scheduler it probably has, or did you just type the command on the head/login node? The "Killed" at the end suggests there is some sort of problem with resources, memory, that sort of thing.


Hi, I just ran the command Rscript chr_21.R on the university's Ubuntu cluster. I didn't use the job scheduler.


I obviously don't know your cluster, but if it's anything like the one I work on, running this command on the login/head node would be a violation of the cluster's policies and could result in your process being killed either manually by a sys admin or automatically via the cgroups system (or something similar).

You could try running it via the job scheduler and see if it works successfully. This is almost certainly the way your cluster is designed to be used anyway.


That's bad. I suggest reading the documentation of your cluster or getting in contact with the IT people. Most clusters have schedulers in place; the head node is often just there to orchestrate everything, with limited CPU and memory. It's not intended for heavy work, and jobs often get automatically (and rightfully -- it's not a work platform :) ) killed.


When you ssh to your university's cluster, you are on a login node that is used to interact with the nodes that perform the actual computations, but it is not meant to execute computations itself. Submitting a job to the scheduler is the proper way to ensure fair sharing of the available compute resources. Rules like the following therefore apply to virtually all HPC clusters:

  • Never run calculations on the home disk
  • Always use the job queueing system
  • The login nodes are only for editing files and submitting jobs
  • Do not run calculations interactively on the login nodes

Since R usually loads all data into memory, your job likely far exceeds the resources you are allowed to use on the login node, and therefore it is killed. Either submit your calculations as a job or temporarily request dedicated resources for interactive use. This usually works via a command like srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i, but may vary. Your HPC centre certainly has a manual on interactive use and the job queue.
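
To illustrate, requesting an interactive session and then running the script might look roughly like this (resource values and the module name are placeholders and will differ between clusters):

# Request an interactive shell on a compute node (example values only)
srun --nodes=1 --ntasks-per-node=1 --mem=8G --time=01:00:00 --pty bash -i

# Once on the compute node: load R (name is cluster-specific) and run the script
module load R
Rscript chr_21.R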


Random guess: perhaps you are exceeding the amount of memory allocated to this job. You say you are using the university cluster. If you are using a job scheduler, like SLURM or SGE, are you allocating enough memory to the job? If you are ssh'ing to a server, does it have enough memory?

