I have been trying to get my hands on ATAC sequencing data analysis using this tutorial designed by Rockfeller University.
https://rockefelleruniversity.github.io/RU_ATAC_Workshop.html. I have been working in a linux cluster with lots of memory.
Following the initial code block I have been able to generate the genome index successfully.
library(Rsubread)
genome <- "ATAC_Data/ATAC_Reference/hg19_Genome.fa"
indexForSubread <- gsub("\\.fa$", "", genome)
buildindex(indexForSubread, genome, indexSplit = FALSE)
The index generation log file is is given below.
Index name : hg19_Genome ||
|| Index space : base space ||
|| Index split : no-split ||
|| Repeat threshold : 100 repeats ||
|| Gapped index : no ||
|| ||
|| Free / total memory : 927.4GB / 1007.8GB ||
|| ||
|| Input files : 1 file in total ||
|| o hg19_Genome.fa ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Check the integrity of provided reference sequences ... ||
|| No format issues were found ||
|| Scan uninformative subreads in reference sequences ... ||
|| 676060 uninformative subreads were found. ||
|| These subreads were excluded from index building. ||
|| Estimate the index size... ||
|| 8%, 1 mins elapsed, rate=2513.4k bps/s ||
|| 16%, 3 mins elapsed, rate=2423.9k bps/s ||
|| 24%, 5 mins elapsed, rate=2386.6k bps/s ||
|| 33%, 7 mins elapsed, rate=2368.8k bps/s ||
|| 41%, 9 mins elapsed, rate=2365.4k bps/s ||
|| 49%, 10 mins elapsed, rate=2371.8k bps/s ||
|| 58%, 12 mins elapsed, rate=2401.5k bps/s ||
|| 66%, 14 mins elapsed, rate=2405.9k bps/s ||
|| 74%, 15 mins elapsed, rate=2451.3k bps/s ||
|| 83%, 17 mins elapsed, rate=2468.2k bps/s ||
|| 91%, 18 mins elapsed, rate=2490.8k bps/s ||
|| 17.4 GB of memory is needed for index building. ||
|| Build the index... ||
|| 8%, 3 mins elapsed, rate=1157.4k bps/s ||
|| 16%, 7 mins elapsed, rate=1108.6k bps/s ||
|| 24%, 11 mins elapsed, rate=1102.2k bps/s ||
|| 33%, 15 mins elapsed, rate=1101.6k bps/s ||
|| 41%, 19 mins elapsed, rate=1111.6k bps/s ||
|| 49%, 22 mins elapsed, rate=1122.2k bps/s ||
|| 58%, 26 mins elapsed, rate=1139.1k bps/s ||
|| 66%, 30 mins elapsed, rate=1144.3k bps/s ||
|| 74%, 33 mins elapsed, rate=1171.2k bps/s ||
|| 83%, 36 mins elapsed, rate=1186.4k bps/s ||
|| 91%, 39 mins elapsed, rate=1203.3k bps/s ||
|| Save current index block... ||
|| [ 0.0% finished ] ||
|| [ 10.0% finished ] ||
|| [ 20.0% finished ] ||
|| [ 30.0% finished ] ||
|| [ 40.0% finished ] ||
|| [ 50.0% finished ] ||
|| [ 60.0% finished ] ||
|| [ 70.0% finished ] ||
|| [ 80.0% finished ] ||
|| [ 90.0% finished ] ||
|| [ 100.0% finished ] ||
|| ||
|| Total running time: 89.4 minutes. ||
||Index /labs/markay/Aranyak/atacseq-rockfeller/hg19_Genome was successf ...
I tried to execute the next step following the next code block
read1 <- "ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz"
read2 <- "ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz"
outBAM <- "ATAC_50K_2.bam"
align(indexForSubread, readfile1 = read1, readfile2 = read2, output_file = outBAM,
nthreads = 2, type = 1, unique = TRUE, maxFragLength = 2000)
When I try to run the R code directly in the cluster it gives the following error
Out of memory. If you are using Rsubread in R, please save your working environment and restart R.
Error in .load.delete.summary(output_file[i]) :
Summary file /labs/markay/Aranyak/atacseq-rockfeller/Sorted_ATAC_50K_2.bam.summary was not generated! The program terminated wrongly!
Hence to get rid of the memory issue I wrapped the second code block as an R script so that I can allocate enough memory and run it as a batch script in the cluster with the following command.
#!/bin/bash
#SBATCH --job-name=atacseq_align
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=batch
#SBATCH --cpus-per-task=10
#SBATCH --time=6:00:00
#SBATCH --account=markay
#SBATCH --mem=500GB
#SBATCH --output=my_script_test_4.log
# load modules
ml R/4.1.2
# execute script
Rscript my_script.R
I get the following error
Error in .check_and_NormPath(index, mustWork = FALSE, opt = "index") :
object 'indexForSubread' not found
Calls: align -> .check_and_NormPath
Execution halted
I have also checked with the Rsubread vignette on their test data and that looks fine.
Any help with someone more experienced in ATAC seq will be greatly appreciated.
I don't think this has anything to do with ATAC-Seq. Show us the content of your
my_script.R
file. In all probability, the error lies there.The content of my_script.R file is the second code block already stated.