I'm trying to run the RnBeads package with the output files from bismark coverage2cytosine
call.
I keep getting the following error -
2020-05-13 16:19:00 3.7 8e-07 STATUS STARTED Setting up Multicore
2020-05-13 16:19:00 3.7 8e-07 INFO Using 2 cores
2020-05-13 16:19:00 3.7 8e-07 STATUS COMPLETED Setting up Multicore
2020-05-13 16:19:00 3.7 8e-07 STATUS STARTED RnBeads Pipeline
2020-05-13 16:19:00 3.7 8e-07 INFO Analysis Title: Arno A. Bi-Sulphite Seq data analysis 2nd try - bismarkCytosine files
2020-05-13 16:19:00 3.7 8e-07 INFO Initialized report index and saved to index.html
2020-05-13 16:19:00 3.7 8e-07 STATUS STARTED Loading Data
2020-05-13 16:19:00 3.7 8e-07 INFO Number of cores: 2
2020-05-13 16:19:00 3.7 8e-07 INFO Loading data of type "bs.bed.dir"
2020-05-13 16:19:00 3.7 8e-07 STATUS STARTED Performing loading test
2020-05-13 16:19:00 3.7 8e-07 INFO The first 10000 rows will be read from each data file
2020-05-13 16:19:00 3.7 8e-07 INFO No column with file names specified: will try to find one
2020-05-13 16:19:00 3.7 8e-07 STATUS STARTED Loading Data From BED Files
2020-05-13 16:19:01 3.7 8e-07 STATUS STARTED Automatically parsing the provided sample annotation file
2020-05-13 16:19:01 3.7 8e-07 STATUS Potential file names found in column 3 of the supplied annotation table
2020-05-13 16:19:01 3.7 8e-07 STATUS COMPLETED Automatically parsing the provided sample annotation file
2020-05-13 16:19:01 3.7 8e-07 INFO Reading BED file: ***/GID4.R1.coverage2cytosine.bed
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got 'CG'
Calls: rnb.run.analysis ... read.table.ffdf -> do.call -> read.delim -> read.table -> scan
Execution halted
The files I'm using looks like that:
head ***/GID4.R1.coverage2cytosine.bed
10 10494 + 0 CG CG CGT
10 10495 - 0 CG CG CGG
10 10497 + 0 CG CG CGC
10 10498 - 0 CG CG CGA
10 10508 + 0 CG CG CGA
10 10509 - 0 CG CG CGG
10 10523 + 0 CG CG CGA
10 10524 - 0 CG CG CGC
10 10535 + 0 CG CG CGC
10 10536 - 0 CG CG CGG
I can understand the message, but I can't figure out, where it looks for the integers and find these CG
.
To run the script i use the following in an script script:
rnb.options(
analysis.name = "bismarkCytosine files",
email = "******",
identifiers.column = "sampleID",
replicate.id.column = "treatment",
import.bed.style = "bismarkCytosine",
assembly = "hg38",
colors.meth = c("#EDF8B1","#41B6C4","#081D58"),
colors.category = c("#1B9E77","#D95F02","#7570B3","#E7298A","#66A61E",
"#E6AB02","#A6761D","#666666","#2166AC","#B2182B",
"#00441B","#40004B","#053061"),
qc.coverage.plots = TRUE,
qc.coverage.histograms = TRUE,
qc.coverage.violins = TRUE,
inference = TRUE,
inference.targets.sva = "treatment", # Column names in the sample annotation table for which surrogate variable analysis (SVA) should be conducted.
inference.sva.num.method = "be", #What function should be used to estimate the number of surrogate variables.
differential.comparison.columns = c("WT_GID4", "WT_MAEA", "GID4_MAEA"), # Column names in the sample annotation table to be used for group definition in the differential methylation analysis.
differential.comparison.columns.all.pairwise = c("WT_GID4", "WT_MAEA", "GID4_MAEA"), # Column names in the sample annotation table to be used for group definition in the differential methylation analysis in which all pairwise comparisons between groups should be conducted (the default is "one vs all" if multiple groups are specified in a column).
region.types = c("genes", "promoters"), # Region types to carry out analysis on (this would remove the analysis for sites [done only to shorten the process.])
differential.site.test.method = "limma", # Method to be used for calculating p-values on the site level.
differential.variability = TRUE, # With this analysis, the variance inside each group is sused to detect differences among them.
differential.variability.method = "diffVar",
differential.enrichment.go = TRUE, # whether Gene Ontology (GO)-enrichment analysis is to be conducted on the identified differentially methylated regions.
differential.enrichment.lola = TRUE, # whether LOLA-enrichment analysis is to be conducted on the identified differentially methylated regions.
differential.enrichment.lola.dbs = c("${LOLACore}"), # database for LOLA enrichment analysis
export.to.trackhub = c("bigBed","bigWig"), # create tracks and hub structure
export.to.csv = TRUE,
# several parameters for better memory management and parallel processing
disk.dump.big.matrices = TRUE,
disk.dump.bigff = TRUE,
logging.disk = TRUE,
enforce.memory.management = TRUE
)
theme_set(theme_bw())
num.cores = 2
parallel.setup(num.cores)
rnb.run.analysis(dir.reports = report.dir, sample.sheet = sample_annotations,
data.dir = bed.dir, data.type = "bs.bed.dir",
build.index = TRUE,save.rdata = TRUE, initialize.reports = TRUE)
# shut down parallel processing
parallel.teardown()
The only two things I have changed from the run with the cov output file from bismark are the line
import.bed.style = "bismarkCytosine",
and a different list of input files. In the bismark run my coverage2cytosine files were chormosome-specific, so I concatenate them with cat
. Can this cause a problem?
Any ideas?
thanks Assa
Does it make any difference that in RnBeads' vignette the example input files for
bismarkCytosine
(and the rest) has chromosomes in the "chr" format instead of just numbers?I hope not, as these are only the annotations, depending on the source of the genome, e.g. Ensembl vs. UCSC