Hi all, I am trying to import a multiVCF file (created with GATK, ~80 individuals, 4.3 GB) into R using the package "PopGenome" and its readData function. Unfortunately, the import always aborts with the error message "R encountered a fatal error, session terminated". With a smaller data set it works fine. I also tried compressed VCFs (bgzip), but that did not work for me either. Am I missing something? Does my PC simply not have enough computing resources? Has anyone had similar experiences or knows how to solve this problem? I would be very grateful for any advice.
Kind regards
Pavlo
My code:
library(rtracklayer)   # provides readGFF() and export()

# 'chromosomes' (vector of scaffold names) and 'gff_path' (output directory) are defined beforehand
gff3_out <- c()
my_filter <- c()
for (chr in chromosomes) {
  my_filter <- list(seqid = chr)
  gff3_out  <- file.path(gff_path, paste(chr, ".gff", sep = ""))
  export(readGFF("/path/to/my/gff.gff", filter = my_filter), gff3_out)
}

# split the GATK multiVCF into one VCF per scaffold, then read everything in
PopGenome::VCF_split_into_scaffolds("my_multiVCF_from_GATK.vcf", "scaffoldVCFs2")
allgenomes <- PopGenome::readData("path/to/data/with_VCFs", format = "VCF",
                                  gffpath = "path/to/data/gff_data", big.data = TRUE)
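The attempt with the bgzipped VCF was roughly along these lines, reading one scaffold at a time with readVCF (which needs a bgzipped, tabix-indexed file); the scaffold name, region, and chunk size below are placeholders:

# readVCF() expects the VCF to be bgzipped and tabix-indexed beforehand, e.g.
#   bgzip my_multiVCF_from_GATK.vcf && tabix -p vcf my_multiVCF_from_GATK.vcf.gz
one_scaffold <- PopGenome::readVCF(
  "my_multiVCF_from_GATK.vcf.gz",
  numcols = 10000,          # number of SNPs read per chunk
  tid     = "scaffold_1",   # scaffold name as it appears in the VCF
  frompos = 1,
  topos   = 1000000         # end of the region to read
)
PopGenome::get.sum.data(one_scaffold)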
My PC: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz;
RAM 32.0 GB (31.9 GB usable);
System type: 64-bit operating system, x64-based processor
https://stackoverflow.com/q/66543754/680068
Most likely, yes: not enough memory. Try checking memory usage while loading the data; that might help with the diagnosis. Anecdotally, I recently crashed my computer with 32 GB of memory when attempting a plotting operation on a smaller file.
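If it helps, here is a rough sketch of how to watch R's own memory use around the readData() call (the paths are just the placeholders from your post; Task Manager gives the OS-level picture):

gc(reset = TRUE)   # reset the "max used" counters before the big call
allgenomes <- PopGenome::readData("path/to/data/with_VCFs", format = "VCF",
                                  gffpath = "path/to/data/gff_data", big.data = TRUE)
print(gc())        # the "max used" column shows the peak memory R reached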
Hi Carambakaracho, thank you for your help.
Yes, you're right, it's probably related to my 32 GB of RAM not being enough. It's funny, because the paper describing the package (Pfeifer et al., Mol. Biol. Evol. 31(7):1929-1936, doi:10.1093/molbev/msu136) says that even large data sets can be loaded without problems on normal PCs (8 GB RAM). I have now tried the same task on a Linux server and it works great. Who knows where exactly the problem lies.
I have now checked the memory usage under Windows while the dataset was being read by readData, and it gave no indication that a lack of RAM was the cause of the crash.
Unfortunately, the crash message is pretty generic, but it does point strongly towards a memory issue.
Another hypothesis would be that one or a few lines in your dataset contain an offending value. You could split the set in half and see whether you get the error with one half or the other, then split the failing half again, and so forth (roughly sketched below).
Though I wouldn't expect faulty data to produce the error message you got.
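A rough sketch of such a bisection over the per-scaffold VCFs you already created with VCF_split_into_scaffolds() (the directory names and file pattern are guesses on my side; if one half kills the session, repeat the split on that half in a fresh session):

vcfs   <- list.files("scaffoldVCFs2", pattern = "\\.vcf$", full.names = TRUE)
mid    <- ceiling(length(vcfs) / 2)
halves <- list(vcfs[1:mid], vcfs[(mid + 1):length(vcfs)])

for (i in 1:2) {
  dir_i <- file.path("bisect", paste0("half_", i))
  dir.create(dir_i, recursive = TRUE, showWarnings = FALSE)
  file.copy(halves[[i]], dir_i)
  message("Reading half ", i, " (", length(halves[[i]]), " files)")
  genomes_i <- PopGenome::readData(dir_i, format = "VCF", big.data = TRUE)
}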