I split a large VCF file containing multiple samples into small vcf.gz files using a loop. I am now trying to convert these small vcf.gz files to RData files in another loop, but I get a "vector memory exhausted" error. The original file is 35 GB in size; the same code works fine for an original file of 28 GB. I am using R 3.6.0 on a Mac with 16 GB RAM.
# total.lines holds the number of lines in input.vcf.gz (e.g. obtained with wc -l)
sequence <- round(seq(1, total.lines, length.out = 51))

# write 50 line-range chunks of the input file
for (i in 2:51) {
  print(i)
  system(paste0("sed -n '", sequence[i - 1], ",", sequence[i], "'p input.vcf.gz > input", i - 1, ".vcf.gz"))
}
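In case input.vcf.gz is genuinely gzip-compressed (so that sed would see compressed bytes rather than text lines), a decompress-on-the-fly variant of the same loop could look like the sketch below; it assumes gzip is available on the PATH and uses the same chunk names as above.

for (i in 2:51) {
  print(i)
  # decompress, cut out the line range, and recompress each chunk
  system(paste0("gzip -cd input.vcf.gz | sed -n '", sequence[i - 1], ",", sequence[i],
                "p' | gzip > input", i - 1, ".vcf.gz"))
}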
library(data.table)

for (i in 1:50) {
  print(i)
  df <- fread(paste0("input", i, ".vcf.gz"))
  if (i == 1)
    COLs <- colnames(df)   # save the column names detected for the first chunk
  colnames(df) <- COLs     # reuse them for every chunk
  save(df, file = paste0("input", i, ".RData"))
}
To solve the memory issue, I have tried raising R's vector memory limit via ~/.Renviron:

cd ~
touch .Renviron
open .Renviron

and then added the line

R_MAX_VSIZE=100Gb

But it did not solve the problem, and R aborts with a fatal error. I want to make the RData files from the vcf.gz chunks of the 35 GB file for further analysis. Thank you for your help!
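As a quick sanity check (a minimal sketch; R_MAX_VSIZE is only read when R starts, so it has to be tested in a fresh session after editing ~/.Renviron):

# Should print "100Gb"; an empty string means the setting was not picked up
Sys.getenv("R_MAX_VSIZE")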
What is the final goal, i.e. how is the output supposed to look? I am 99.9999% sure there is a simple command-line / bcftools way of doing this outside of R.

Once the RData files are made, I filter them for missing and redundant markers using:
Then I use this text file for making DAPC graphs in R. So my final goal here is to get a text file, filtered for missing and redundant markers, which can be used in R for DAPC. And I would like to try various threshold values for filtering out missing marker information so I can choose the best one. Thank you!
Please show a representative data example that indicates how the output should look. You should not expect people to dig into your code to figure out how things might look. Simply make a short representative input and output and paste them here.
I apologize for any confusion. I am using an input text file (a merged VCF from multiple samples). For illustration, I kept four samples here. A few lines from the original input file, which has a size of 35 GB, look like:
So I want to filter out NA, which represents missing data, by keeping a threshold such as 5% missing data allowed, and also remove any redundant markers, e.g. I want to keep only one copy of T T T T and filter out the duplicates. So these lines in the final file look like:

Assume your data file is called dat; will the following code work? (Replace 0.05 with the threshold you want.)
Thank you, @Sam! The first issue I had with my data file was its big size: R freezes or aborts if I try to import the file directly. That's why I split it into smaller parts and combine them later after filtering.
You can, in theory, run the above script on each small part of the file. Then you can combine the data and do a unique() again at the very end.
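This is not the code from the answer above (which is not reproduced here), just a minimal sketch of that per-chunk approach. It assumes the chunks are the inputK.RData files from the question, that the first 9 columns are the usual VCF annotation columns while the remaining ones hold per-sample genotype calls with NA for missing, and that 0.05 is the chosen missing-data threshold; the combined, de-duplicated result is written to a hypothetical filtered_markers.txt for the DAPC step.

library(data.table)

n.annot  <- 9      # assumed number of non-sample (annotation) columns
max.miss <- 0.05   # allowed fraction of missing calls per marker

filtered <- vector("list", 50)
for (i in 1:50) {
  load(paste0("input", i, ".RData"))                       # loads the data.table 'df' saved earlier
  geno.cols <- names(df)[-(1:n.annot)]                     # per-sample genotype columns
  keep <- rowMeans(is.na(df[, ..geno.cols])) <= max.miss   # markers passing the NA threshold
  filtered[[i]] <- unique(df[keep], by = geno.cols)        # one copy per genotype pattern within the chunk
}

out <- unique(rbindlist(filtered), by = geno.cols)         # combine chunks and de-duplicate again
fwrite(out, "filtered_markers.txt", sep = "\t")            # text file for the DAPC step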