Hello everyone!
I have a very large VCF file (>400 GB), and I want to split it for use with VEP. VEP recommends splitting the VCF, so I generated a list of regions from the contigs in the header, with 3^7 bases per piece for each chromosome. This gave me a list like this:
All the alt/small contigs are excluded because there are no variants within them.
I already have my chromosomes separated into different VCF files from a preprocessing step:
chr1.vcf.gz
chr2.vcf.gz
etc
But separating chromosomes like:
bcftools view -r "$CHR" bigvcffile.vcf > "$CHR".vcf
seems very inefficient, because bcftools runs the filter over the whole big file (which takes 5-6 hours) every time I extract a chromosome. I split by chromosome first because it would be even worse to do the length-based filtering with 100 pieces, each one filtered from the original VCF.
How can I use bcftools to split these VCFs based on my region list in a more efficient way? I'm not sure if it is even possible.
A more efficient way would be to make a two-column CSV file, one column with the count value and the other with the chr:start-end range, then use a single loop to do the actual computation. That way, you could even use GNU parallel.
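
A minimal sketch of that idea, with several assumptions: the "count value" is read here as a running chunk number, the contig lengths sit in a tab-separated file (name, length), and the per-chromosome files are bgzipped and indexed (for example with bcftools index -t) so that -r can jump straight to the region instead of scanning the file. The function names, the chunk_N.vcf.gz output names and the contigs.tsv / regions.csv file names are made up for illustration.

```shell
#!/usr/bin/env bash

# Build the two-column CSV: chunk number, chr:start-end.
#   $1 = tab-separated file with "contig<TAB>length" per line (assumed format)
#   $2 = chunk size in bases
make_regions() {
    awk -v size="$2" 'BEGIN { OFS = "," }
    {
        for (start = 1; start <= $2; start += size) {
            end = start + size - 1
            if (end > $2) end = $2          # clip the last chunk to the contig length
            print ++n, $1 ":" start "-" end
        }
    }' "$1"
}

# Extract one chunk from the matching per-chromosome file.
#   $1 = chunk number, $2 = region such as chr1:1-30000000
# Assumes chr1.vcf.gz, chr2.vcf.gz, ... exist and are indexed.
split_chunk() {
    local id="$1" region="$2"
    local chr="${region%%:*}"               # text before the first colon
    bcftools view -r "$region" -Oz -o "chunk_${id}.vcf.gz" "${chr}.vcf.gz"
}

# Single loop over the CSV (hypothetical file names):
#   make_regions contigs.tsv "$CHUNK_SIZE" > regions.csv
#   while IFS=, read -r id region; do split_chunk "$id" "$region"; done < regions.csv
#
# Or with GNU parallel, one job per CSV row:
#   export -f split_chunk
#   parallel --colsep ',' split_chunk {1} {2} :::: regions.csv
```

Because each job reads only its own chromosome file through the index, no job has to scan the 400 GB file. One caveat: a record that spans a chunk boundary (a long indel, for instance) may end up in both neighbouring chunks; newer bcftools versions have a --regions-overlap option that controls this, so check what your version supports.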