4.5 years ago
curious
Right now I am running:
bcftools sort --max-mem 30G --temp-dir {temp_dir} {in_bcf} -Ob > {out_bcf}
This has been running for several days and is still not done. Is there a faster way to sort a BCF? As far as I can tell, bcftools does not have an option to parallelize this step. Could sorting like this even be done in parallel?
Would something like this be faster, even though I think the conversions back and forth between BCF and VCF would add several days of overhead?
(bcftools view -h bcf_unsorted.bcf; bcftools view -H bcf_unsorted.bcf | sort --buffer-size=80% --parallel=24 -k1,1 -k2,2n --temporary-directory=temp_dir) | bcftools view -Ob -o bcf_sorted.bcf -
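The sort keys in that pipeline can be checked on a toy example before committing days of compute. The sketch below uses made-up records and hypothetical file names, and needs only GNU sort, not bcftools. It also tries `sort -m`, which merges inputs that are already sorted on the same keys instead of re-sorting them; whether that is practical at the scale of the real file is not tested here.

```shell
#!/bin/sh
# Toy tab-separated records: CHROM, POS, ID (made-up values for illustration).
workdir=$(mktemp -d)
tab=$(printf '\t')
printf 'chr1\t100\ta\nchr1\t2000\tb\nchr2\t50\tc\n' > "$workdir/sorted_body.tsv"
printf 'chr1\t150\tnew1\nchr2\t10\tnew2\n' > "$workdir/new_sites.tsv"

# Full sort of the appended records: chromosome as text, position as a number.
# LC_ALL=C keeps the text comparison byte-wise and locale-independent.
cat "$workdir/sorted_body.tsv" "$workdir/new_sites.tsv" \
  | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n > "$workdir/full_sort.tsv"

# Merge-only variant: both inputs are already sorted on these keys,
# so sort -m interleaves them without re-sorting everything.
LC_ALL=C sort -m -t "$tab" -k1,1 -k2,2n \
  "$workdir/sorted_body.tsv" "$workdir/new_sites.tsv" > "$workdir/merged.tsv"

# Both approaches should produce the same order on this toy input.
cmp "$workdir/full_sort.tsv" "$workdir/merged.tsv" && echo "identical"
```

Note that `-k1,1` orders chromosomes as text, so the result may not follow the contig order declared in the header.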
Just curious: what was your upstream process that led to such a big unsorted file?
It's just a lot of samples with a lot of sites. I had to add a few sites and I used
bcftools concat
to do that. Now I am just sorting to get them in the right order.
Did you include all reference bases, too?
It is imputed, so it has all sites in the imputation panel, which is very dense, and I have many samples. So it does not contain every position in the reference, if I understand your meaning correctly.
I wondered if your input is not already sorted.
It was sorted at first by chromosome and position; there are on average about 10M variants in each chromosome. I added a few thousand with
bcftools concat
and now I have to sort again by chromosome and position, since it appears concat just appends the new variants as new lines at the end of the file.
My problem is basically just to add a few thousand new sites to an already sorted BCF with millions of sites and a number of samples that makes sorting unwieldy. Originally I was using
concat
and
sort
to do this.
Another idea I just had was to use
bcftools view -R
to get all sites leading up to the first new site that I want to add, add the new site with concat, then use bcftools view -R again to get everything between that site and the next site to add, concat that, rinse and repeat. This would end up with me running concat and passing thousands of BCFs as arguments, though. I would put them in a file list, but this is the general idea:
bcftools concat {first_region.bcf} {first_new_variant.bcf} {second_region.bcf} {second_new_variant.bcf}...
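A rough sketch of the bookkeeping for this idea, under stated assumptions: the chromosome name, the new-site positions, and the region_N.bcf / new_variant_N.bcf file names are all hypothetical. It only computes the region strings and writes the file list in interleaved order; the bcftools calls appear as comments and are not run. Each region ends at the new site's position, so existing records at that position are kept ahead of the new one. Records that span a region boundary might be returned by both neighbouring regions, so the pieces would need checking for duplicates.

```shell
#!/bin/sh
# Hypothetical inputs: one chromosome and the sorted positions of the new sites.
chrom="chr2"
new_positions="150 900"

workdir=$(mktemp -d)
regions="$workdir/regions.txt"
list="$workdir/file_list.txt"
: > "$regions"
: > "$list"

start=1
i=0
for pos in $new_positions; do
  i=$((i + 1))
  # Existing sites from the previous cut up to this position...
  printf '%s:%s-%s\n' "$chrom" "$start" "$pos" >> "$regions"
  # ...followed by the file holding the new site at this position.
  printf 'region_%s.bcf\nnew_variant_%s.bcf\n' "$i" "$i" >> "$list"
  start=$((pos + 1))
done

# Open-ended last chunk: everything after the final new site.
i=$((i + 1))
printf '%s:%s-\n' "$chrom" "$start" >> "$regions"
printf 'region_%s.bcf\n' "$i" >> "$list"

cat "$regions"
cat "$list"

# Each region_N.bcf would then be extracted from the indexed, sorted BCF, e.g.
#   bcftools view -r "chr2:1-150" -Ob -o region_1.bcf sorted.bcf
# and the pieces joined in list order with
#   bcftools concat -f file_list.txt -Ob -o combined.bcf
```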
How big is your BCF?
Chromosome 2 is 340 GB.