4.6 years ago
curious
I want to replace sites in BCF A with the versions of those sites that appear in BCF B.
BCF A:
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chrX 2781512 chrX:2781512:A:G A G . PASS GT: 0|0
chrX 2781514 chrX:2781514:C:A C A . PASS GT: 0|1
chrX 2781518 chrX:2781518:A:G A G . PASS GT: 0|1
BCF B:
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chrX 2781514 chrX:2781514:C:A C A . PASS GT: 0|0
I want BCF C:
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chrX 2781512 chrX:2781512:A:G A G . PASS GT: 0|0
chrX 2781514 chrX:2781514:C:A C A . PASS GT: 0|0
chrX 2781518 chrX:2781518:A:G A G . PASS GT: 0|1
Right now I am basically removing chrX:2781514:C:A from BCF A, then I think I have to concat BCF A and BCF B to get BCF C, then sort BCF C, kind of like this:
bcftools view -e ID=@{remove_snps_list} {BCF A} -Ob > {BCF A_filtered}
bcftools concat {BCF A_filtered} {BCF B} -Ob > {BCF C}
bcftools sort {BCF C} -Ob > {BCF C_sorted}
This is going to take forever with the size of my files. Is there a better way?
Pipe the bcftools commands to save on IO time.
Other than that, the three-step approach seems reasonable and should have the desired effect.
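A rough sketch of the piping suggestion, in case it helps. The function name `concat_and_sort` and the file names are made up for illustration, and `-m 2G` is an arbitrary cap on the sort buffer rather than a recommendation. Only the concat-to-sort hop is piped here, because I'm not certain `bcftools concat` can take one of its inputs from stdin.

```shell
# Hypothetical helper; the name and arguments are illustrative only.
# Usage: concat_and_sort A_filtered.bcf B.bcf C_sorted.bcf
concat_and_sort() {
    # -Ou writes uncompressed BCF, so nothing is compressed and then
    # immediately decompressed between the two tools.
    bcftools concat -Ou "$1" "$2" |
        # "-" reads the stream from stdin; -m caps the in-memory sort buffer.
        bcftools sort -m 2G -Ob -o "$3" -
}
```

This skips writing the unsorted intermediate BCF to disk. As far as I know, `bcftools sort` buffers records up to the `-m` limit and spills to temporary files beyond that, so it should not need the whole file in memory.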
Also, would the BCF be loaded completely into memory before the sort step? I think sorting can only be done on a complete BCF rather than on a stream of sites.
Yeah, the steps seem good: multiple self-contained steps are better than one squashed-up, vague operation or script.
I'm not sure whether the entire BCF will be loaded into memory. It doesn't seem necessary for your case: one could stream one VCF, seek to the matching locations in the other using its index, and replace entries. I'm not sure how bcftools handles this internally, though.
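For what it's worth, here is a minimal sketch of the stream-and-replace idea using plain `awk` on uncompressed, tab-separated VCF text. It is not a bcftools feature, and `replace_sites` is a made-up name. Instead of seeking through an index, it assumes the replacement file (BCF B here) is small enough to hold in memory, and it matches sites on CHROM, POS, REF and ALT.

```shell
# Hypothetical helper; not part of bcftools.
# Usage: replace_sites B.vcf A.vcf > C.vcf
replace_sites() {
    awk -F'\t' '
        # First file (B): remember each record, keyed by CHROM, POS, REF, ALT.
        NR == FNR { if ($0 !~ /^#/) repl[$1 FS $2 FS $4 FS $5] = $0; next }
        # Second file (A): pass header lines through untouched.
        /^#/ { print; next }
        # Emit the record from B when the site matches, otherwise the record from A.
        { key = $1 FS $2 FS $4 FS $5; print ((key in repl) ? repl[key] : $0) }
    ' "$1" "$2"
}
```

Because A is streamed in its original order, the output stays sorted and no separate sort step is needed. Two caveats: sites that exist only in B are dropped rather than added (unlike the concat approach), and an empty B file would confuse the `NR == FNR` test. For BCF input, something like `bcftools view A.bcf | replace_sites B.vcf - | bcftools view -Ob -o C.bcf -` should work, assuming your `awk` treats `-` as stdin.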