Hello, I have two VCFs produced by separate imputation runs (same tool, but different sources for the reference genomes). One of the imputed files generally has higher imputation accuracy, but it fails to impute a large number of SNPs. The second file includes these variants with variable imputation scores (some are very confidently imputed), so I want to add them, along with any variants whose imputation score is higher in the second file than in the first.
The objective is to combine both into a single imputed VCF with the highest available scores: if a SNP is imputed in both files, only the line with the higher score should be kept, taken from whichever VCF it comes from.
Is there a tool that can do that, i.e. concatenate both files (there will be overlap) and then filter out the repeated SNPs with the lower imputation score? Something from vcftools or bcftools?
I thought of the steps below, but I don't know how to complete them:
- bcftools concat -Oz -o concatenated.vcf.gz imputedfile1.vcf.gz imputedfile2.vcf.gz
- bcftools sort -Oz concatenated.vcf.gz -o sorted_concatenated.vcf.gz
- I don't want to filter on a hard cutoff such as the following; I want to drop the lower-scoring copy of each variant imputed in both files and keep the higher-scoring one: bcftools view -i 'INFO/DR2>=0.8' sorted_concatenated.vcf.gz > sorted_concatenatedDR20.8.vcf
- So I'm thinking of: zgrep -v "^#" sorted_concatenated.vcf.gz | awk '{if (!seen[$1,$2,$3,$4,$5]++)print $0}' > uniquelines.vcf
What this does is go through the concatenated file, which has some variant overlap (the same variant written twice with different imputation scores, because it was imputed differently), and write only the first instance of each variant: when the same first five columns (CHROM, POS, ID, REF, ALT) have already been seen, the line is not written. I want to add a condition so that, when those five columns have been seen before, awk reads the DR2 (imputation score) value from the INFO column and keeps only the line with the higher DR2. I'm not sure how to do that. I also suspect this would be much slower and less efficient than a dedicated filtering tool, such as bcftools filtering on an INFO score, but I don't know how to tell bcftools to keep the higher-scoring of several identical entries.
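To make the idea concrete, here is a rough sketch of the awk logic described above, not a finished solution. It rests on several assumptions: the input is already sorted by position, DR2 appears in the INFO column as `DR2=<number>` (for a multi-allelic record with a comma-separated DR2, only the first value is compared), and two records count as the same variant when CHROM, POS, REF and ALT match. ID is left out of the key in case the two reference sources assign different IDs. The function name `keep_best_dr2` is made up for illustration.

```shell
# Hypothetical helper: for each variant (CHROM, POS, REF, ALT) keep only the
# record with the highest INFO/DR2. Reads uncompressed, position-sorted VCF
# text on stdin and writes VCF text to stdout.
keep_best_dr2() {
  awk -F'\t' '
    # Print the best record of every variant buffered for the current position,
    # then clear the buffers.
    function emit_group(   i) {
      for (i = 1; i <= count; i++) print line[order[i]]
      count = 0
      split("", best); split("", line); split("", order)
    }
    /^#/ { print; next }                   # header lines pass through unchanged
    {
      pos = $1 SUBSEP $2                   # CHROM + POS
      if (pos != cur) { emit_group(); cur = pos }
      key = $4 SUBSEP $5                   # REF + ALT
      dr2 = -1                             # a record without DR2 loses to any scored record
      n = split($8, info, ";")
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^DR2=/) dr2 = substr(info[i], 5) + 0
      if (!(key in best)) { order[++count] = key; best[key] = dr2; line[key] = $0 }
      else if (dr2 > best[key]) { best[key] = dr2; line[key] = $0 }
    }
    END { emit_group() }
  '
}

# Possible usage (file names as in the steps above; the output name is an example):
# gzip -dc sorted_concatenated.vcf.gz | keep_best_dr2 | bgzip -c > best_dr2.vcf.gz
```

Because the input is sorted, only the records sharing one position are held in memory at a time, so this should stay manageable even for large files. When two copies have equal DR2, the one that appears first in the sorted file is kept.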
Thank you for your time; I would greatly appreciate your help.