Question

Concatenating 2 differently-imputed VCF files, then filtering duplicate variants by imputation score to keep the higher-scoring imputed variant

2 days ago

SalmaElShafie • 0

Hello, I have 2 VCFs resulting from different imputation runs (same tool, but using different sources for reference genomes). One of the imputed files generally has higher imputation accuracy, but it fails to impute a large number of SNPs. The second file includes these variants with variable imputation scores (some are very confidently imputed), so I want to add those, and also take from the 2nd file any variant whose imputation score is higher than in the first file.

The objective is to combine both into one imputed VCF file with high scores (so if a SNP is imputed in both, only the line with the higher score is included, from whichever VCF it comes).

Is there a tool that can do that, i.e. concatenate both files (there will be overlap), then filter out the repeated SNPs with the lower imputation score? Something from vcftools or bcftools?

I thought of the commands below, but I don't know how to complete them:

bcftools concat -Oz -o concatenated.vcf.gz imputedfile1.vcf.gz imputedfile2.vcf.gz
bcftools sort -Oz -o sorted_concatenated.vcf.gz concatenated.vcf.gz

I don't want to use the following approach, which filters on a hard cutoff; I want to filter out the lower-scoring copy of each variant imputed in both files and keep the higher one:

bcftools view -i 'INFO/DR2>=0.8' sorted_concatenated.vcf.gz > sorted_concatenatedDR20.8.vcf

So I'm thinking of:

zgrep -v "^#" sorted_concatenated.vcf.gz | awk '{if (!seen[$1,$2,$3,$4,$5]++) print $0}' > uniquelines.vcf

What this does is go through the concatenated file, which has some variant overlap (the same variant written twice, with different imputation scores because it was imputed differently), and write only the first instance of each line: when the same first 5 columns (CHROM, POS, ID, REF, ALT) have already been written, it won't write that line. But I want to add a condition so that, when those first five columns have been seen before, a second awk step reads the DR2 (imputation score) value in the INFO column and keeps only the line with the higher DR2 value. I'm not sure how to do that and would appreciate your help. I also feel this would be much slower and less efficient than using a dedicated filtering tool, such as bcftools filtering on an INFO score, but I'm not sure how to tell it to write only the highest-scoring line when the same entry appears more than once.
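
A minimal sketch of the condition I have in mind, assuming the input is an uncompressed VCF stream already sorted by position, and that DR2 appears in the INFO column as a single `DR2=<number>` entry. The function name `keep_best_dr2` and the awk helpers `dr2` and `emit_site` are hypothetical names of my own, not part of bcftools:

```shell
# Hypothetical helper: for records sharing the first five columns
# (CHROM, POS, ID, REF, ALT), keep only the one with the highest INFO/DR2.
# Assumes a position-sorted, tab-separated VCF on stdin.
keep_best_dr2() {
  awk -F'\t' '
    # Return the DR2 value found in the INFO column, or -1 if there is none.
    function dr2(info,    n, i, parts) {
      n = split(info, parts, ";")
      for (i = 1; i <= n; i++)
        if (parts[i] ~ /^DR2=/) return substr(parts[i], 5) + 0
      return -1
    }
    # Print the best record of every variant buffered for the current site,
    # in the order the variants were first seen, then empty the buffer.
    function emit_site(    i) {
      for (i = 1; i <= nkeys; i++) print best[order[i]]
      split("", best)   # portable way to clear an awk array
      nkeys = 0
    }
    /^#/ { print; next }                       # pass header lines through
    {
      site = $1 SUBSEP $2
      if (site != cur) { emit_site(); cur = site }
      key = $3 SUBSEP $4 SUBSEP $5
      s = dr2($8)
      if (!(key in best)) {                    # first copy of this variant
        order[++nkeys] = key; best[key] = $0; score[key] = s
      } else if (s > score[key]) {             # later copy with a higher DR2
        best[key] = $0; score[key] = s
      }
    }
    END { emit_site() }
  '
}
```

It could then be used as `bcftools view sorted_concatenated.vcf.gz | keep_best_dr2 | bgzip -c > best_dr2.vcf.gz` (the output name is just an example). Since it only buffers the records at one position at a time, it should not need to hold the whole file in memory, but it does rely on the file being sorted first.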

Thank you for your time; I will greatly appreciate your support.

Imputation filtration score duplicate variants • 732 views
