Hi all,
Recently, I've had some trouble with a VCF file in which a program had turn around my REF and ALT fields (e.g. REF=C, ALT=T, but should be REF=T, ALT=C) and also their related genotypes.
I've managed to turn around the REF and ALT fields, but I'm not able to update my genotypes in an efficient way. Right now, I'm using this command:
rs=$(cat listSNPs1.txt)
for i in $rs; do
echo "SNP: $i"
sed -i -e "/$i/{s/0\/0/2\/2/g; s/1\/1/0\/0/g; s/2\/2/1\/1/g}" myfile.vcf
done
listSNPs1.txt
contains a list of the SNPs to be updated (one column, each line an rs
). With this 0/0 is updated to 2/2, 1/1 is updated to 0/0, and 2/2 is finally updated to 1/1.
This program is actually working, but it's gonna take ages, as I have more than 100.000 SNPs to update.
Does anyone know a better alternative? Maybe using vcftools
or plink
but, for the moment, I haven't been able to find any parameter that suits my necessities.
Thank you very much in advanced :)
Maybe python should be a better option here imho. You do not have to read your vcf file for each SNPs in
listSNPs1.txt
, this is time consuming. Better to make a dictionnary with your SNPs then read your vcf file line by line and check if it exists in your dictionnaryOkey, right, thanks :)! But then, how do I change the genotypes? Still with my
sed
command?? Or somehow with Python?You can use some regex in python too with
re
package. As you don't have many type of genotype, you can also write someif/else
statementsOkey, I'll try that, thanks!
If your vcf file only have SNPs (or the REF/ALT swap is just a problem for SNPs) you can use
bcftools
to fix it.