Hi, I'm trying to build a HISAT2 index with my own SNP VCF file. I noticed that for SNPs with multiple alternative alleles, hisat2_extract_snps_haplotypes_VCF.py
only writes the first alternative allele to the output .snp file. My command is:
hisat2_extract_snps_haplotypes_VCF.py --non-rs GRCh37.genome.fa test.vcf.gz test
I tried two ways of writing the input VCF:
22 40042284 22:40042284 A T,G . PASS AF=0.02106;MAF=0.02106;R2=0.92725 GT:DS:GP 0|1:1:0,1,0
and
22 44676852 22:44676852 T G . PASS AF=0.61037;MAF=0.38963;R2=0.99104 GT:DS:GP 1|0:1:0,1,0
22 44676852 22:44676852 T A . PASS AF=0.00574;MAF=0.00574;R2=0.72263 GT:DS:GP 1|0:1:0,1,0
The output is:
22:40042284.0 single 22 40042283 T
22:44676852 single 22 44676851 G
I wonder if I should rename the SNPs to something like 22:44676852.0 and 22:44676852.1 to force hisat2_extract_snps_haplotypes_VCF.py
to output both alternative alleles? I'm worried that if I do so, something would go wrong when I run hisat2-build.
Thanks in advance.
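To make the renaming idea concrete, here is a minimal Python sketch that writes one .snp-style line per alternative allele, adding a .0, .1, ... suffix to the ID. It is not the logic of hisat2_extract_snps_haplotypes_VCF.py: the function name snp_lines is hypothetical, the five-column layout is only inferred from the output shown above, it handles single-base substitutions only, and whether hisat2-build accepts the result is untested. The example records are shortened versions of the ones in the question.

```python
def snp_lines(vcf_lines):
    """Yield one .snp-style line per single-base ALT allele (illustrative only)."""
    seen = {}  # (chrom, pos) -> number of alleles already emitted at that site
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, vid, ref, alt = line.split()[:5]
        for allele in alt.split(","):
            # Only simple substitutions are handled; indels are skipped here.
            if len(ref) != 1 or len(allele) != 1:
                continue
            idx = seen.get((chrom, pos), 0)
            seen[(chrom, pos)] = idx + 1
            # Position column is 0-based (VCF POS minus 1), as in the output above.
            yield f"{vid}.{idx}\tsingle\t{chrom}\t{int(pos) - 1}\t{allele}"


# Shortened records covering both input styles from the question.
records = [
    "22 40042284 22:40042284 A T,G . PASS AF=0.02106 GT 0|1",
    "22 44676852 22:44676852 T G . PASS AF=0.61037 GT 1|0",
    "22 44676852 22:44676852 T A . PASS AF=0.00574 GT 1|0",
]
for out in snp_lines(records):
    print(out)
```

With these records the sketch emits four lines, two per position, so both the comma-separated and the split-line form end up with one entry per allele. It produces no haplotype information, which the real script also writes.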
I don't know about HISAT2, but I know a valid VCF can only have one line per position, so your second input must be invalid. I think your input VCF only needs to present a position once for multiple SNPs. See the VCF format documentation.

Thanks Matthew. My VCF contains the results of a genotype array, so I think that's why the different alternative alleles are listed separately.
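
If you want to try the one-line-per-position layout, the following Python sketch collapses records that share CHROM, POS and REF into a single line with a comma-separated ALT field. The function name merge_split_sites is hypothetical, and this is a sites-only illustration: it drops the FORMAT and sample columns and resets QUAL, FILTER and INFO to ".", because genotypes and per-allele values such as AF would have to be recoded rather than copied. For real data, a dedicated tool is probably the safer choice (bcftools norm with -m +any is, as far as I know, intended for this kind of merge). Whether the HISAT2 script then writes both alleles is not something this sketch can confirm.

```python
def merge_split_sites(vcf_lines):
    """Collapse records sharing CHROM, POS and REF into one sites-only line."""
    sites = {}  # (chrom, pos, ref) -> [first ID seen, list of ALT alleles]
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, vid, ref, alt = line.split()[:5]
        entry = sites.setdefault((chrom, pos, ref), [vid, []])
        for allele in alt.split(","):
            if allele not in entry[1]:
                entry[1].append(allele)  # keep first-seen order, no duplicates
    for (chrom, pos, ref), (vid, alts) in sites.items():
        # QUAL, FILTER and INFO are set to "." since per-allele values
        # cannot simply be taken from one of the input lines.
        yield "\t".join([chrom, pos, vid, ref, ",".join(alts), ".", ".", "."])


# The two split records from the question, shortened.
split_records = [
    "22 44676852 22:44676852 T G . PASS AF=0.61037 GT 1|0",
    "22 44676852 22:44676852 T A . PASS AF=0.00574 GT 1|0",
]
for merged in merge_split_sites(split_records):
    print(merged)
```

The two input records become a single record for 22:44676852 with ALT "G,A", which matches the layout of the first input style in the question.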