vcf file filter
2
I need to filter my vcf file to include variants with at least 30 individuals in each of the possible groups: major allele homozygous, heterozygous, and minor allele homozygous; would be grateful for any input. Thanks!
filter
vcf
SNPs
minor allele homozygotes
• 2.6k views
You can do this in Hail . Here's python code that should work with the 0.1 interface:
from hail import *
hc = HailContext()
vds = hc.import_vcf('my.vcf.bgz')\
.split_multi()\ # split multiallelics, because variant_qc requires biallelics
.variant_qc() # conveniently produces the metrics you want
# remove sites that fail your criteria
vds = vds.filter_variants_expr('va.qc.nHomVar >= 30 && va.qc.nHet >= 30 && va.qc.nHomRef >= 30')
# export to VCF
vds.export_vcf('filtered.vcf.bgz')
Alternatively, continue downstream analysis in Hail!
using vcffilterjdk : http://lindenb.github.io/jvarkit/VcfFilterJdk.html
java -jar dist/vcffilterjdk.jar -e 'Map<GenotypeType,Long> map=variant.getGenotypes().stream().map(G->G.getType()).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())) ; return map.containsKey(GenotypeType.HET) && map.get(GenotypeType.HET)>30 && map.containsKey(GenotypeType.HOM_REF) && map.get(GenotypeType.HOM_REF)>30 && map.containsKey(GenotypeType.HOM_VAR) && map.get(GenotypeType.HOM_VAR)>30 ;' input.vcf
and using the recent version of vcffilterjdk, it's even shorter:
java -jar dist/vcffilterjdk.jar -e 'com.github.lindenb.jvarkit.util.Counter<String> h = new com.github.lindenb.jvarkit.util.Counter(variant.getGenotypes().stream().map(G->G.getType().name())); return h.count("HOM_REF")>30 && h.count("HOM_VAR")>30 && h.count("HET")>30;' input.vcf
Login before adding your answer.
Traffic: 1801 users visited in the last hour
I am getting the following error: NameError Traceback (most recent call last) <ipython-input-21-2ace1e95a049> in <module>() 1 ----> 2 vds = hc.import_vcf('x.clean.tidy.vcf')
NameError: name 'hc' is not defined
In [22]: