Question

How to determine percentage missing genotypes in VCF/BCF?

7.4 years ago

William ★ 5.3k

Is there a command line tool that has a quick function to determine the percentage of missing genotypes in a VCF/BCF file?

Or the inverse of the missing percentage, the genotype occupancy % of the VCF/BCF file?

Preferably a tool where I don't need to accumulate the per-sample or per-variant missingness into the overall matrix missingness myself.

vcf • 7.7k views


RTG Tools can help; its vcfstats command prints summary statistics for a VCF:

java -jar RTG.jar vcfstats input.vcf > output.txt

7.4 years ago by reza ▴ 300

score 4 · Accepted Answer · 2017-07-02

EDIT: a second solution, using only the command line:

grep -v "^#" in.vcf  |\
cut -f 10- | tr "\t" "\n" | cut -d ':' -f 1 |\
awk '/^\.([\/|]\.)*$/ {NC++} END {printf("%f\n", NC/(1.0*NR))}'
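
For anyone who prefers Python to awk, here is a rough standard-library sketch of the same count. It assumes an uncompressed VCF whose FORMAT column starts with GT, and it treats a genotype as missing only when every allele is '.', so it also catches '.|.' and haploid '.'. The function name missing_fraction and the two-sample example records are made up for illustration.

```python
import re


def missing_fraction(vcf_lines):
    """Return the fraction of genotypes whose GT alleles are all '.'."""
    total = 0
    missing = 0
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Sample columns start at the 10th field; GT is assumed to be first.
        for sample_field in fields[9:]:
            gt = sample_field.split(":", 1)[0]
            alleles = re.split(r"[/|]", gt)
            total += 1
            if all(a == "." for a in alleles):
                missing += 1
    return missing / total if total else float("nan")


# Hypothetical two-sample VCF, used only to exercise the function.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t.\t.\t.\tGT:DP\t0/1:12\t./.:0\n",
    "1\t200\t.\tC\tT\t.\t.\t.\tGT:DP\t.|.:0\t1/1:8\n",
]
print(missing_fraction(example))
```

For a real file, pass an open file handle instead of the list, e.g. missing_fraction(open("in.vcf")).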

Using bioalcidae:

var i,samples = header.getSampleNamesInOrder();
var sample2count={};
for(i=0;i< samples.size();++i) {
    sample2count[ samples.get(i) ]={"count":0,"nocall":0};
    }
while(iter.hasNext())
    {
    var vc = iter.next();
    for(i=0;i< vc.getNSamples();++i)
        {
        var g = vc.getGenotype(i);
        var data = sample2count[ g.getSampleName() ];
        data.count++;
        if(g.isNoCall()) data.nocall++;
        }
    }
for(sample in sample2count)
{
out.println(sample+"\t"+sample2count[sample].count+"\t"+sample2count[sample].nocall+"\t"+(sample2count[sample].nocall/(1.0*sample2count[sample].count)));
}

usage:

java -jar  dist/bioalcidae.jar -f script.js input.vcf

score 3 · Accepted Answer · 2022-05-17

2.5 years ago

chrchang523 11k

plink2 --vcf <filename> --genotyping-rate

plink2 --bcf <filename> --genotyping-rate


score 2 · Accepted Answer · 2022-05-17

Here is a solution using bcftools version 1.15.1:

Use the +fill-tags plugin to add an INFO field, FMISS, holding the fraction of samples with missing data at each site.
Pipe that into bcftools query to print the FMISS field.
Pipe that into awk to compute the mean of the FMISS values.

bcftools +fill-tags /tmp/all.vcf.gz  -- -t 'FMISS=F_MISSING' |  \
bcftools query -f '%FMISS\n' |  \
awk '{sum+=$1; n++} END {print sum/n}' 

# produces 0.75136
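
A note on why averaging FMISS works: every site has the same number of samples, so the mean of the per-site fractions equals the overall fraction of missing genotypes. Here is a tiny Python sketch of that arithmetic on a made-up genotype matrix. It only mimics the calculation and does not call bcftools.

```python
# Toy genotype matrix: one row per site, one column per sample (hypothetical data).
MISSING = {".", "./.", ".|."}
genotypes = [
    ["0/1", "./.", "0/0", "1/1"],  # 1 of 4 missing
    ["./.", ".|.", "0/1", "./."],  # 3 of 4 missing
]

# Per-site fraction of missing genotypes, analogous to the FMISS column.
fmiss = [sum(gt in MISSING for gt in row) / len(row) for row in genotypes]

# Mean over sites, which is what the awk step computes.
mean_fmiss = sum(fmiss) / len(fmiss)

# Overall fraction computed directly from every genotype in the matrix.
n_missing = sum(gt in MISSING for row in genotypes for gt in row)
n_total = sum(len(row) for row in genotypes)
overall = n_missing / n_total

print(fmiss, mean_fmiss, overall)
```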

If desired, you could also modify this to compute a histogram of how many sites have 0, 1, 2, ... samples with a missing genotype. For my example data set (eight samples, and lots of missing data) that looks like:

bcftools +fill-tags /tmp/all.vcf.gz  -- -t 'NMISS=N_MISSING' | \
bcftools query -f '%NMISS\n' | \
awk '{n[$1]++} END {for(i in n) print i,n[i]}' | \
sort -n -b -k 1

# produces this:
0 133
1 130
2 118
3 151
4 245
5 329
6 883
7 3249
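
As a sanity check, the histogram can be turned back into an overall missing fraction. A short Python sketch, assuming the histogram and the 0.75136 mean above were computed from the same eight-sample file:

```python
# Histogram from the post: number of missing samples -> number of sites.
histogram = {0: 133, 1: 130, 2: 118, 3: 151, 4: 245, 5: 329, 6: 883, 7: 3249}
n_samples = 8  # stated in the post

n_sites = sum(histogram.values())
n_missing = sum(k * count for k, count in histogram.items())
overall_missing = n_missing / (n_sites * n_samples)

print(n_sites, n_missing, round(overall_missing, 5))
# prints: 5238 31485 0.75136
```

The result agrees with the mean of the FMISS values, as expected when both commands are run on the same file.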

score 1 · Accepted Answer · 2017-07-02

I always prefer not having to write custom code, but this is short and will do the trick:

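from cyvcf2 import VCF

reader = VCF("my.vcf.gz")
num_var = 0      # number of variant records seen
num_called = 0   # called (non-missing) genotypes, summed over all records
for variant in reader:
    num_var += 1
    num_called += variant.num_called
# Percentage of called genotypes (occupancy); 100 minus this is the percentage missing.
occupancy = num_called / (num_var * len(reader.samples)) * 100
print(occupancy)
# 55.44748947440677 on my file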