How to determine percentage missing genotypes in VCF/BCF?
4
0
Entering edit mode
7.4 years ago
William ★ 5.3k

Is there a command line tool that has a quick function to determine the percentage of missing genotypes in a VCF/BCF file?

Or the inverse of the missing percentage, the genotype occupancy % of the VCF/BCF file?

Preferable a tool where I don't need to accumulate the missingness per sample of per variant into the matrix missingness myself.

vcf • 7.7k views
ADD COMMENT
0
Entering edit mode

RTG tools can help you

java -jar RTG.jar vcfstats input.vcf > output.txt

ADD REPLY
4
Entering edit mode
7.4 years ago

EDIT : 2nd solution 100% cmd line:

grep -v "^#" in.vcf  |\
cut -f 10- | tr "\t" "\n" | cut -d ':' -f 1 |\
awk '/^\.\/\./ {NC++;} END{printf("%f\n",NC/(1.0*NR))}'

Using bioalcidae:

var i,samples = header.getSampleNamesInOrder();
var sample2count={};
for(i=0;i< samples.size();++i) {
    sample2count[ samples.get(i) ]={"count":0,"nocall":0};
    }
while(iter.hasNext())
    {
    var vc = iter.next();
    for(i=0;i< vc.getNSamples();++i)
        {
        var g = vc.getGenotype(i);
        var data = sample2count[ g.getSampleName() ];
        data.count++;
        if(g.isNoCall()) data.nocall++;
        }
    }
for(sample in sample2count)
{
out.println(sample+"\t"+sample2count[sample].count+"\t"+sample2count[sample].nocall+"\t"+(sample2count[sample].nocall/(1.0*sample2count[sample].count)));
}

usage:

java -jar  dist/bioalcidae.jar -f script.js input.vcf
ADD COMMENT
3
Entering edit mode
2.5 years ago

plink2 --vcf <filename> --genotyping-rate

plink2 --bcf <filename> --genotyping-rate

ADD COMMENT
2
Entering edit mode
2.5 years ago
Eric ▴ 20

Here is a solution with bcftools version 1.15.1

  1. Use the +fill-tags plugin to add an INFO field, FMISS, with the fraction of samples with missing data for each site.
  2. Pipe that into bcftools query to print the FMISS field,
  3. Pipe that to awk to compute the mean of the FMISS fields.
bcftools +fill-tags /tmp/all.vcf.gz  -- -t 'FMISS=F_MISSING' |  \
bcftools query -f '%FMISS\n' |  \
awk '{sum+=$1; n++} END {print sum/n}' 

# produces 0.75136

If desired you could also modify this to compute a histogram of the number of sites missing genotypes at each of a number of individuals. For my example data set (eight samples, and lots of missing data) that looks like:

bcftools +fill-tags /tmp/all.vcf.gz  -- -t 'NMISS=N_MISSING' | \
bcftools query -f '%NMISS\n' | \
awk '{n[$1]++} END {for(i in n) print i,n[i]}' | \
sort -n -b -k 1

# produces this:
0 133
1 130
2 118
3 151
4 245
5 329
6 883
7 3249
ADD COMMENT
1
Entering edit mode
7.4 years ago
William ★ 5.3k

I always prefer not having to write custom code, but this is short and will do the trick:

from cyvcf2 import VCF
reader = VCF("my.vcf.gz")
num_var = 0
num_called = 0
for variant in reader:
     num_var += 1
     num_called += variant.num_called
num_called / (num_var * len(reader.samples)) * 100
55.44748947440677
ADD COMMENT

Login before adding your answer.

Traffic: 2691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6