Missing data per site
3
1
Entering edit mode
2.6 years ago
anna ▴ 20

Hi,

I want to calculate statistics of missing data per each site in my vcf file.

Using vcftools --missing-site gives wrong stats for several sites.

Is there is any other way to calculate it?

Thank you!

I have 36 samples and here is an example of the vcftools --missing-site output for four positions below

chr1    10616   rs376342519     CCGCCGTTGCAAAGGCGCGCCG  C       194.49  PASS    AC=3;AF=0.083;AN=36;DB;DP=132;ExcessHet=0.1902;FS=0;InbreedingCoeff=0.4307;MLEAC=5;MLEAF=0.139;MQ=38.12;MQRankSum=-0.842;QD=24.31;SOR=1.329;VQSLOD=-6.944;culprit=MQ    GT:AD:DP:GQ     0/0:2,0:2:6     ./.:0,0:0:.     ./.:0,0:0:.     ./.:0,0:0:.     0/0:5,0:5:15    0/0:2,0:2:3    ./.:0,0:0:.      1/1:0,3:3:10    ./.:0,0:0:.     ./.:0,0:0:.     0/0:9,0:9:24    0/0:11,0:11:27  ./.:0,0:0:.     ./.:0,0:0:.     0/0:9,0:9:15    0/0:19,0:19:54  0/0:4,0:4:12    0/0:9,0:9:2     ./.:0,0:0:.     0/0:9,0:9:9     0/0:8,0:8:18   0/0:8,0:8:24     0/0:6,0:6:15    ./.:0,0:0:.     ./.:0,0:0:.     0/0:2,0:2:3     0/0:6,0:6:6     ./.:0,0:0:.     ./.:0,0:0:.     ./.:0,0:0:.     0/0:8,0:8:24    ./.:0,0:0:.     0/1:3,2:5:76    ./.:0,0:0:.     ./.:0,0:0:.     ./.:2,0:2:.
chr2    120680534       rs74654922      A       ATGTGTGTG       12887.7 PASS    AC=7;AF=0.175;AN=40;DB;DP=751;ExcessHet=3.0103;FS=0;InbreedingCoeff=0.5714;MLEAC=12;MLEAF=0.3;MQ=58.52;NEGATIVE_TRAIN_SITE;QD=29.16;SOR=0.883;VQSLOD=-1.646;culprit=MQ  GT:AD:DP:GQ     1|1:0,10:12:30  0|0:0,1:17:48   .|.:0,2:16:.    0|0:0,1:21:60   0/0:0,1:6:16    ./.:38,0:38:.  0|0:0,0:24:72    ./.:21,0:21:.   0/0:0,0:13:39   .|.:0,1:23:.    0/0:0,2:11:48   0|0:0,1:22:63   .|.:0,0:18:.    ./.:20,0:20:.   ./.:15,0:15:.   1/0:0,12:30:99  0|0:0,0:12:36   1|1:0,19:20:57  .|.:0,1:25:.    ./.:22,0:22:.   0|0:0,0:19:57  0|0:0,0:13:39    .|.:0,1:16:.    0|0:0,0:26:78   ./.:26,0:26:.   0|0:0,0:29:87   0|0:0,1:29:84   1|1:0,19:21:57  .|.:0,0:23:.    0|0:0,2:23:63   ./.:14,0:14:.   0|0:0,0:20:60   ./.:11,0:11:.   0|0:0,0:15:45   ./.:32,0:32:.   ./.:35,0:35:.
chr22   50808269        .       AG      A       105.55  PASS    AC=2;AF=0.059;AN=34;DP=1217;ExcessHet=0.1296;FS=0;InbreedingCoeff=0.1633;MLEAC=7;MLEAF=0.206;MQ=36.17;NEGATIVE_TRAIN_SITE;QD=30.35;SOR=2.584;VQSLOD=-3.491;culprit=MQ   GT:AD:DP:GQ     ./.:22,0:22:.   0/0:38,0:38:0   ./.:24,0:24:.   ./.:20,0:20:.   ./.:26,0:26:.   ./.:43,0:43:.   ./.:26,0:26:.  ./.:19,0:19:.    0/0:30,0:30:0   ./.:26,0:26:.   ./.:27,0:27:.   0/0:43,0:43:0   0/0:22,0:22:0   ./.:31,0:31:.   ./.:28,0:28:.   0/0:34,0:34:0   ./.:26,0:26:.   1/1:0,2:7:7     0/0:51,0:51:0   0/0:33,0:33:0   ./.:21,0:21:.   ./.:9,0:9:.    0/0:22,0:22:0    0/0:28,0:28:0   0/0:41,0:41:0   0/0:61,0:61:0   ./.:54,0:54:.   0/0:43,0:43:0   ./.:36,0:36:.   0/0:52,0:52:0   ./.:17,0:17:.   0/0:51,0:51:0   ./.:19,0:19:.   0/0:63,0:63:0   ./.:34,0:34:.   0/0:43,0:43:0
chr18   67740193        .       C       CTTTTTTTTTTTTTT 8423.06 PASS    AC=9;AF=0.196;AN=46;DP=616;ExcessHet=0;FS=0;InbreedingCoeff=0.618;MLEAC=10;MLEAF=0.217;MQ=57.96;NEGATIVE_TRAIN_SITE;QD=26.16;SOR=0.956;VQSLOD=-2.121;culprit=MQ GT:AD:DP:GQ     .|.:0,4:15:.    .|.:0,5:15:.    1/1:0,2:5:42    .|.:0,2:14:.    1/1:0,5:6:0     0/0:29,0:29:0   0/0:15,0:15:6  0/0:0,1:6:14     0/0:13,0:13:0   .|.:0,4:12:.    0|0:0,6:26:60   .|.:0,3:11:.    .|.:0,1:6:.     0/0:0,1:4:20    1|1:0,20:25:60  0/0:18,0:18:0   .|.:0,4:15:.    .|.:0,4:18:.    0/0:29,0:29:12  0/0:0,5:15:90   0/0:0,2:6:18    0/0:0,1:6:38   .|.:0,4:15:.     0|0:0,1:17:46   0/0:0,3:9:44    .|.:0,6:12:.    0|0:0,1:18:51   0/0:21,0:21:6   .|.:0,8:16:.    .|.:0,5:17:.    .|.:0,4:14:.    1/1:0,1:1:0     0|0:0,2:27:75   1/0:0,1:5:38    0/0:24,0:24:15  0/0:16,0:16:9

And here is the output of vcftools --missing-site

CHR     POS     N_DATA  N_GENOTYPE_FILTERED     N_MISS  F_MISS  
chr1    10616   72      0       36      0.5  
chr2    120680534       66      0       26      0.393939  
chr22   50808269        72      0       38      0.527778  
chr18   67740193        59      0       13      0.220339

When in reality it should be this

chr1    10616   72      0       36      0.5  
chr2    120680534       72      0       32    0.44  
chr22   50808269        72      0       38      0.527778  
chr18   67740193        72      0       26      0.361
vcftools • 1.7k views
ADD COMMENT
1
Entering edit mode

Why do you think it is wrong with vcftools --missing-site exactly ?

ADD REPLY
0
Entering edit mode

Made edits in the question with examples

ADD REPLY
0
Entering edit mode

Thanks for elaborating. Looks like there is a bug, and the tool is counting only half of the missing allele for unphased genotypes (indicated with .|.). It is counting properly for the phased genotype (./.) however. I don't know if there is another tool to do this, but you could try to update vcftools in case the issue got fixed in a newer version. It should also be relatively easy to write a small script that gives you the proper answer.

ADD REPLY
0
Entering edit mode

Yes, the thing that it does not give a right calculation even for unphased data, so DO NOT use vcftools for missing site calculations :)

I might consider awk instead

ADD REPLY
1
Entering edit mode

Crossposted on stackoverflow.

ADD REPLY
2
Entering edit mode
2.6 years ago
anna ▴ 20

This python code would work for this example:

import sys

infile = sys.argv[1]
try:
    ofile = sys.argv[2]
except IndexError:
    ofile = "/dev/stdout"

with open(infile, 'r') as fin:
    with open(ofile, 'w') as ofile:
        ofile.write("\t".join(["CHR", "POS", "N_DATA", "N_GENOTYPE_FILTERED", "N_MISS", "F_MISS\n"]))

        for line in fin:

            if line.startswith("#"): 
                continue

            line = line.strip().split()
            chrom = line[0]
            pos = line[1]
            missing = 0

            for smp in line[9:]:
                missing = missing + smp.split(":")[0].replace("|", "/").split("/").count(".")

            total = 2 * len(line[9:])

            oline = "\t".join([chrom, pos, str(total), "0", str(missing),  str(round(missing / total, 6)), "\n"])
            ofile.write(oline)

Assuming it's in a file called. e.g. missing_sites.py, it can be invoked by:

python missing_sites.py input.vcf 

or

python missing_sites.py input.vcf  outfile
ADD COMMENT
2
Entering edit mode
2.6 years ago
sbstevenlee ▴ 480

Here's another Python solution using the pyvcf submodule from the fuc package I wrote:

>>> from fuc import pyvcf
>>> import pandas as pd
>>> 
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'C'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A': ['0/0', './.', './.'],
...     'B': ['0/0', '1/1', '1/1'],
...     'C': ['0/1', './.', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/0  0/1
1  chr1  101  .   T   C    .      .    .     GT  ./.  1/1  ./.
2  chr1  102  .   A   C    .      .    .     GT  ./.  1/1  0/0
>>> def one_row(row):
...     m = row[9:].apply(pyvcf.gt_miss).mean()
...     s = pd.Series([row['CHROM'], row['POS'], m])
...     return s
... 
>>> df = vf.df.apply(one_row, axis=1)
>>> df.columns = ['CHROM', 'POS', 'Missing']
>>> df
  CHROM  POS   Missing
0  chr1  100  0.000000
1  chr1  101  0.666667
2  chr1  102  0.333333
ADD COMMENT
1
Entering edit mode
2.6 years ago
4galaxy77 2.9k

Almost certainly faster and easier to use plink2:

plink2 
    --vcf in.vcf \ 
    --missing \
    --out missing.out
ADD COMMENT

Login before adding your answer.

Traffic: 1964 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6