Output per variant and per sample heterozygosity fraction from VCF.
2
0
Entering edit mode
8.7 years ago
William ★ 5.3k

As a QC measure I would like to know the per variant and per sample heterozygosity fraction.

I already used vcftools to output the missingness per variant and sample.

https://vcftools.github.io/man_latest.html

Is there any tool that can do the same for heterozygosity?

vcf qc • 5.2k views
ADD COMMENT
0
Entering edit mode

Using vcftools --het ? According to the manual this option calculates a measure of heterozygosity on a per-individual basis.

ADD REPLY
0
Entering edit mode

This functionality seems to be broken in my installed version of vcftools (v0.1.14 (= the latest?) ) . It says all variants and samples are included but outputs a file called out.het with a header " INDV O(HOM) E(HOM) N_SITES F " and no records. I am also still looking for the per site heterozygosity .

ADD REPLY
3
Entering edit mode
8.7 years ago

Here is a VCFLIB solution:

  ./genotypeSummary -f ../samples/1kg-phaseIII-v5a.20130502.genotypes.chr22-16-16.5mb.vcf.gz -y GT -t 0,1,2,3,4

Here is the output:

sample-id   n-nocall    n-hom-ref   n-het   n-hom-alt
HG00096 0   5899    133 49
HG00097 0   5855    149 77
HG00099 0   5909    130 42
HG00100 0   5864    167 50
HG00101 0   5884    132 65
ADD COMMENT
0
Entering edit mode

I get an error about a unknown genotype. The file is produced by Freebayes. but might contain ./. and . as missing genotypes? genotypeSummary -f my_file.vcf.gz -y GT -t 0,1,2,3,4,5 FATAL: unknown genotype: .

Also I would like to do this for a large set of samples so a parsing for a range of targets might be usefull. i.e.: -t 1-100

ADD REPLY
0
Entering edit mode

William Thanks for reporting the error. I just fixed the genotype parser to allow both '.' and './.' for genotypes. Let me know if the most recent version generates the same error; I don't have a test file handy. The range parsing would be ideal. I'll look into that. Feel free to open an issue on github.

ADD REPLY
2
Entering edit mode
8.7 years ago

using bioalcidae : https://github.com/lindenb/jvarkit/wiki/BioAlcidae

var sample2count={};
var samples = header.getSampleNamesInOrder();
for(var i=0;i< samples.size();i++) {
sample2count[samples.get(i)]={
    "nocall":0,
    "homref":0,
    "homvar":0,
    "hetnonref":0,
    "het":0,
    };
}

while(iter.hasNext()) {
    var ctx=iter.next();
    for(var i=0;i< samples.size();i++) {
        var count = sample2count[samples.get(i)];
        var  g = ctx.getGenotype(samples.get(i));
        if( !g.isCalled()) count["nocall"]++;
        else if( g.isHomRef()) count["homref"]++;
        else if( g.isHomVar()) count["homvar"]++;
        else if( g.isHetNonRef()) count["hetnonref"]++;
        else if( g.isHet()) count["het"]++;
        }
    }
 out.println("Sample nocall homref homvar hetnonref het");  
 for(var i=0;i< samples.size();i++)
 {
 var count = sample2count[samples.get(i)];
 out.println(samples.get(i)+" "+ count["nocall"]+" "+count["homref"]+" "+ count["homvar"]+" "+count["hetnonref"]+" "+count["het"]);
 }

run:

 curl -s "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" | gunzip -c  |\
head -n 1000 | \
java -jar ~/src/jvarkit-git/dist/bioalcidae.jar -f script.js -F vcf 

Sample nocall homref homvar hetnonref het
HG00096 0 729 4 0 14
HG00097 0 712 7 0 28
HG00099 0 715 2 0 30
HG00100 0 711 13 0 23
HG00101 0 708 16 0 23
HG00102 0 714 7 0 26
HG00103 0 720 10 0 17
HG00105 0 720 2 0 25
HG00106 0 704 12 0 31
HG00107 0 707 14 0 26
HG00108 0 719 17 0 11
HG00109 0 707 18 0 22
HG00110 0 726 10 0 11
HG00111 0 724 14 0 9
HG00112 0 712 18 0 17
HG00113 0 729 3 0 15
HG00114 0 724 6 0 17
HG00115 0 713 22 0 12
HG00116 0 718 9 0 20

ADD COMMENT
0
Entering edit mode

this gives the below error:

/tmp/jvarkit8429619995082938580.tmp/BioAlcidaeJdkCustom821487873.java:26: error: illegal start of expression
sample2count[samples.get(i)]={
                         ^
/tmp/jvarkit8429619995082938580.tmp/BioAlcidaeJdkCustom821487873.java:27: error: not a statement
"nocall":0,
^
/tmp/jvarkit8429619995082938580.tmp/BioAlcidaeJdkCustom821487873.java:27: error: ';' expected
"nocall":0,
        ^
3 errors
[SEVERE][BioAlcidaeJdk]java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile
at 
com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$AbstractHandlerFactory.getConstructor(BioAlcidaeJdk.java:906)
at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$VcfHandlerFactory.execute(BioAlcidaeJdk.java:937)
at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.doWork(BioAlcidaeJdk.java:1377)
at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:777)
at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:940)
at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.main(BioAlcidaeJdk.java:1402)
  Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile
at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:266)
at 
  com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$AbstractHandlerFactory.getConstructor(BioAlcidaeJdk.java:898)
... 5 more
 Caused by: java.lang.RuntimeException: Cannot compile
at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.exec(OpenJdkCompiler.java:180)
at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:243)
... 6 more
ADD REPLY
0
Entering edit mode

you're using bioalcidaejdk not bioalcidae

ADD REPLY

Login before adding your answer.

Traffic: 1408 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6