From a VCF file how can i count how many samples have 0/0, 0/1, 1/1 genotypes on a given locus? Ill appreciate any help!
From a VCF file how can i count how many samples have 0/0, 0/1, 1/1 genotypes on a given locus? Ill appreciate any help!
The following command counts occurence of genotype in a VCF file.
cat test.vcf|grep -oE "([0-2]/[0-2])"|sort | uniq -c
You can filter for a specific locus before :
cat test.vcf|awk '$1==CHROM && $2==POS {print $0}' | grep -oE "([0-2]/[0-2])"|sort | uniq -c
here is a simple perl alternative:
zcat ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
| perl -ane '
/^#/ and next;
%c = ();
foreach (@F[9..$#F]) { /^([^:]+)/ and $c{$1}++ }
print "$F[0]\t$F[1]";
foreach $gt (sort keys %c) { print "\t$gt:$c{$gt}" }
print "\n"
'
I have not tried. But probably it will work. Let me know if it does not work.
grep -v '^#' temp.vcf | perl -lne '@x=split /\s+/,$_;$var11=grep{/1\/1/}@x[9..$#x]; $var01=grep{/0\/1/}@x[9..$#x];$var10=grep{/1\/0/}@x[9..$#x];print "$x[0]\t$x[1]\t$var01\t$var10\t$var11\n"'
Thanks
Priyabrata
Persistent LABS
using bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae
cat ILLUMINA.wex.broad_phase2_baseline.20111114.both.exome.genotypes.1000.vcf |
java -jar src/jvarkit/dist/bioalcidae.jar -F VCF -e 'while(iter.hasNext()) {var ctx = iter.next();m={};for(var i=0;i< ctx.getNSamples();++i) {var g=ctx.getGenotype(i);var t=g.getType().name();if(t in m) {m[t]++;} else {m[t]=1;};} out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)); }'
1 69453 {"NO_CALL":748,"HOM_REF":426,"HOM_VAR":3,"HET":5}
1 69486 {"NO_CALL":747,"HOM_REF":434,"HET":1}
1 69496 {"NO_CALL":749,"HOM_REF":429,"HET":4}
1 69511 {"NO_CALL":693,"HOM_VAR":392,"HET":72,"HOM_REF":25}
1 69534 {"NO_CALL":755,"HOM_REF":426,"HET":1}
1 69541 {"NO_CALL":761,"HOM_REF":420,"HET":1}
1 69552 {"NO_CALL":757,"HOM_REF":419,"HET":6}
1 69590 {"NO_CALL":776,"HOM_REF":404,"HET":2}
1 69610 {"NO_CALL":783,"HOM_REF":392,"HET":6,"HOM_VAR":1}
1 69635 {"NO_CALL":789,"HOM_REF":391,"HET":2}
1 69761 {"NO_CALL":886,"HOM_REF":276,"HET":13,"HOM_VAR":7}
1 69768 {"NO_CALL":911,"HOM_REF":270,"HET":1}
1 69897 {"NO_CALL":999,"HOM_VAR":131,"HET":24,"HOM_REF":28}
1 69968 {"NO_CALL":1152,"HOM_REF":29,"HET":1}
1 861275 {"HOM_REF":1181,"HET":1}
1 861292 {"HOM_REF":1179,"HET":3}
, i could not find the jar in the git,
you need to compile the program.
could we print the sample names also in separate columns corresponding to the counts
yes , just loop over the m
object. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/keys
I tried getSampleName() in the print statement. It won't do what i need, it is just printing one name for all the variants. Seems there is a little more complicated solution that needs some intuition. Could you help.
out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)+" "+g.getSampleName())
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I think the solution provided would tell total 0/1 or 1/0 count per vcf file. He wants how many samples have 0/0, 0/1, 1/1 genotypes on a given locus. So suppose there are 10 samples in a vcf file, then there would be 10 genotypes at each locus corresponding to 10 sample. So 0/0, 0/1, 1/1 distribution has to be wrt 10 samples.
It does work (it summarizes) but I assume OP wanted counts per line/variant.
could it be possible to print the sample names as well?