Question

Genotype count in a VCF file

0

Entering edit mode

8.8 years ago

akang ▴ 120

From a VCF file how can i count how many samples have 0/0, 0/1, 1/1 genotypes on a given locus? Ill appreciate any help!

SNP vcf • 8.5k views

ADD COMMENT • link updated 8.8 years ago by Jorge Amigo 14k • written 8.8 years ago by akang ▴ 120

score 2 · Answer 1 · 2016-10-24

2

Entering edit mode

8.8 years ago

sacha ★ 2.4k

The following command counts occurence of genotype in a VCF file.

  cat test.vcf|grep -oE "([0-2]/[0-2])"|sort | uniq -c

You can filter for a specific locus before :

  cat test.vcf|awk '$1==CHROM && $2==POS {print $0}' | grep -oE "([0-2]/[0-2])"|sort | uniq -c

ADD COMMENT • link 8.8 years ago by sacha ★ 2.4k

0

Entering edit mode

I think the solution provided would tell total 0/1 or 1/0 count per vcf file. He wants how many samples have 0/0, 0/1, 1/1 genotypes on a given locus. So suppose there are 10 samples in a vcf file, then there would be 10 genotypes at each locus corresponding to 10 sample. So 0/0, 0/1, 1/1 distribution has to be wrt 10 samples.

ADD REPLY • link 8.8 years ago by Persistent LABS ▴ 750

0

Entering edit mode

It does work (it summarizes) but I assume OP wanted counts per line/variant.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

0

Entering edit mode

could it be possible to print the sample names as well?

ADD REPLY • link 7.2 years ago by hellbio ▴ 520

score 2 · Answer 2 · 2016-10-25

2

Entering edit mode

8.8 years ago

Jorge Amigo 14k

here is a simple perl alternative:

zcat ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
| perl -ane '
/^#/ and next;
%c = ();
foreach (@F[9..$#F]) { /^([^:]+)/ and $c{$1}++ }
print "$F[0]\t$F[1]";
foreach $gt (sort keys %c) { print "\t$gt:$c{$gt}" }
print "\n"
'

ADD COMMENT • link 8.8 years ago by Jorge Amigo 14k

Jorge Amigo · Answer 3 · 2016-10-24

1

Entering edit mode

8.8 years ago

Persistent LABS ▴ 750

I have not tried. But probably it will work. Let me know if it does not work.

grep -v '^#' temp.vcf | perl -lne '@x=split /\s+/,$_;$var11=grep{/1\/1/}@x[9..$#x]; $var01=grep{/0\/1/}@x[9..$#x];$var10=grep{/1\/0/}@x[9..$#x];print "$x[0]\t$x[1]\t$var01\t$var10\t$var11\n"'

Thanks

Priyabrata

Persistent LABS

ADD COMMENT • link updated 8.8 years ago by Jorge Amigo 14k • written 8.8 years ago by Persistent LABS ▴ 750

0

Entering edit mode

Although I haven't checked whether it's correct, it does work ;)

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

score 1 · Answer 4 · 2016-10-24

1

Entering edit mode

8.8 years ago

Pierre Lindenbaum 166k

using bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

cat ILLUMINA.wex.broad_phase2_baseline.20111114.both.exome.genotypes.1000.vcf | 
java -jar src/jvarkit/dist/bioalcidae.jar -F VCF -e 'while(iter.hasNext()) {var ctx = iter.next();m={};for(var i=0;i< ctx.getNSamples();++i) {var g=ctx.getGenotype(i);var t=g.getType().name();if(t in m) {m[t]++;} else {m[t]=1;};} out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)); }'

1 69453 {"NO_CALL":748,"HOM_REF":426,"HOM_VAR":3,"HET":5}
1 69486 {"NO_CALL":747,"HOM_REF":434,"HET":1}
1 69496 {"NO_CALL":749,"HOM_REF":429,"HET":4}
1 69511 {"NO_CALL":693,"HOM_VAR":392,"HET":72,"HOM_REF":25}
1 69534 {"NO_CALL":755,"HOM_REF":426,"HET":1}
1 69541 {"NO_CALL":761,"HOM_REF":420,"HET":1}
1 69552 {"NO_CALL":757,"HOM_REF":419,"HET":6}
1 69590 {"NO_CALL":776,"HOM_REF":404,"HET":2}
1 69610 {"NO_CALL":783,"HOM_REF":392,"HET":6,"HOM_VAR":1}
1 69635 {"NO_CALL":789,"HOM_REF":391,"HET":2}
1 69761 {"NO_CALL":886,"HOM_REF":276,"HET":13,"HOM_VAR":7}
1 69768 {"NO_CALL":911,"HOM_REF":270,"HET":1}
1 69897 {"NO_CALL":999,"HOM_VAR":131,"HET":24,"HOM_REF":28}
1 69968 {"NO_CALL":1152,"HOM_REF":29,"HET":1}
1 861275 {"HOM_REF":1181,"HET":1}
1 861292 {"HOM_REF":1179,"HET":3}

ADD COMMENT • link 8.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Pierre, could we print the sample names also in separate columns corresponding to the counts? BTW, i could not find the jar in the git, is it moved to a different location or name?

ADD REPLY • link 7.2 years ago by hellbio ▴ 520

0

Entering edit mode

, i could not find the jar in the git,

you need to compile the program.

could we print the sample names also in separate columns corresponding to the counts

yes , just loop over the m object. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/keys

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I tried getSampleName() in the print statement. It won't do what i need, it is just printing one name for all the variants. Seems there is a little more complicated solution that needs some intuition. Could you help.

out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)+" "+g.getSampleName())

ADD REPLY • link 7.2 years ago by hellbio ▴ 520