Question

Count number of occurrences an annotation in a multisample vcf split into multiple chunks

1

Entering edit mode

3.4 years ago

tacrolimus ▴ 140

Dear Biostars,

I have a NGS data for thousands of people merged into a multisample vcf which has been split into genomic chunks (roughly 1000 chunks). These have been annotated with VEP. One of those annotations is the LOFTEE "high confidence" marker.

I want to count per sample the number of LOFTEE "HC" hits there are per sample across all chunks so the final output would simple be the sample ID and a number representing the total number of high confidence markers that person has.

I have tried:

bcftools view -S $sample_list | \
bcftools +split-vep \
-d \
-f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%Existing_variation\t%LoF\n' \
-i "LOF='HC'" \
-o output_${i}.txt

where $sample file is an array over all the chunks but I am not getting it per sample ID.

Any help would be greatly appreciated.

bcftools vcf annotation • 1.2k views

ADD COMMENT • link 3.4 years ago by tacrolimus ▴ 140

score 3 · Accepted Answer · 2021-07-13

Correct me if I'm wrong, but the LOFTEE annotation depends on the variant and not on the genotype, therefore the LOF='HC' annotation would be inside the INFO column. If that'd the case, I'd suggest a search+split+count strategy rather than a split+search+count one.

Firstly extract LOF='HC' variants to a new multisample file

bcftools view -i ''LOF='HC'" -Oz -o multisample.LOF_HC.vcf.gz multisample.vcf.gz

and secondly split that new file by sample and count private variants in it

for sample in `bcftools query -l multisample.LOF_HC.vcf.gz`; do
 echo -n "$sample "
 bcftools view -c1 -s $sample multisample.LOF_HC.vcf.gz | grep -cv ^#
done > samples.txt

Note that this code is not tested, and that the split loop can take quite a while if the number of samples is large. If that'd be the case, then I'd go for a programmed solution by parsing the file directly:

zgrep -v ^## multisample.vcf.gz \
| perl -lane 'if (/^#/) {
 foreach (@F) { push @h, $_; $s{$_} = 0 }
} elsif ($F[7] =~ /LOF='HC'/) {
 for ($i = 9; $i <= $#F; $i++) {
  ($gt = $F[$i]) =~ s/:.+//;
  if ($gt =~ /[1-9]/) { $s{$h[$i]}++ }
 }
} END {
 foreach (@h[9..$#h]) { print "$_\t$s{$_}" }
}' > samples.txt

(This code has been successfully tested on 1000g multisample files)