hi there, is there a way to get count of SNP, indels, CNVs etc from a VCF file, so some thing like
SNPs = ?
Insertions = ?
Deletions = ?
CNVs = ?
using simple linux commands
thanks, a
hi there, is there a way to get count of SNP, indels, CNVs etc from a VCF file, so some thing like
SNPs = ?
Insertions = ?
Deletions = ?
CNVs = ?
using simple linux commands
thanks, a
There are a couple of ways that variant type is annotated within a VCF file, so there are correspondingly a few ways to get close to what you want. Here's one choice that should work with most VCF files:
Use the vcftools
tool vcf-annotate
to fill in the variant type field:
zcat in.vcf.gz | vcftools_0.1.9/bin/vcf-annotate --fill-type > out.vcf
Then count up the variants by looking at the (newly-filled) TYPE field:
grep -oP "TYPE=\w+" out.vcf | sort | uniq -c
Or in one step that doesn't change the original VCF file:
zcat in.vcf.gz | vcftools_0.1.9/bin/vcf-annotate --fill-type | grep -oP "TYPE=\w+" | sort | uniq -c
On an example I had, this yielded:
3410 TYPE=del
4487 TYPE=ins
56744 TYPE=snp
1000 Genomes VCF files will be annotated in a finer-grained way (e.g. choices including DUP, INV, CNV, TANDEM, see here), but I'm not sure how to get their range of annotations from your own raw read data. However, if these distinctions are critical to you, that may be a useful direction to explore.
bcftools has a reporting tool that gives you this kind of information:
bcftools stats file.vcf > file.stats
if you want a fancy output, here is a development based on the top answer above with some gawk magic
cat <my.vcf> \
| vcf-annotate --fill-type \
| grep -v '^#' \
| gawk '
BEGIN{
FS="\t"; OFS="\t"
}
{
match($8, /TYPE=([^ ]+)/, arr);
type=arr[1];
len=sqrt((length($5)-length($4))^2);
cnt[type]++;
tot[type]+=len
}
END{
print"# differences between the query and reference include:"
printf "%s: %i (%i bases)\n", "snp", cnt["snp"], cnt["snp"];
printf "%s: %i (%i bases)\n", "del", cnt["del"], tot["del"];
printf "%s: %i (%i bases)\n", "ins", cnt["ins"], tot["ins"]
}'
rem: it is simplistic and may not work well if your VCF has overlapping alt calls (not tested!)
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This is so helpful! Thank you!
Great solution, after 8,5 years. Worked like a charm; zcat in.vcf.gz | vcf-annotate --fill-type > out.vcf