Number of samples and variants in .vcf file
2.2 years ago
bluebubbl19 ▴ 10

How can I count the number of variants (lines) and samples in a VCF file using simple bash commands (without vcftools, bcftools, GATK, or other packages)?

Can I use wc -l for variants? How can I get rid of the lines that do not represent variants? Is there a similar command that would give me the number of samples?

Each of my chromosomes is one file.

Thank you!

vcf • 2.8k views
2.2 years ago
4galaxy77 2.9k

The number of variants is given by

zcat in.vcf.gz | grep -v '^#' | wc -l

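Counting samples this way is non-trivial if the file doesn't contain genotypes, but if it does, try the following, which counts the sample columns of the first record that hold a diploid genotype ('/' for unphased, '|' for phased):

zcat in.vcf.gz | grep -v '^#' | head -n1 | cut -f 10- | tr '\t' '\n' | grep -c '[/|]'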

That said, I can't think of a reason not to just use bcftools stats, which is far more robust, simpler, and gives you much more information. Don't resort to regex tricks when a dedicated tool like bcftools is available.

2.2 years ago

number of variants:

zcat in.vcf.gz | grep -c '^[^#]'

number of samples:

zcat in.vcf.gz | grep -m1 "^#CHROM" | cut -f 10- | tr "\t" "\n" | wc -l
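
Since the question mentions one file per chromosome, here is a rough sketch of how the two counts could be wrapped in functions and summed across files. The function names (count_variants, count_samples, summarize_vcfs) are made up for illustration. It uses gzip -dc in place of zcat, on the assumption that this is a little more portable, and uses awk to count the sample columns so that a file with no sample columns should report 0.

```shell
# Count variant records: non-empty lines that do not start with '#'.
count_variants() {
    gzip -dc "$1" | grep -c '^[^#]'
}

# Count samples: columns after the 9 fixed columns of the #CHROM header line.
# A sites-only file (8 or 9 columns) is reported as 0.
count_samples() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; exit }'
}

# Print per-file counts for every file given as an argument,
# followed by the total number of variants across all of them.
summarize_vcfs() {
    total=0
    for f in "$@"; do
        n=$(count_variants "$f")
        s=$(count_samples "$f")
        printf '%s\t%s variants\t%s samples\n' "$f" "$n" "$s"
        total=$((total + n))
    done
    printf 'total\t%s variants\n' "$total"
}
```

With files named like chr1.vcf.gz, chr2.vcf.gz and so on (an assumed naming scheme), this would be called as summarize_vcfs chr*.vcf.gz. If all files come from the same callset, the sample count should be identical in each, so a differing number is worth a second look.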