how to count variants par sample per chromosome in a vcf file?
7
1
Entering edit mode
6.2 years ago
nagarsaggi ▴ 40

I want to count number of the variants called on each chromosome for each sample from a multi sample vcf file. Any help would be really appropriated. Thanks Ram

snp • 15k views
ADD COMMENT
1
Entering edit mode

Use vcfstats function from Rtgtools per sample stats

ADD REPLY
3
Entering edit mode
6.2 years ago
sacha ★ 2.4k

Split your vcf file by sample and count how many times chromosom appear in each file .

FILE=yourfile.vcf
for sample in `bcftools query -l $FILE`
do 
      bcftools view -c1 -H -s $sample -o ${sample}.vcf $FILE
      cat ${sample}.vcf |cut -f1|uniq -c > ${sample}.count
done
ADD COMMENT
4
Entering edit mode
6.2 years ago
JC 13k

You can use a bash command like:

gunzip -c myfile.vcf.gz | grep -v "#" | cut -f1 | uniq -c

Explanation:

  • gunzip -c myfile.vcf.gz -> decompress the VCF
  • grep -v "#" -> removes all header lines
  • cut -f1 -> keeps only the first column (which is the chromosome id)
  • uniq -c -> counts unique elements in a list
ADD COMMENT
0
Entering edit mode

Thank JC for the prompt reply, but with this commend, I get same number of the variants on each chromosome across the samples. The commend counts all the chromosome sites where a variants is called across the samples which would be same for all the samples. What I want is the number of the variants called on each chromosome in a individual sample.

ADD REPLY
0
Entering edit mode

you can extract samples per individual with VCFtools, then use the same strategy to count called variants

ADD REPLY
0
Entering edit mode

Yes, I tried this but still get same number of variant per site for all individual samples.

ADD REPLY
0
Entering edit mode

across the samples.

this was not specified in your original question. How are we supposed to know it ?

I updated the question.

ADD REPLY
2
Entering edit mode
5.1 years ago

With plink2:

plink2 --vcf <VCF filename> --out converted
plink2 --pfile converted --sample-counts --chr 1 --out chr1_results
plink2 --pfile converted --sample-counts --chr 2 --out chr2_results
...

The --sample-counts flag provides essentially the same information as bcftools stats -s, except the implementation is ~1000x as efficient.

ADD COMMENT
0
Entering edit mode

Thank you for the wonderful response above about using plink2.

I tried your code above, but I got the error that there is no such file converted.psam? How do I get psam file from the first command line plink2 --vcf <VCF filename> --out converted?

ADD REPLY
0
Entering edit mode

What happens if you add --make-pgen to the first command line?

ADD REPLY
0
Entering edit mode

This is by far the best approach to this problem. SOOO FAST!

By the way, to know the number of variants, we just have to sum columns "HOM_ALT_SNP_CT" and "HET_SNP_CT", right?

ADD REPLY
1
Entering edit mode
6.2 years ago

A one liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar bioalcidaejdk.jar  -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));'  otavirus_rf.vcf.gz

RF07    S4:1
RF07    S3:2
RF05    S2:4
RF07    S2:2
RF11    S4:1
RF03    S5:2
RF09    S1:1
RF09    S2:1
RF01    S5:1
RF09    S3:1
RF03    S2:4
RF03    S1:1
RF09    S5:1
RF05    S3:4
RF03    S4:2
RF07    S5:1
RF05    S4:2
RF03    S3:4
RF06    S3:1
RF06    S2:1
RF04    S1:3
RF06    S1:1
RF10    S4:2
RF10    S1:1
RF02    S4:1

details:

stream(). /* get a stream of variant */
    flatMap( 
        V->V.getGenotypes(). /* get the genotypes */
            stream(). /* convert to a stream of genotypes */
            filter(G->G.isCalled() && !G.isHomRef()). /* discard NO_CALL && HOM_REF */
            map(G->V.getContig()+"\t"+G.getSampleName()) /* convert to key=contig+sample */
        ).
    collect( 
        Collectors.groupingBy(Function.identity(), Collectors.counting()) /* count each key */
        ).forEach((K,C)->println(K+" : "+C)); /* print */
ADD COMMENT
1
Entering edit mode
5.1 years ago

Was searching for the same answer and sorta figured out an approach that works for me

Using a combo of bcftools and UNIX you can get the count of alternative positions per indvidual

bcftools query -i'GT="alt"' -f'[%SAMPLE %GT \n]' 1365_filtered.01.vcf.gz | awk '{print $1}' | sort | uniq -c > sample.counts

If you want to get this by chromosome you could include the "-r" parameter when running bcftools, or add '%CHROM\' to the -f parameter and use UNIX to fitler posthoc.

The nice thing about this approach is s you can explictly count the number of het/alt/hom/miss by chaing the "GT=" option.

More on bcftools here.

ADD COMMENT
1
Entering edit mode

Please remember that bcftools can be unpredictable with multi-allelic sites. For each bcftools command, one would need to understand how that command treats multi-allelic sites to ensure their results are accurate.

Run bcftools norm (or even better, vt-norm/vt-decompose) before using bcftools.

ADD REPLY
0
Entering edit mode
6.2 years ago
prasundutta87 ▴ 670

BCFtools stats with any one of the below two options is what you need..

 -s, --samples <list>               list of samples for sample stats, "-" to include all samples
 -S, --samples-file <file>          file of samples to include

Example:

bcftools stats -s - <multisample VCF file>
ADD COMMENT

Login before adding your answer.

Traffic: 2861 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6