How to count SNPs, InDels
3
3
Entering edit mode
9.6 years ago
chchl7 ▴ 40

I have used Bowtie 2 to align my reads to reference genome and Samtools to call variances (SNPs and InDels). Does anyone know how to count how many SNPs and InDels I got? Thanks!

next-gen sequencing snp genome • 13k views
ADD COMMENT
0
Entering edit mode

in a BAM or in a VCF ?

ADD REPLY
4
Entering edit mode
9.6 years ago
Vivek ★ 2.7k

You could write a quick awk line ignoring multi-allelic loci:

SNPs:

awk '! /\#/' variants.vcf | awk '{if(length($4) == 1 && length($5) == 1) print}' | wc -l

Indels:

awk '! /\#/' variants.vcf | awk '{if(length($4) > 1 || length($5) > 1) print}' | wc -l

but for something more comprehensive, use bcftools stats.

ADD COMMENT
1
Entering edit mode

Alex's solution is faster and better.

ADD REPLY
0
Entering edit mode

Thank you very much! The scripts are working well. How can I use awk to count Insertions and deletions separately? Thank you!

ADD REPLY
1
Entering edit mode

insertions:

awk '! /\#/' variants.vcf | awk '{if(length($4) > 1 ) print}' | wc -l

deletions:

awk '! /\#/' variants.vcf | awk '{if(length($5) > 1) print}' | wc -l
ADD REPLY
0
Entering edit mode

Wrong. If the variation is REF=A ALT=T,G or if the variation is symbolic, e.g., ALT=<cnv100>

ADD REPLY
0
Entering edit mode

The post above said ignoring multi-allelic. So I am assuming it's biallelic.

ADD REPLY
0
Entering edit mode

your awk script is wrong for multi-allelic sites: like `REF=A ALT=T,G`

ADD REPLY
3
Entering edit mode
9.6 years ago

Via BEDOPS convert2bed:

$ vcf2bed --snvs < foo.vcf | wc -l
$ vcf2bed --insertions < foo.vcf | wc -l
$ vcf2bed --deletions < foo.vcf | wc -l
ADD COMMENT
0
Entering edit mode
4.3 years ago

You can also use the summary from bcftools stats to answer this question.

http://samtools.github.io/bcftools/bcftools.html#stats

ADD COMMENT
0
Entering edit mode

Hi Charles,

I crossed your post about deepvariant caller and think you can advise me on a bioinformatic pipeline for analyzing genetic data. We generated DNA sequencing by using TrueSight Cancer 94, Illumina, and were not sure how we should analyze them, from aligment to variant calling to annotate to variant classification, and how filter variant. I would appreciate if you can help me. Thanks for your time and consideration!

Chinh

ADD REPLY
0
Entering edit mode

This might help you!

ADD REPLY
0
Entering edit mode

Hi - I think you meant this as a comment for my other answer.

Also, I think you might be referring to this blog post:

http://cdwscience.blogspot.com/2019/05/precisionfda-and-custom-scripts-for.html

My sense is that it helps to be able to visually inspect variation calls, and I am not sure I would be comfortable with a result if you couldn't.

I have some sense that you might need to take some extra caution with targeted gene panels, which I think is a slightly different blog post (even though it is relevant to the idea that I thought the default set of DeepVariant calls could benefit from some additional filtering, kind of like the unfiltered GATK calls):

http://cdwscience.blogspot.com/2019/06/general-comments-on-low-frequency.html

However, to be honest, I don't have specific advice for the TrueSight Cancer 94 gene panel.

ADD REPLY

Login before adding your answer.

Traffic: 2105 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6