Question

Bacterial SNP calling pipeline with SNP and indels confirmation

8.2 years ago

biotech ▴ 570

The goal of my experiment is to identify SNPs and gaps (indels) in a set of 4 bacterial phenotypic variants relative to a reference.

Several posts here already ask about SNP calling, and many approaches have been suggested:

Best Practice On Variant Discovery For Bacteria?, Variant Discovery In Bacteria, SNP calling from multiple bacterial genomes

Any of them would be appropriate as long as it identifies these features. What we really need is to confirm the SNPs and indels before doing PCR validation. We thought about doing this visually with IGV or a similar tool. We need to see which base each strain has at the SNP position, since an apparent SNP could be a false positive caused by a sequencing error in the reference (we have a second reference strain that is also non-variant and would be used as a control).
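To make the confirmation logic concrete, here is a minimal sketch of the per-site comparison, assuming the base each strain shows at the position has already been read off (from IGV, a pileup, or a VCF). The function name `classify_site` and the strain labels are hypothetical and only for illustration.

```python
def classify_site(ref1_base, calls, control="reference2"):
    """Classify one candidate site.

    ref1_base: base in reference 1 at this position.
    calls: dict mapping strain name -> base observed in that strain.
    control: name of the non-variant control strain (assumed label).
    """
    # Strains whose base disagrees with reference 1.
    differing = {s: b for s, b in calls.items() if b != ref1_base}
    if not differing:
        return "no variant"
    # Every strain, including the control, shows the same alternative base:
    # reference 1 itself is the likely outlier (e.g. a sequencing error).
    if len(differing) == len(calls) and len(set(differing.values())) == 1:
        return "possible reference 1 error"
    # The control is supposed to be non-variant, so a difference there
    # needs a manual look before any PCR validation.
    if control in differing:
        return "differs in control: check manually"
    return "candidate variant in " + ", ".join(sorted(differing))


site = {"reference2": "A", "variant1": "G", "variant2": "A",
        "variant3": "A", "variant4": "A"}
print(classify_site("A", site))
```

Only sites labelled as candidate variants would then go forward to PCR; the other labels are a prompt to inspect the alignments by eye.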

Inputs we have

PacBio, assembled and annotated reference 1
Illumina assembled and annotated reference 2, variants 1-4

What I am asking about here is the suggested methodology and how to approach the confirmation step.

Thanks, Bernardo

bacteria SNPs Indels • 4.7k views

updated 8.2 years ago by Calvin ▴ 80 • written 8.2 years ago by biotech ▴ 570

score 1 · Answer 1 · 2016-09-10

For bacteria, to my knowledge, there is no established or well-defined methodology. The "best" method usually emerges only after you have explored your data and tried several approaches (at least in my case), with the aim of recovering as many true-positive SNPs or SVs as you can. Whatever method you use, it is important to find a way to verify your calls and show why the true positives are convincing. I am not clear on what exactly your input is: do you mean that each of variants 1-4 has both PacBio and Illumina reads?

For SNP calling, I align with both Bowtie2 and BBMap to get SAM files, convert them to BAM, mark duplicates with Picard MarkDuplicates, sort with samtools, and finally use samtools to add BAQ scores to each BAM file. I then use freebayes (worth trying) to generate a VCF file for each BAM and do some filtering. You can use bcftools isec to keep the calls that both alignment tools agree on for each of the four variants, and to extract the SNPs unique to each variant. You can then run the remaining VCF files through SnpEff to find SNPs that cause an amino acid change, which further cuts down the number of SNPs.
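The intersection step can be sketched in plain Python to show the set logic that bcftools isec would perform on real files. This is one reading of the step (keep calls both aligners agree on, then keep what is private to each variant), not a replacement for bcftools; the helper names `load_sites` and `consensus_and_unique` and the toy records are assumptions for illustration.

```python
def load_sites(vcf_text):
    """Return the set of (chrom, pos, ref, alt) records in VCF-formatted text."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def consensus_and_unique(calls_by_strain):
    """calls_by_strain: strain -> {aligner name: VCF text} (at least one aligner).

    Returns (consensus, unique):
      consensus[strain] = calls made from every aligner's alignment
      unique[strain]    = consensus calls absent from all other strains
    """
    consensus = {
        strain: set.intersection(*(load_sites(t) for t in per_aligner.values()))
        for strain, per_aligner in calls_by_strain.items()
    }
    unique = {}
    for strain, sites in consensus.items():
        others = set().union(*(s for n, s in consensus.items() if n != strain))
        unique[strain] = sites - others
    return consensus, unique


# Toy input: two strains, two aligners each (made-up records).
calls = {
    "variant1": {
        "bowtie2": "#CHROM\tPOS\tID\tREF\tALT\nchr\t100\t.\tA\tG\nchr\t200\t.\tC\tT\n",
        "bbmap": "chr\t100\t.\tA\tG\nchr\t300\t.\tG\tA\n",
    },
    "variant2": {
        "bowtie2": "chr\t100\t.\tA\tG\nchr\t500\t.\tT\tC\n",
        "bbmap": "chr\t100\t.\tA\tG\nchr\t500\t.\tT\tC\n",
    },
}
consensus, unique = consensus_and_unique(calls)
print(unique)
```

Calls supported by only one aligner drop out at the consensus stage, which is the point of running two aligners in the first place.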

The protocol above is roughly how I handle my own bacterial data. It is not the ordinary way to do variant calling, because I only have Illumina sequence data and, for technical reasons, the variants have very low read coverage (most bases are covered by a single read), so I can't rely on per-base read depth or QUAL for filtering.

In any case, it is always good to start with a hybrid de novo assembly for each variant, as this may help you find structural variation and may let you double-check the SNPs you found earlier. You can then use Mauve or ACT to visualize the structural variation.

Here is one paper you might find useful.