Hi, I'm learning variant calling and working through an exercise in E. coli. I'm an amateur and need some help. I started with 6 FASTQ files for E. coli from a study in ENA, and I also downloaded the genome FASTA file for that particular strain. I did the following, in this order: FastQC runs, Trimmomatic, QC on the trimmed data, bwa index on the FASTA file, samtools faidx, samtools dict, aligning the trimmed data to the reference with bwa mem (which gives a SAM file), then converting, sorting, and indexing with samtools. After that I used GATK to mark duplicates and GATK to add or replace read groups.
Next I'm supposed to run GATK BaseRecalibrator, which needs a known-sites reference of polymorphisms in E. coli. This is a VCF file.
How am I supposed to get this file, or otherwise get to this step?
The E. coli strain I'm looking at is REL606.
With bacteria you can use a simple variant calling procedure (without the base recalibration). Use callvariants.sh from the BBMap suite as an alternative.
There are lots of pipelines and software for this, but as @genomax says, GATK is overly complex and not well suited.
Your pipeline sounds good so far. I'd recommend freebayes or samtools SNP calling, or else maybe Snippy as a pipeline.
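To make the alternatives above a bit more concrete, here is a minimal sketch that only builds and prints the command lines and does not run anything. The file names (rel606.fasta, sample.sorted.bam, the output VCFs) are hypothetical placeholders. It also assumes that "samtools SNP calling" is done with bcftools mpileup piped into bcftools call, that your bcftools release accepts --ploidy 1 as a haploid preset, and that samtools is on your PATH so BBMap can read the BAM. Check each tool's own help output before relying on the flags.

```python
import shlex

# Hypothetical file names; substitute your own reference and sorted, indexed BAM.
REF = "rel606.fasta"
BAM = "sample.sorted.bam"


def callvariants_cmd(ref, bam, out_vcf):
    """BBMap's callvariants.sh takes key=value arguments; ploidy=1 for a haploid genome."""
    return ["callvariants.sh", f"in={bam}", f"ref={ref}", f"vcf={out_vcf}", "ploidy=1"]


def freebayes_cmd(ref, bam):
    """freebayes writes VCF to stdout; -p 1 sets the ploidy to haploid."""
    return ["freebayes", "-f", ref, "-p", "1", bam]


def bcftools_cmds(ref, bam, out_vcf):
    """Two piped stages: pileup, then multiallelic calling (-m), variant sites only (-v)."""
    mpileup = ["bcftools", "mpileup", "-f", ref, bam]
    # Assumption: --ploidy 1 is accepted as a haploid preset by your bcftools version.
    call = ["bcftools", "call", "-m", "-v", "--ploidy", "1", "-o", out_vcf]
    return mpileup, call


def as_shell(*stages):
    """Join one or more argument lists into a printable shell pipeline."""
    return " | ".join(shlex.join(stage) for stage in stages)


if __name__ == "__main__":
    print(as_shell(callvariants_cmd(REF, BAM, "bbmap.vcf")))
    print(as_shell(freebayes_cmd(REF, BAM)) + " > freebayes.vcf")
    print(as_shell(*bcftools_cmds(REF, BAM, "bcftools.vcf")))
```

None of these three routes needs a known-sites VCF, which is why they sidestep the BaseRecalibrator step. Running two of the callers on the same BAM and comparing the results is a reasonable sanity check.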
Thank you, genomax! I'll explore BBMap. The reason for using GATK was that it was taught in class for human samples, and I was trying the same approach with E. coli.