Hi,
I am trying a variant calling pipeline with BWA and Samtools. Somehow when I have more sequences to align to references the output is less variant and I lose the Indels calls (even though the sequences carrying them are still present in the extended list). Could it be due to default parameters only calling variants if they are represented by a minimum % of the population?
Here is my pipeline:
Index:
bwa index ref_file
Align:
bwa mem ref_file all_genomes_file > alignement_file.sam
SAM to BAM:
samtools view -S -b s alignement_file.sam > alignement_file.bam
Sort:
samtools sort alignement_file.bam -o sorted.bam
Variant calling:
bcftools mpileup -B -f sorted.bam | bcftools call -mv --multiallelic-caller --variants-only > var.raw.vcf
the output of bwa can be piped into samtools sort
Not only that sort can read SAM, so there is no point in converting SAM to BAM first (especially when you're compressing the BAM only to immediately decompress it).
It's sad that this meme of pointless format conversions still exists in workflow blogs all over the place. I wonder how many CPU cycles get wasted.
True. As a side note don't forget that both commands would share the same cpu, therefore if you have say 8 cores available and you use a command like this one
then both commands would compete for the same cores. In that case it would be wiser to divide them like this