I have genomic reads of a resequencing project of a bacterial genome and want to determine the differences between the reference genome and our genome (which should be very similar) and get the corrected sequence of the genome. To generate a VCF file I did something like this:
samtools \
mpileup \
-B \
-u \
-f example_reference_seq.fa \
example_read_mappings.bam \
| bcftools view -cv - \
> my_snps.vcf
There are some variations listed in the VCF file with a high quality score (some over 100) and a visual inspection of the BAM file in a genome browser confirms these variations. Some are in nearly every read covering the considered position so would be expected candidates for a changed nucleotide. The next step was to generate a consensus sequence based on the reference sequence and the mappings in which the differences are taken into account.
samtools \
mpileup \
-B \
-u \
-f example_reference_seq.fa\
example_read_mappings.bam | \
bcftools view -cg - | \
vcfutils.pl vcf2fq \
> example_consensus.fq
seqtk fq2fa -l 70 example_consensus.fq > example_consensus.fa
The outcome of this call contained some surprises. I cannot completely follow how vcfutils.pl vcf2fq performs the consensus calling and the my Perl knowledge is a little bit rusty (plus the variable names are not really helpful). There are locations in the resulting consensus sequence where the nucleotides are changed to lower case (I think to show lower confidence). There is a change from "C" in the reference genome to "S" in the new consensus sequence. Looking at the position in the a genome browser show that 65 % C and 35% G (rest - other). I found in the code of vcfutils.pl in the function vcf2fq the following hash but I don't know why this "S" is introduced.
my %het = (AC=>'M', AG=>'R', AT=>'W', CA=>'M', CG=>'S', CT=>'Y',
GA=>'R', GC=>'S', GT=>'K', TA=>'W', TC=>'Y', TG=>'K');
Additionally I don't understand which entries are used to perform changes in the references sequence. Judging from the code the "AF1" and the "FQ" flag values are critical. FQ ("Phred probability of all samples being the same") makes sense to me but I don't understand it completely (it seem make a difference if is positive of negative).
So what is my question/my problem? I would appreciate if somebody could shed light on the processes of the consensus generation. Which entries in the VCF file would correspond to changes in the consensus sequences in comparison to the reference sequence. Why is the "S" introduced and what would be the usual procedure (e.g. replacing by C or G) to make the sequence usable further steps.
Thanks, Martjin. I was aware of the minimal (and maximal) depth parameter but not that I can use the -s parameter for the purpose of defining the ploidy. I will give it a try.