I have mutliple genome consensus sequences, extracted from a BWA BAM file using the following commandline:
samtools mpileup -E -q30 -Q30 -uf REF.fa isolate1.merged.bam | bcftools view -cg - |\
vcfutils.pl vcf2fq -d3 > isolate1.fq (which I convert to FASTA using a python script).
(Samtools-0.1.18 used)
Whether or not I invoke the -d option in vcfutils.pl (default is read-depth 3), sites with a read depth of 1 or 2 are still called, just lowercase instead! I'd rather the base at these sites was not called ie 'N'.
This means that for each bacterial genome, I have approximately 48000 sites where a base has been called with a low read depth that I'd consider to be low quality. This includes approximately 650 variant sites that could be false positive SNPs (if they are sequencing errors covered by 1 read only!) and therefore potentially skew our phylogenetic analysis.
Is it possible to convert these lowercase letters to 'N' ie an uncalled base? I can only find tools that convert lowercase to uppercase and viceversa, I can't find any way to convert these low quality letters to 'N'? Of course my FASTA file is a mixture of uppercase bases and lowercase!
Failing that, perhaps does anyone know of a way I could call the consensus sequence from either the BAM file or the VCF file I have (for all sites), where it does filter out for minimum read depth?
Any help much appreciated!
Best Wishes,
Laura
tr/// is better than s/// here
Masha, I agree, tr/// is faster for simple replacements than s///.
use tr/acgt/N/ instead of s/[acgt]/N/ in the code