Hi,
I have converted my sorted, aligned BAM file (paired-end short reads aligned to a reference) to a consensus FASTA file using the samtools mpileup, vcfutils and seqtk pipeline, but I have some non-standard nucleotides (i.e. not A, C, G, T) in my FASTA file. I assume that these are artifacts of the BAM file. For instance, I have normal nucleotide sequence, and then I'll have a 'W' or a 'K'. Do these follow the IUB/IUPAC nomenclature, or are they something different?
Thanks
I actually wonder if those were inserted downstream of the BAM file, likely by vcfutils or seqtk. In order to have IUPAC codes in a BAM file, either (1) the original read had to contain that code or (2) the aligner had to produce it. Both of those seem unlikely.
Very good point, thank you. I'll investigate the conversion process.
It seems it happens in the conversion of the BAM to the fastq. I guess this could be due to SNPs/ploidy, where the consensus step can't decide on a single nucleotide for a position.
It looks like vcfutils is adding the IUPAC codes then: Questions Regarding Consensus Sequence Calling With Samtools / Bcftools / Vcfutils.Pl
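If it helps to check how many of these codes ended up in the consensus, here is a minimal Python sketch that tallies IUPAC ambiguity codes in FASTA text. The code table is the standard IUPAC one (W = A or T, K = G or T); the function name `count_ambiguity_codes` and the example sequence are made up for illustration and are not part of samtools, vcfutils or seqtk.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_ambiguity_codes(fasta_text):
    """Count IUPAC ambiguity codes in FASTA-formatted text (hypothetical helper)."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip header lines
        for base in line.strip().upper():
            if base in IUPAC:
                counts[base] += 1
    return counts

# Made-up sequence, only to show the output format.
example = ">contig1\nACGTWACGKTTN\n"
for code, n in sorted(count_ambiguity_codes(example).items()):
    print(code, "=", "/".join(IUPAC[code]), "count:", n)
```

To run it on a real file, read the file contents into a string and pass that to the function. A high count of two-base codes such as W or K would be consistent with the SNP explanation above, though it does not prove it.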
That is incredibly handy, thanks for posting that link!
No problem, you eliminated the other likely possibility!