Hello All,
I'm losing my mind, and I'm sure it's because I'm misunderstanding something or not doing something correctly. The project background: I'm trying to get a consensus sequence for a variant virus. It's a small coronavirus, only ~27.6 kb, and we generated the data on an Ion Torrent.
The steps I have done so far (commands sketched out below the list):
1. Aligned the reads to the host genome and kept only the reads that did not map (host depletion).
2. Used the FASTX-Toolkit to keep only the high-quality reads (average quality of Q33 across the whole read).
3. Used TMAP mapall to align to the RefSeq genome from GenBank, at the lowest stringency level. This gave me a complete genome map.
4. Ran samtools mpileup -uf ref.fa aln.sorted.bam | bcftools call -mv -Oz -o calls.vcf.gz to make the VCF file.
5. Indexed with tabix (tabix -p vcf calls.vcf.gz).
6. Ran bcftools consensus -f ref.fa calls.vcf.gz > viral_cns.fa.
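For completeness, here is the whole pipeline spelled out the way I ran it. Filenames (host_aln.bam, ref.fa, nonhost.fastq, etc.) are placeholders, and the FASTX and TMAP lines are reconstructions from memory, so treat this as a sketch rather than exact commands:

# 1. host depletion: keep only the reads that did NOT map to the host
samtools view -b -f 4 host_aln.bam | samtools fastq - > nonhost.fastq
# 2. quality filter (note: fastq_quality_filter thresholds per-base
#    quality rather than a true per-read average)
fastq_quality_filter -q 33 -p 100 -i nonhost.fastq -o nonhost.hq.fastq
# 3. map to the viral RefSeq with TMAP (flags from memory - check tmap mapall --help)
tmap index -f ref.fa
tmap mapall -f ref.fa -r nonhost.hq.fastq stage1 map1 map2 map3 > aln.sam
samtools sort -o aln.sorted.bam aln.sam
samtools index aln.sorted.bam
# 4-6. call variants, index the VCF, and build the consensus
samtools faidx ref.fa
samtools mpileup -uf ref.fa aln.sorted.bam | bcftools call -mv -Oz -o calls.vcf.gz
tabix -p vcf calls.vcf.gz
bcftools consensus -f ref.fa calls.vcf.gz > viral_cns.fa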
Then, to double-check my work, I used the generated consensus as the new reference, with stringency set to the default. The idea being that if everything had been done correctly I would get a few SNPs, but not many.
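Concretely, the re-check was along these lines, where remap.sorted.bam is the reads remapped against the consensus (counting the non-header lines of the VCF gives a quick tally of how many variants get called):

samtools faidx viral_cns.fa
samtools mpileup -uf viral_cns.fa remap.sorted.bam | bcftools call -mv | grep -vc '^#'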
The output I get from this is the same as the low-stringency output when I look at it in IGV: LOADS of SNPs.
My question is: why is this happening? What am I not understanding about this process? Most of these variants are within 5% of each other over the whole genome, so this shouldn't be that hard.
Thank you in advance.
Sean
Align the consensus sequence against the original sequence and investigate that - bwa mem or lastz can do it.
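A minimal sketch with bwa mem, using the filenames from your steps (load ref.fa plus the resulting BAM in IGV afterwards to see exactly where the consensus differs from the reference):

bwa index ref.fa
bwa mem ref.fa viral_cns.fa | samtools sort -o cns_vs_ref.bam -
samtools index cns_vs_ref.bam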
That does show the SNPs being where they were called. Thank you :)