Greetings,
I have stumbled upon a problem with SNP discovery process with samtools/bcftools pipeline. We have introduced an N ambigious nucleotide in every position of our reference where at least one of our sample had variant site, so we could mapp also distant species and receive set of SNP for those N positions. Unfortunately we have not found and option to keep ambiguous sites with bcftools view option, only bcftools call option allows to keep such position. With second option there is a serious glitch in the program. Even ratios 1:100 between alternate alleles are considered to be ok and there is no option to filter them out, vcfutills filters out also those N position without any options to keep them. Thank you for your ideas.
Best regards
Jan Roslein
how about calling ALL positions of the ref (don't output only variants ) ?
Thank you for your answer. Unfortunately problem not with output of bcftools vcf format since it does write major, alternate and minor SNP in contrast to reference even if it is N, but heterozygosity is not solved (no likelihood or pVal). GATK's output gives you all positions but when reference == ! 'N' it considers it mistake (authors claim, current version does not support N in reference). I tried bcftools, GATK and freebayes nothing solves the problem...
You should try the RTG pipeline, we recently added support for calling at sites where the reference is N (either single positions or longer blocks). We do not compute QUAL or GL in cases where the reference allele is N (since these are not well defined in this case), but do compute GQ to score the likelihood of the call in comparison to the non-N hypotheses considered.
Wouldn't the simplest solution be to map to the raw reference instead of masking it with Ns?