I'd like to extract consensus of reads from a bam/sam file which would follow particular criteria:
- The portions of reference which have 0 coverage are represented in the consensus as N
- If there is at least single read mapping to a particular nucleotide, this nucleotide is retrieved in a consensus.
- If there is conflicting evidence from reads, the nucleotide with highest frequency (lets say 75%) is retrieved in consensus. If there is no nucleotide with >= 75% frequency, N is retrieved in the consensus.
Is there a tool which would enable this or similar level of fine-tuning of the consensus extraction from bam/sam?
What I tried: I used samtools, bcftools and vcfutils to get consensus:
samtools mpileup -uf reference.nt mapping.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fasta
However, the consensus I'm getting this way contains a substantial proportion of the reference sequence and I can't find parametres which could modify how the consensus is calculated.
Yet another approach - extracting consensus using IGV viewer (option Copy consensus sequence) gives me basically the consensus I'm looking for but there is apparently no way how to automatize this for hundreds of reference sequences which I have.
Note: I posted this question also as a comment to my original question Map eukaryotic genomic reads to transcript reference of closely related organism however I'm posting it here separately as it diverged from my original question.
Hi al-ash. I am looking for similar answer for your 3rd question but instead of N I just want to have reference base only. Could you figure how to do that? Perhaps you can share if you had found some way. Thanks.
Hi Prakki, that sounds a bit more suitable task for the combination of samtools and bcftools I gave above (after possibly tuning the parametres) - did you try to play with these tools?
Or you might give a try to e.g. FastaAlternateReferenceMaker (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php) which "replaces the reference bases at variation sites with the bases supplied in the corresponding callset records".
I do not know your exact requirements for the consensus but my main question was to make sure that actually no bases from reference are included in the consensus. On the other side, your requirements should be easier to fulfill by some of the tools above as they are designed to obtain this type of consensus you require.
Hi al-ash. Thanks for comment. Yes. I tried samtools and bcftools. But, could you manage to set the cut off of 75% frequency? The nearest parameter I could come across is
--max-maf 0.25
in the vcftools. Is that what you used as well for cutoff? In my case, I am okay with the reference included in consensus.Prakki, unfortunately I can't help you here since I was not testing the vcftools parametres. I did not yet manage to fine-tune the extraction of consensus, but as I wrote, that is mainly because I do not want to generate consensus which would have the nucleotide bases of reference at the position where no variant was called - I want to have there Ns whenever the identity of nucleotide at that position is well supported by read evidence. In your case, however, vcftools parametres, as you suggest, might solve your problem.
That's fine al-ash. Thanks for your comments! I am trying to figure out!