Hello All,
I am currently writing some cancer genomics software for crunching on NGS data. I have extracted groups of reads where each read from each group shares a 30bp exact string (sequence) match, with every other read within that group. I also know the position of this match for each read. As such, with these two pieces of information, I already posses the information to align all the reads. I now want to build a consensus sequence, aligned at this position. Does anyone know of software (that I can call from a C++ program, either from inside the program, or externally via command line call) that will allow me to do this? I get the feeling that SamTools is commonly used, namely the program mpileup. However, from their example
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
It looks like the tool requires me to provide a reference genome to align the reads ref.fa
. However, the aln.bam
file should already contain the alignment information. So, I'm a little confused as to why ref.fa
is needed. Does anyone know if this software can be used without a reference? I.e just provide the reads, align against themselves and output the consensus?
It looks like I may have to generate a .bam
from the reads to do this if I wish to use mpileup
, but seeming as I already know the alignment, I may have to generate this artificially.
If anyone could shed some light about the best software to use, when trying to build consensus seq. with data where alignment information is readily acquirable that would be of great help!
Best, Izaak
Samtools by design is a reference-guided consensus tool. What you are looking for is a de novo assembler. Try spades, velvet, abyss, sga or fermi-lite. As a side note, you are really talking about two distinct types of alignments: read-to-reference alignment and read-to-read alignments. Make sure you understand their differences.