Given the current obsession with variant studies, is anyone working on an aligner that circumvents the SAM->BAM->pileup->BCF->VCF pipeline, with its huge files and long processing time, and instead internally tabulates SNPs and indels and produces a single-sample VCF (and maybe a stack of oddball reads that indicate structural variants)?
Seems like this would have appeal in the clinic or other areas where people are not eager to revisit the alignment.
It seems that there are two issues standing in the way of such an implementation:
I'm presuming that you want to start with fastq files. There is no sequence storage container that I know of that stores only alignment data while leaving the fastq data untouched. Such a container could be used to circumvent the creation of a SAM file: instead, you would get a comparatively modest file that just indexes the fastq file(s) and maps reads to genome coordinates. You need something like this because you can't call variants until all of the reads have been mapped.
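A hypothetical record layout for such a container (purely illustrative; I know of no tool that writes this) could be as simple as one line per read, pointing back into the fastq rather than copying the sequence:

    # fastq_file   byte_offset   chrom   pos        strand   mapq   cigar
    reads_1.fq     1048576       chr1    10468230   +        37     100M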
You still have to map the reads. This takes some time, but after the mapping step you can just stream the reads through an SNV and indel caller. The SAM->BAM->pileup->BCF->VCF pipeline you mention is not how it has to work in practice. In the real world you can get away with SAM | pileup | uncompressed BCF | VCF, where the | is a Unix pipe, so nothing but the final VCF ever needs to hit the disk.
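For example, a sketch of that streamed pipeline with current tools (assuming bwa, samtools, and bcftools are installed, ref.fa is indexed, and your versions accept these flags):

    # Stream from fastq to compressed VCF with no intermediate files on disk.
    # samtools sort buffers in memory (-m) and only spills to disk if it must.
    bwa mem ref.fa reads_1.fq reads_2.fq \
      | samtools sort -m 4G -O bam - \
      | bcftools mpileup -f ref.fa - \
      | bcftools call -mv -Oz -o sample.vcf.gz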
Aside from the issue of duplicating your data when mapping reads into SAM format, the tools already exist to do what you suggest. I do agree that clinical labs and infrequent users of these types of tools would benefit from a packaged workflow that would allow this.
Alignment takes up about 25-35% of the total time from fastq to annotated vcf. The tabulation is similar to a pileup structure in memory. The question is how much memory the bare-minimum exome pileup would take such that you could avoid creating intermediates. Right now I think most variant pipelines are spending an inordinate amount of time writing to disk.
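Back of the envelope, hedged: assuming a ~30 Mb exome target and four 16-bit base counters per position, that is 30e6 positions x 8 bytes ≈ 240 MB, and even doubling it for indel bookkeeping keeps the bare-minimum histogram comfortably in RAM.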
If the concern is whether we can hold enough of the pileup-type structure in memory to properly call indels, I would say that holding a sliding window of about a megabase would be sufficient for now. What occurs to me is that, in combination with a file format such as what I proposed in my first point, you could write a genome walker similar in scope to what GATK provides for SAM. At that point we could be talking about a two-step process: (1) map the reads, but produce only an index in memory; (2) scan through the reads in indexed order, producing a histogram of base calls with a breadth larger than any indel you expect to call. There may be an issue with the amount of time you'll spend seeking while reading the mapped reads, but that cost must also be incurred with an unsorted SAM.
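A rough sketch of step (2), using samtools sort as a stand-in for the in-memory index so the reads arrive in coordinate order (it counts read starts only; a real caller would walk each read's CIGAR to tally per-base calls, and would use a ring buffer rather than this O(window) scan per read):

    # Walk a coordinate-sorted SAM stream, holding only a ~1 Mb window of
    # per-position counts in memory; positions print in arbitrary order.
    bwa mem ref.fa reads_1.fq reads_2.fq \
      | samtools sort -O sam - \
      | awk 'BEGIN { W = 1000000 }
             /^@/ || $3 == "*" { next }
             $3 != cur { for (p in depth) print cur, p, depth[p]
                         delete depth
                         cur = $3 }
             { pos = $4 + 0
               depth[pos]++
               for (p in depth)
                 if (p + 0 < pos - W) { print cur, p, depth[p]; delete depth[p] } }
             END { for (p in depth) print cur, p, depth[p] }'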
The main issue here is sorting the bam file: the later steps have to wait until the alignment is complete so that the sort can finish. Furthermore, samtools sort will dump reads to disk if they exceed the memory limit given by -m. However, this should be the only disk output other than the final vcf.gz file.
This is a general problem, though: variant calling requires all reads that map to a locus, which requires going through all the reads. If all reads fit in memory (believable for the exome case), then you're okay. Otherwise, and in more sophisticated cases like multi-sample calling, you're forced to go to disk because of the sheer amount of data.
Can we do better? The most direct approaches are filtering at the read level or filtering at the genome level. Some ideas:
Attempt to call variants only in "interesting regions" (defined by the application, or perhaps near existing dbSNP variants). Reads mapping elsewhere could be quickly filtered out before bwa with, say, a Bloom filter, or discarded immediately after bwa output so that the samtools sort buffer stays small. The latter can be accomplished by passing a BED file of targeted regions to samtools view (first sketch below).
Parallelize the operation by region or chromosome. Related to the first idea, this would involve filtering between bwa and samtools sort. You could start an instance of the pipeline for each chromosome, handing every instance all of the reads but having each keep a different region: chr1, chr2, and so on. Again, the goal is to keep each samtools sort buffer small and avoid writing to disk (second sketch below).
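The BED-based version might look like this (targets.bed is whatever your application defines as interesting; -u keeps the stream as uncompressed BAM, and -L restricts output to reads overlapping the BED):

    # Drop off-target reads right after alignment so the sort buffer only
    # ever holds reads overlapping targets.bed.
    bwa mem ref.fa reads_1.fq reads_2.fq \
      | samtools view -u -L targets.bed - \
      | samtools sort -m 4G -O bam - \
      | bcftools mpileup -f ref.fa - \
      | bcftools call -mv -Oz -o sample.vcf.gz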
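And a sketch of the per-chromosome variant (hedged: each instance re-aligns the same reads, so you trade redundant alignment for small sort buffers; here a simple awk filter on the RNAME column plays the role of samtools view):

    # One pipeline instance per chromosome, run in parallel.
    for chrom in chr1 chr2 chrX; do    # ...one job per chromosome
        bwa mem ref.fa reads_1.fq reads_2.fq \
          | awk -v c="$chrom" '/^@/ || $3 == c' \
          | samtools sort -m 1G -O bam - \
          | bcftools mpileup -f ref.fa - \
          | bcftools call -mv -Oz -o "out.$chrom.vcf.gz" &
    done
    wait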
Thanks, I think we are all familiar with pipes, but what I imagine involves forking an open-source aligner to do mapping and immediate tabulation, so we take the rigmarole of SAM/BAM out of the picture entirely and deal with a compact in-memory pileup histogram. The read data is tossed immediately. This would address the fact that most variant-calling analyses don't give a damn about reads or alignments other than for base quality, phasing, and MAPQ; those can be addressed in time. I think a direct approach might be attractive in this limited application.
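As a crude stand-in for "map and immediately tabulate" (read-start counts only, not a CIGAR-aware pileup; in the real version this tally would live inside the aligner itself):

    # Tabulate a histogram straight off the aligner's SAM stream; no SAM/BAM
    # touches disk, and each read is discarded as soon as it is counted.
    bwa mem ref.fa reads_1.fq reads_2.fq \
      | awk '!/^@/ && $3 != "*" { count[$3 "\t" $4]++ }
             END { for (k in count) print k "\t" count[k] }' \
      > readstart_histogram.tsv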