Unless CG has changed their pipeline substantially, CG alignments consist of two main parts: the initial mapping and the assembly around variants (called "evidence", as I remember). Initially CG only provided a tool to convert the mapping to SAM, but I think they have recently implemented another tool to convert the assembly as well. Broad once tried GATK on the mapping (the first part). The resulting SNP calls were reasonable, but not as good as CG's own calls. I do not know whether including the second part would greatly improve the accuracy; I have not seen such an experiment.
CG has done a very impressive job on SNP/indel calling (I am not experienced enough to comment on SVs), especially given their short and fragmented reads. Nonetheless, with much longer HiSeq reads, I think Illumina pulls ahead, at least in comparison to one of the CG call sets made a year ago. CG is improving their sequencing machines and may have improved the variant caller since then. I do not have an updated comparison between the platforms.
For now, my major concern about the CG SNP calls is that they have overestimated the regions of the genome in which they can make calls and, as a result, have underestimated heterozygosity. In comparison to HiSeq, CG calls 5% fewer SNPs due to shorter reads, but claims to make calls in ~95% of the genome (I could be wrong about the exact percentage, but it should be close), similar to the percentage from the HiSeq data set. With about the same callable fraction but 5% fewer SNPs, the CG estimate of heterozygosity (SNPs per callable base) comes out lower. The HiSeq heterozygosity matches the existing publications reasonably well, so this means CG has underestimated it.
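To make the arithmetic behind that argument concrete, here is a minimal sketch. The genome size and the HiSeq SNP count are made-up round numbers for illustration only; the only figures taken from the paragraph above are "5% fewer SNPs" and "~95% callable". It also treats the SNP count per callable base as a simple proxy for heterozygosity, which is a simplification.

```python
def snp_density(num_snps, genome_size, callable_fraction):
    """SNPs per callable base, used here as a rough proxy for heterozygosity."""
    return num_snps / (genome_size * callable_fraction)


# Hypothetical round numbers, NOT taken from any real call set.
GENOME_SIZE = 3_000_000_000
HISEQ_SNPS = 3_000_000

# From the text: CG calls 5% fewer SNPs, both claim ~95% of the genome callable.
CG_SNPS = round(HISEQ_SNPS * 0.95)
CLAIMED_CALLABLE = 0.95

hiseq_rate = snp_density(HISEQ_SNPS, GENOME_SIZE, CLAIMED_CALLABLE)
cg_rate = snp_density(CG_SNPS, GENOME_SIZE, CLAIMED_CALLABLE)

# Same denominator, fewer SNPs: the CG estimate is lower by the same 5%.
ratio = cg_rate / hiseq_rate

# If the true per-base rate is the HiSeq one, this is the callable fraction
# that would be consistent with the number of SNPs CG actually calls.
implied_cg_callable = CG_SNPS / (hiseq_rate * GENOME_SIZE)

print(f"HiSeq SNPs per callable base: {hiseq_rate:.6e}")
print(f"CG SNPs per callable base:    {cg_rate:.6e}")
print(f"CG / HiSeq ratio:             {ratio:.4f}")
print(f"Implied CG callable fraction: {implied_cg_callable:.4f}")
```

Under these assumptions the CG rate is 0.95 of the HiSeq rate, and the callable fraction consistent with the CG SNP count would be about 90% rather than the claimed ~95%. That gap is what I mean by overestimating the callable regions.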
Personally, I would also prefer CG to release their alignments in the SAM/BAM format, rather than asking every user to do the conversion themselves. It is a pain to work with these huge files, and the conversion is slow. All my friends/colleagues only use the variant calls and never look at the alignments (the SAM that Broad got was generated at CG). While sequencing vendors insist that platform-specific information is useful, which is definitely true, I prefer to treat all platforms the same way. I have heard that for SOLiD and 454, platform-independent data analyses can also yield good results, good enough for most research.
Anyway, I really appreciate that CG has released their genome data, which has been a great help to the community. Various people, myself included, have also learned from their variant calling pipeline.
Hi Heng, can you elaborate on the following quote a little more? "For now, my major concern about the CG SNP calls is they have overestimated regions of the genome they can make calls and as a result they have underestimated heterozygosity."