Has Anyone Had Experience With Complete Genomics Sequencing And Data Analysis?
13.5 years ago
jvijai ★ 1.2k

From what I understand, CGI delivers its sequencing data in a format that is not compatible with the other platforms out there.

There was an announcement of using DNANexus for their data visualization, but no further details were available.

  1. Has anyone worked with CGAtools?
  2. What are the pros and cons for the uninitiated end-users?
  3. Since the data can be converted to SAM (with map2sam), are the subsequent steps similar to those for SOLiD and Illumina data?
  4. Is the alignment of reads still incompatible with other tools?

Thanks
~VJ

sequencing format next-gen sam • 6.7k views
13.5 years ago
lh3 33k

If CG has not changed their pipeline much, CG alignments mainly consist of two parts: the initial mapping and the assembly around variants (called "evidence", as I remember). Initially CG only provided a tool to convert the mapping to SAM, but I think they have recently implemented another tool to convert the assembly as well. Broad once tried GATK on the mapping (the first part). The SNP calls were reasonable, but not as good as CG's own calls. I do not know whether including the second part would greatly improve the accuracy. I have not seen such an experiment.

CG has done a very impressive job on SNP/indel calling (I am not experienced enough to comment on SVs), especially given their short and fragmented reads. Nonetheless, with much longer HiSeq reads, I think Illumina pulls ahead, at least in comparison to one of their call sets made a year ago. CG is improving their sequencing machines and may have improved the variant caller since then. I do not have an updated comparison between platforms.

For now, my major concern about the CG SNP calls is they have overestimated regions of the genome they can make calls and as a result they have underestimated heterozygosity. In comparison to HiSeq, CG calls 5% fewer SNPs due to shorter reads, but makes calls in ~95% of the genome (I could be wrong about the exact percentage, but it should be close), similar to the percentage from the HiSeq data set. The HiSeq heterozygosity matches the existing publications reasonably well, so with fewer SNPs over a similar callable fraction, CG must have underestimated it.
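
The arithmetic behind this argument can be sketched in a few lines of Python. This is only an illustration: the genome size and SNP counts are hypothetical placeholders, not figures from either data set, and SNP calls per callable base is used as a rough stand-in for heterozygosity. Only the 5% and ~95% figures come from the text above.

```python
def snps_per_callable_base(snp_calls: int, callable_bases: int) -> float:
    """SNP calls divided by the number of bases declared callable."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return snp_calls / callable_bases


# Hypothetical placeholders, chosen only to make the ratio easy to follow.
GENOME_SIZE = 3_000_000_000
HISEQ_SNPS = 2_000_000

# From the post: both platforms claim roughly 95% of the genome as callable,
# and CG calls about 5% fewer SNPs than HiSeq.
callable_bases = GENOME_SIZE * 95 // 100
cg_snps = HISEQ_SNPS * 95 // 100

hiseq_rate = snps_per_callable_base(HISEQ_SNPS, callable_bases)
cg_rate = snps_per_callable_base(cg_snps, callable_bases)

# Same denominator, fewer SNPs: the CG estimate is lower by the same ~5%.
relative_rate = cg_rate / hiseq_rate
print(f"CG rate relative to HiSeq: {relative_rate:.3f}")
```

If CG had instead reported a smaller callable fraction, the denominator would shrink along with the SNP count and the two rates would move closer together, which is the point of the concern.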

Personally, I would also prefer CG to release their alignments in SAM/BAM format rather than asking every user to do the conversion themselves. Working with these huge files is a pain, and the conversion is slow. All my friends and colleagues use only the variant calls and never look at the alignments (the SAM that Broad got was generated at CG). While sequencing vendors insist that platform-specific information is useful, which is definitely true, I prefer to treat all platforms the same way. I have heard that for SOLiD and 454, platform-independent data analyses can also yield good results, good enough for most research.

Anyway, I really appreciate that CG has released their genome data, which has been a great help to the community. Various people, me included, have also learned from their variant calling pipeline.


Hi Heng, can you elaborate a little more on the following quote? "For now, my major concern about the CG SNP calls is they have overestimated regions of the genome they can make calls and as a result they have underestimated heterozygosity."

13.5 years ago

CGAtools is a command-line program. CGI data also still requires some bioinformatics input to answer anything but the most rudimentary questions. You cannot expect to convert to SAM and then follow the usual Illumina/SOLiD data processing. CGI generally presents data that is a few steps removed from alignment, and getting back to raw reads aligned to the genome is not a well-solved problem, at least in my experience.


Sean, do you mean that the SAM format for CGI data is different from the published SAM format?


The SAM format they produce is standard SAM. However, CGI has two types of mappings to the genome: the mappings and the "evidence" files. They are not mutually exclusive and are used in different parts of their pipeline. For this reason, reproducing variant calls of one type or another from the SAM-converted mapping and evidence files is not going to be straightforward.

13.5 years ago

We are just starting to look into CGI data.

Although the data they provide is useful, they assume diploidy in CNA and variant calling, an assumption that does not hold well when working with tumours.

We converted to SAM. It is a "standard" SAM, but different from what you would get from bwa. For instance, it is not yet clear to us which reads have multiple alignments.

  1. CGAtools is a command-line program for dealing with their files. We have not used it much yet, but it seems OK.
  2. If you are looking at "normal" genomes, they have already done quite a bit for you, and it may well be very well done (they do de novo assembly around the regions they are going to 'call').
  3. As I said before, some programs might rely on tags that are not present in the SAM file produced by map2sam. However, it should work. I have no idea how the quality scores compare, though.
  4. You cannot align their reads yourself (I think), as they have large gaps and, as far as I know, the usual aligners can't cope. Please let me know if there is one that can.
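
One way to check whether a converted file carries the optional tags a downstream tool expects is to count tags across records. Below is a minimal sketch in plain Python that reads SAM text lines; the example records are made up for illustration and are not map2sam output. `NH` is the SAM optional field for the number of reported alignments of a read, and which tags map2sam actually emits is left for you to inspect on your own files.

```python
from collections import Counter


def optional_tags(sam_line: str) -> list[str]:
    """Return the two-letter tag names from the optional fields (column 12 onward)."""
    fields = sam_line.rstrip("\n").split("\t")
    return [field.split(":", 1)[0] for field in fields[11:]]


def tag_census(sam_lines):
    """Count, per tag, how many alignment records carry it (header lines are skipped)."""
    counts = Counter()
    n_records = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        n_records += 1
        counts.update(set(optional_tags(line)))
    return n_records, counts


def missing_tags(sam_lines, required):
    """List the required tags that are absent from at least one record."""
    n_records, counts = tag_census(sam_lines)
    return sorted(tag for tag in required if counts[tag] < n_records)


# Made-up records: the second one lacks the NH tag.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "r1\t0\tchr1\t100\t30\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0\tNH:i:1",
    "r2\t0\tchr1\t200\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1",
]

n_records, counts = tag_census(example)
absent = missing_tags(example, ["NH", "NM"])
print(n_records, dict(counts), absent)
```

On a real file you would pass an open file handle instead of the `example` list, since the functions only iterate over lines.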

I want to offer one clarification. Depending on whether samples are designated as ‘tumor’ or ‘non-tumor’ (aka normal), sequence data at CGI is processed differently for copy number analysis. This is in large part because CGI doesn’t want to assume diploidy in tumor samples. Additionally, if a sample is designated ‘tumor’, our model does not assume that coverage should correspond to integer ploidy levels.
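
To make the distinction concrete, here is a toy sketch of the difference between forcing coverage onto integer ploidy levels and leaving it continuous. It is not CGI's model: the function name, the simple coverage ratio, and the baseline ploidy of 2 are all assumptions made for illustration.

```python
def copy_number_estimate(sample_cov: float, baseline_cov: float,
                         is_tumor: bool, baseline_ploidy: int = 2) -> float:
    """Toy estimate: scale the baseline ploidy by the relative coverage.

    For a non-tumor sample the value is snapped to the nearest integer level;
    for a tumor sample it is left as a continuous value.
    """
    if baseline_cov <= 0:
        raise ValueError("baseline_cov must be positive")
    raw = baseline_ploidy * sample_cov / baseline_cov
    return raw if is_tumor else float(round(raw))


# Hypothetical coverage values for one window of the genome.
tumor_estimate = copy_number_estimate(52.0, 40.0, is_tumor=True)
normal_estimate = copy_number_estimate(52.0, 40.0, is_tumor=False)
print(tumor_estimate, normal_estimate)
```

A continuous value is what you would expect when only a fraction of the cells in a tumor sample carry a gain or loss, which is why snapping to integers can mislead there.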

13.4 years ago
Len Trigg ★ 1.6k

The RTG Investigator package from http://www.realtimegenomics.com makes a good addition to the results that Complete Genomics supplies, letting you perform your own full reanalysis. The RTG tools support mapping Complete Genomics reads (which is particularly useful if you want to map to a genome other than the one that CG uses) and produce SAM files as output. You can then use the subsequent RTG tools for coverage, SNP/indel calling, etc., and most third-party tools should also work fine.

13.4 years ago

Hi VJ,

CGI delivers data in a format that is specific to our platform so that we can fully represent the richness of information that we generate. We fully appreciate the need for data to be interoperable and compatible with existing tools available to the community; thus, we provide CGA Tools such as map2sam and evidence2sam that convert our mappings to standard SAM/BAM format. In addition, we are developing software partnerships to increase the range of informatics solutions that will help our customers explore data in additional value-added ways. There is more information regarding support of CGI data in DNAnexus on our website. We have a Product Note and FAQs that address the upload, visualization, filtering, and export of CGI data in DNAnexus. The entire 69 Genome Data Set that we have made publicly available has been uploaded to DNAnexus for researchers to explore.

As Heng mentioned in his answer, our mappings consist of two parts, initial mappings and evidence mappings (from the assembly performed to resolve variations), and each can be converted to SAM/BAM. Depending on what you would like to achieve in downstream analysis, either the converted initial mappings or the converted evidence mappings can be the appropriate starting point. We are currently building a tool that will allow you to combine the initial and evidence mappings and convert them to SAM/BAM, creating a file that will serve as a better input for variant calling workflows or for visualization and interrogation of read support at loci of interest.

If you need more information on anything, please contact us at support@completegenomics.com or 1-855-267-5383.


Hi Anoop, you mentioned that you are building a tool that can combine the initial and evidence mappings and convert them to SAM/BAM. Is this tool available now?
