Hi,
I have recently been working with data from the Personal Genome Project. I downloaded the raw data for one individual's whole genome.
My main concern is the format of the data. Complete Genomics releases the individuals' genomes in its own format, called masterVar, which looks like this:
#ASSEMBLY_ID GS000014558-ASM
#COSMIC COSMIC v48
#DBSNP_BUILD dbSNP build 132
#GENOME_REFERENCE NCBI build 37
#SAMPLE GS01669-DNA_D02
#GENERATED_BY cgatools
#GENERATED_AT 2012-Sep-28 19:43:38.251270
#SOFTWARE_VERSION 2.0.4.14
#FORMAT_VERSION 2.0
#GENERATED_BY dbsnptool
#TYPE VAR-ANNOTATION
>locus ploidy allele chromosome begin end varType reference alleleSeq varScoreVAF varScoreEAF varQuality hapLink xRef
17 2 all chr1 11365 11370 ref = =
302 2 1 chr1 21579 21580 snp C T 123 123 VQHIGH dbsnp.83:rs526642
302 2 2 chr1 21579 21580 snp C T 153 153 VQHIGH dbsnp.83:rs526642
They provide some tools for working with it, and I tried to convert the file to VCF with this tool, but what I got is an odd VCF with duplicated entries and inconsistent information.
Has anyone dealt with it before?
Thanks in advance!
P.
Hi, I'm dealing with the same issue. Did you figure out a way to convert Complete Genomics data to VCF or to PLINK ped format without bugs?
Thank you.
Hi, unfortunately I was not able to make it work and have given up for now. As I said, I tried different conversion tools, but all of them returned a very odd file with clear errors compared to the original. I am really surprised that no further information explaining this issue could be found. Anyway, if you learn anything more, let me know. Best,
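For anyone landing here later: in the sample in the first post, locus 302 appears twice, once per allele (the `allele` column is 1 and then 2). A converter that writes one VCF line per input row would produce exactly the kind of duplicated entries described. Below is a minimal Python sketch, not a replacement for the vendor tools, that reads such a file and merges the per-allele rows of simple SNP loci into one record with a genotype. It rests on several assumptions: the original file is tab-delimited (the tabs are lost in the post), coordinates are zero-based half-open so the VCF position is `begin + 1`, and only single-base `snp`/`ref` rows are handled. The function names `read_var_rows` and `snp_records` are made up for illustration.

```python
import io


def read_var_rows(handle):
    """Yield one dict per data row, keyed by the column names on the '>' line.

    Assumes a tab-delimited file with '#' metadata lines, as in the sample.
    """
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and metadata
        if line.startswith(">"):
            header = line[1:].split("\t")
            continue
        if header is None:
            raise ValueError("data row found before the '>' header line")
        fields = line.split("\t")
        # Pad rows whose trailing empty columns were dropped.
        fields += [""] * (len(header) - len(fields))
        yield dict(zip(header, fields))


def snp_records(rows):
    """Merge per-allele rows into (chrom, pos, ref, alt, genotype) tuples.

    Only loci made of single-base 'snp'/'ref' rows are reported; everything
    else (no-calls, indels, 'all' rows) is skipped in this sketch.
    """
    loci = {}
    for row in rows:
        if row["allele"] not in ("1", "2"):
            continue  # 'all' rows describe both alleles at once
        key = (row["locus"], row["chromosome"], int(row["begin"]), int(row["end"]))
        loci.setdefault(key, []).append(row)

    for (_locus, chrom, begin, end), group in loci.items():
        types = {r["varType"] for r in group}
        if end - begin != 1 or "snp" not in types or not types <= {"snp", "ref"}:
            continue
        group.sort(key=lambda r: r["allele"])
        ref = group[0]["reference"]
        alts = []
        for r in group:
            if r["alleleSeq"] != ref and r["alleleSeq"] not in alts:
                alts.append(r["alleleSeq"])
        # 0 = reference allele, 1.. = index into the ALT list (unphased).
        gt = "/".join(
            "0" if r["alleleSeq"] == ref else str(alts.index(r["alleleSeq"]) + 1)
            for r in group
        )
        # Assumption: begin is zero-based, so the 1-based VCF POS is begin + 1.
        yield (chrom, begin + 1, ref, ",".join(alts), gt)


# The three data rows from the first post, with tabs restored by hand.
SAMPLE = "\n".join([
    "#TYPE\tVAR-ANNOTATION",
    ">locus\tploidy\tallele\tchromosome\tbegin\tend\tvarType\treference"
    "\talleleSeq\tvarScoreVAF\tvarScoreEAF\tvarQuality\thapLink\txRef",
    "17\t2\tall\tchr1\t11365\t11370\tref\t=\t=",
    "302\t2\t1\tchr1\t21579\t21580\tsnp\tC\tT\t123\t123\tVQHIGH\t\tdbsnp.83:rs526642",
    "302\t2\t2\tchr1\t21579\t21580\tsnp\tC\tT\t153\t153\tVQHIGH\t\tdbsnp.83:rs526642",
])

if __name__ == "__main__":
    for record in snp_records(read_var_rows(io.StringIO(SAMPLE))):
        print(record)
```

With a real file you would pass an open file handle instead of the `io.StringIO` sample. The two rows for locus 302 collapse into a single homozygous record rather than two separate lines, which is the behaviour I would check any converter's output against.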