Question

Mastervar - Complete Genomics Data Format To Vcf

5

11.4 years ago

Peixe ▴ 660

Hi,

I have recently been working with data from the Personal Genome Project. I downloaded the raw data for one individual's whole genome.

My main concern is the data format. Complete Genomics releases the individuals' genomes in its own format, called masterVar, which looks like this:

#ASSEMBLY_ID    GS000014558-ASM
#COSMIC    COSMIC v48
#DBSNP_BUILD    dbSNP build 132
#GENOME_REFERENCE    NCBI build 37
#SAMPLE    GS01669-DNA_D02
#GENERATED_BY    cgatools
#GENERATED_AT    2012-Sep-28 19:43:38.251270
#SOFTWARE_VERSION    2.0.4.14
#FORMAT_VERSION    2.0
#GENERATED_BY    dbsnptool
#TYPE    VAR-ANNOTATION
>locus    ploidy    allele    chromosome    begin    end    varType    reference    alleleSeq    varScoreVAF    varScoreEAF    varQuality    hapLink    xRef
17    2    all    chr1    11365    11370    ref    =    =                    
302    2    1    chr1    21579    21580    snp    C    T    123    123    VQHIGH        dbsnp.83:rs526642
302    2    2    chr1    21579    21580    snp    C    T    153    153    VQHIGH        dbsnp.83:rs526642
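
For anyone who wants to inspect such a file before converting it, here is a minimal Python sketch of reading this layout. It assumes the columns are tab-separated (the spacing in the excerpt above is just how the page rendered it), that `#` lines are key/value metadata, and that the line starting with `>` names the columns. The function name `parse_mastervar` is made up for illustration and is not part of cgatools.

```python
import io

def parse_mastervar(handle):
    """Split a masterVar-style text stream into (metadata, records)."""
    metadata = {}
    columns = None
    records = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Metadata lines look like "#KEY<TAB>value" (assumed tab-separated)
            key, _, value = line[1:].partition("\t")
            # A key can repeat (GENERATED_BY does in the excerpt), so keep all values
            metadata.setdefault(key, []).append(value)
        elif line.startswith(">"):
            # The column header line starts with ">"
            columns = line[1:].split("\t")
        elif columns is not None:
            fields = line.split("\t")
            # Pad short rows so missing trailing columns become empty strings
            fields += [""] * (len(columns) - len(fields))
            records.append(dict(zip(columns, fields)))
    return metadata, records

# Small hand-made input modelled on the excerpt (fewer columns, for brevity)
sample = (
    "#GENOME_REFERENCE\tNCBI build 37\n"
    "#GENERATED_BY\tcgatools\n"
    "#GENERATED_BY\tdbsnptool\n"
    ">locus\tploidy\tallele\tchromosome\tbegin\tend\tvarType\treference\talleleSeq\n"
    "302\t2\t1\tchr1\t21579\t21580\tsnp\tC\tT\n"
)
metadata, records = parse_mastervar(io.StringIO(sample))
print(metadata["GENERATED_BY"], records[0]["varType"])
```

Values are kept as strings on purpose; converting `begin` and `end` to integers is left to the caller.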

They provide some tools for working with it, and I tried converting the file to VCF with this tool, but what I get is an odd VCF with duplicated entries and inconsistent information.
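
One way to put a number on the "duplicated entries" is to count how often the same site appears in the converted file. The sketch below is only an illustration: it treats two records as duplicates when CHROM, POS, REF and ALT all match, which is an assumption about what "duplicated" means here, and `duplicated_vcf_sites` is a hypothetical helper name.

```python
from collections import Counter

def duplicated_vcf_sites(lines):
    """Return {(chrom, pos, ref, alt): count} for sites seen more than once."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and VCF header lines ("##..." and "#CHROM...")
        if not line.strip() or line.startswith("#"):
            continue
        # The first five VCF columns are CHROM, POS, ID, REF, ALT
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        counts[(chrom, int(pos), ref, alt)] += 1
    return {site: n for site, n in counts.items() if n > 1}

# Hand-made example lines, not real converter output
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t21580\trs526642\tC\tT\t.\t.\t.\n",
    "1\t21580\trs526642\tC\tT\t.\t.\t.\n",
    "1\t30000\t.\tA\tG\t.\t.\t.\n",
]
print(duplicated_vcf_sites(example))
```

To run it on a real file, pass an open file handle, for example `duplicated_vcf_sites(open("converted.vcf"))`, where `converted.vcf` is a placeholder name.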

Has anyone dealt with it before?

Thanks in advance!

P.

vcf • 6.2k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 11.4 years ago by Peixe ▴ 660

1

Hi, I'm dealing with the same issue. Did you figure out a way to convert Complete Genomics data to VCF or to PLINK PED format without bugs?

Thank you.

ADD REPLY • link 11.2 years ago by galina.erikson ▴ 70

0

Hi, unfortunately I was not able to make it work and have given up for now. As I said, I tried different conversion tools, but all of them returned a very odd file with clear errors compared to the original. I am really surprised that I could not find any further information explaining this issue... Anyway, if you learn anything more, let me know. Best,

ADD REPLY • link 11.2 years ago by Peixe ▴ 660

Ram · Answer 1 · 2014-02-21

4

10.8 years ago

ash.avalon ▴ 40

You could use cgatools with the following command. I left out the CGA_CEHQ and CGA_CEGL fields:

/software/cgatools/current/bin/cgatools mkvcf --beta --reference build36.crr \
            --source-names masterVar \
            --genome-root $root \
            --master-var <(bzcat $masterVar) \
            --field-names GT,PS,NS,AN,AC,AF,SS,FT,CGA_XR,CGA_ALTCALLS,CGA_FI,GQ,HQ,EHQ,GL,DP,AD,CGA_RDP,CGA_ODP,CGA_OAD,CGA_ORDP,CGA_PFAM,CGA_MIRB,CGA_RPT,CGA_SDO,CGA_SOMC,CGA_SOMR,CGA_SOMS,CGA_SOMF,GT,CGA_GP,CGA_NP,CGA_CP,CGA_PS,CGA_CT,CGA_TS,CGA_CL,CGA_LS,CGA_LAFS,CGA_LLAFS,CGA_ULAFS,CGA_SCL,CGA_SLS,CGA_LAFP,CGA_LLAFP,CGA_ULAFP,GT,FT,CGA_IS,CGA_IDC,CGA_IDCL,CGA_IDCR,CGA_RDC,CGA_NBET,CGA_ETS,CGA_KES,GT,FT,CGA_BF,CGA_MEDEL,MATEID,SVTYPE,CGA_BNDG,CGA_BNDGO,CGA_BNDMPC,CGA_BNDPOS,CGA_BNDDEF,CGA_BNDP

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 10.8 years ago by ash.avalon ▴ 40

0

Where do I find the full details of all these field descriptions?

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by MAPK ★ 2.1k

0

This pretty much worked for me. However, I first had to convert my genome.fa reference into the CRR format using another cgatools command:

$cgatools fasta2crr --input build36.fa.bz2 --output build36.crr

See the docs: http://cgatools.sourceforge.net/docs/1.4.0/cgatools-install.pdf

Also, for those not familiar with Linux: you need to assign your filename to the masterVar shell variable (masterVar=my_file.tsv.bz2) for the <(bzcat $masterVar) part to work. (Note: this is not to be confused with the masterVar string in --source-names masterVar, which tells the program the conversion type.)

Also, --genome-root $root was complaining, so I left that parameter out (which seems to be OK if you also leave out CGA_CEHQ and CGA_CEGL, as suggested above).

I am in the process of converting 380 files now, and it seems to be producing sensible VCF files.
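
For a batch of this size it can help to generate the command lines rather than type them. The following is a rough Python sketch, not a tested pipeline: it only builds the bash command string for one file, reusing the flags from the command quoted earlier in this thread. The `.vcf` output naming, the `mkvcf_command` helper and the file names are made up for illustration, and it assumes mkvcf writes the VCF to standard output when no output file is given.

```python
import shlex

def mkvcf_command(master_var, reference, field_names, cgatools="cgatools"):
    """Build a bash command line for one masterVar file (does not run it)."""
    out_path = master_var + ".vcf"  # hypothetical naming scheme
    parts = [
        shlex.quote(cgatools), "mkvcf", "--beta",
        "--reference", shlex.quote(reference),
        "--source-names", "masterVar",
        # Process substitution decompresses on the fly; this needs bash, not sh
        "--master-var", "<(bzcat " + shlex.quote(master_var) + ")",
        "--field-names", shlex.quote(",".join(field_names)),
    ]
    # Assumption: the VCF goes to stdout, so redirect it to a file
    return " ".join(parts) + " > " + shlex.quote(out_path)

# Placeholder inputs; substitute your own file list and field names
for path in ["sample1.tsv.bz2", "sample2.tsv.bz2"]:
    print(mkvcf_command(path, "build36.crr", ["GT", "PS", "FT"]))
```

Each generated string could then be run with `subprocess.run(["bash", "-c", cmd], check=True)`. File names are passed through `shlex.quote` so that paths containing spaces do not break the command.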

(For the record, the masterVar2VCF_rev1_ShanYang.zip converter on the Complete Genomics website did not work for me either; it produced corrupt files.)

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 6.4 years ago by kenny.bryan • 0