Add Phenotype OMIM number to vcf file
1
0
Entering edit mode
7.3 years ago
psmirin • 0

Hi,

How can I annotate or compare the variants I have got in my vcf file that include the gene name to known mutations in the same gene but different chr location for instance from OMIM? For example I have searched manually on OMIM for MED25 gene, this is the output:

Gene-Phenotype Relationships Location Phenotype Phenotype MIM number Inheritance Phenotype mapping key 19q13.33 ?Charcot-Marie-Tooth disease, type 2B2 605589 AR 3 Basel-Vanagait-Smirin-Yosef syndrome 616449 AR 3

And my file is a multi vcf :

chr19   50321587        rs114843375;138201      A       G       6630.07 PASS    AC=10;A_F=0.052;AN=192;BaseQRankSum=1.74;ClippingRankSum=0.069;DP=4747;FS=1.147;GQ_MEAN=141.68;GQ_STDDEV=124.09;InbreedingCoeff=0.1560;MLEAC=10;MLEA_F=0.052;MQ=57.41;MQ0=0;MQRankSum=0.789;NCC=0;POSITIVE_TRAIN_SITE;QD=15.07;ReadPosRankSum=0.548;SOR=0.593;VQSLOD=5.04;culprit=MQ;EFF=UTR_5_PRIME(MODIFIER||12||MED25|protein_coding|CODING|NM_030973.3|1);AF=4.645e-03;COMMON=1;AF_ESP=0.0112;AF_EXAC=0.00869;AF_TGP=0.0102;ALLELEID=141904;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000019.9:g.50321587A>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=MED25:81857;MC=SO:0001623|5_prime_UTR_variant;ORIGIN=1;RS=114843375        GT:AD:DP:GQ:PL  0/0:42,0:42:99:0,103,1320       0/0:48,0:48:99:0,120,1800

The output I want is to add the Phenotype MIM number in the end of the row:

chr19   50321587        rs114843375;138201      A       G       6630.07 PASS    AC=10;A_F=0.052;AN=192;BaseQRankSum=1.74;ClippingRankSum=0.069;DP=4747;FS=1.147;GQ_MEAN=141.68;GQ_STDDEV=124.09;InbreedingCoeff=0.1560;MLEAC=10;MLEA_F=0.052;MQ=57.41;MQ0=0;MQRankSum=0.789;NCC=0;POSITIVE_TRAIN_SITE;QD=15.07;ReadPosRankSum=0.548;SOR=0.593;VQSLOD=5.04;culprit=MQ;EFF=UTR_5_PRIME(MODIFIER||12||MED25|protein_coding|CODING|NM_030973.3|1);AF=4.645e-03;COMMON=1;AF_ESP=0.0112;AF_EXAC=0.00869;AF_TGP=0.0102;ALLELEID=141904;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000019.9:g.50321587A>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=MED25:81857;MC=SO:0001623|5_prime_UTR_variant;ORIGIN=1;RS=114843375        GT:AD:DP:GQ:PL  0/0:42,0:42:99:0,103,1320       0/0:48,0:48:99:0,120,1800  605589, 616449

How can I approach this?

Thank you,

Pola

gene sequencing • 2.3k views
ADD COMMENT
0
Entering edit mode

Hi @Pierre Lindenbaum,

I have generated a file through http://www.ensembl.org/biomart (like you suggested)

The file is :

19      49818279        49838816        MED25   610197  605589

19      49818279        49838816        MED25   610197  616449

My file:

chr19   50338843        rs193291405;220933      C       G       PASS    1       4.401e-03       0.602   MedGen:CN043576|MedGen:CN169374 Charcot-Marie-Tooth_disease,_type_2|not_specified       NC_000019.9:g.50338843C>G        Benign/Likely_benign    single_nucleotide_variant       1       MODERATE        MED25   NM_030973.3     protein_coding  NON_SYNONYMOUS_CODING   MISSENSE        gCc/gGc A576G   CODING  4.66    B9TX30,B5ME50,Q71SY5     N       D,D,D,D,T,D,D,D B,B,B   N       5.71    B,B,B   .       0/0     0/0      55,0    38,0

I wanted to get this output:

chr19   50338843        rs193291405;220933      C       G       PASS    1       4.401e-03       0.602   MedGen:CN043576|MedGen:CN169374 Charcot-Marie-Tooth_disease,_type_2|not_specified       NC_000019.9:g.50338843C>G        Benign/Likely_benign    single_nucleotide_variant       1       MODERATE        MED25   NM_030973.3     protein_coding  NON_SYNONYMOUS_CODING   MISSENSE        gCc/gGc A576G   CODING  4.66    B9TX30,B5ME50,Q71SY5     N       D,D,D,D,T,D,D,D B,B,B   N       5.71    B,B,B   .       0/0     0/0      55,0    38,0    **610197:605589:616449**

I have tried to run the following command:

    awk 'NR==FNR{a[$4]=$5":"$6;next}{print $0,"\t",a[$17]}' FS="\t" med25omim med25test

This is the output that I get:

chr19   50338843        rs193291405;220933      C       G       PASS    1       4.401e-03       0.602   MedGen:CN043576|MedGen:CN169374 Charcot-Marie-Tooth_disease,_type_2|not_specified       NC_000019.9:g.50338843C>G        Benign/Likely_benign    single_nucleotide_variant       1       MODERATE        MED25   NM_030973.3     protein_coding  NON_SYNONYMOUS_CODING   MISSENSE        gCc/gGc A576G   CODING  4.66    B9TX30,B5ME50,Q71SY5     N       D,D,D,D,T,D,D,D B,B,B   N       5.71    B,B,B   .       0/0     0/0      55,0    38,0     **610197:616449**

I am missing this omimID 605589 Can you please help me to overcome this? Thank you very much, Pola

ADD REPLY
1
Entering edit mode
7.3 years ago

use biomart http://www.ensembl.org/biomart/martview/ to build a BED file containing

chrom, start,end, mim_gene_accession

remove the lines without "mim_gene_accession", substract '1' to the start column to switch to a 0-based reference, sort on chrom/start, compress with bgzip and index with tabix. Something like:

$ awk '/^Gene/ {next;} {if($4!="") printf("%s\t%d\t%s\t%s\n",$3,int($1)-1,$2,$4);}' biomart.txt | sort -t $'\t' -k1,1 -k2,2n | bgzip > omim.bed.gz
$ tabix -f -p bed omim.bed.gz

then use something like http://lindenb.github.io/jvarkit/VCFBed.html to transfer the information into the VCF:

java -jar dist/vcfbed.jar -B omim.bed.gz -f '${4}' -T OMIM_ACN your.vcf
ADD COMMENT
0
Entering edit mode

Hi @Pierre Lindenbaum,

I have generated a file through http://www.ensembl.org/biomart (like you suggested)

The file is :

19      49818279        49838816        MED25   610197  605589

19      49818279        49838816        MED25   610197  616449

My file:

chr19   50338843        rs193291405;220933      C       G       PASS    1       4.401e-03       0.602   MedGen:CN043576|MedGen:CN169374 Charcot-Marie-Tooth_disease,_type_2|not_specified       NC_000019.9:g.50338843C>G        Benign/Likely_benign    single_nucleotide_variant       1       MODERATE        MED25   NM_030973.3     protein_coding  NON_SYNONYMOUS_CODING   MISSENSE        gCc/gGc A576G   CODING  4.66    B9TX30,B5ME50,Q71SY5     N       D,D,D,D,T,D,D,D B,B,B   N       5.71    B,B,B   .       0/0     0/0      55,0    38,0

I wanted to get this output:

chr19   50338843        rs193291405;220933      C       G       PASS    1       4.401e-03       0.602   MedGen:CN043576|MedGen:CN169374 Charcot-Marie-Tooth_disease,_type_2|not_specified       NC_000019.9:g.50338843C>G        Benign/Likely_benign    single_nucleotide_variant       1       MODERATE        MED25   NM_030973.3     protein_coding  NON_SYNONYMOUS_CODING   MISSENSE        gCc/gGc A576G   CODING  4.66    B9TX30,B5ME50,Q71SY5     N       D,D,D,D,T,D,D,D B,B,B   N       5.71    B,B,B   .       0/0     0/0      55,0    38,0    **610197:605589:616449**

I have tried to run the following command:

    awk 'NR==FNR{a[$4]=$5":"$6;next}{print $0,"\t",a[$17]}' FS="\t" med25omim med25test

This is the output that I get:

chr19   50338843        rs193291405;220933      C       G       PASS    1       4.401e-03       0.602   MedGen:CN043576|MedGen:CN169374 Charcot-Marie-Tooth_disease,_type_2|not_specified       NC_000019.9:g.50338843C>G        Benign/Likely_benign    single_nucleotide_variant       1       MODERATE        MED25   NM_030973.3     protein_coding  NON_SYNONYMOUS_CODING   MISSENSE        gCc/gGc A576G   CODING  4.66    B9TX30,B5ME50,Q71SY5     N       D,D,D,D,T,D,D,D B,B,B   N       5.71    B,B,B   .       0/0     0/0      55,0    38,0     **610197:616449**

I am missing this omimID 605589 Can you please help me to overcome this? Thank you very much, Pola

ADD REPLY

Login before adding your answer.

Traffic: 1743 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6