Question

convert CSQ and ANN fields to a tab delimitet document with its headers

1

Entering edit mode

7.6 years ago

paumarc ▴ 20

Hello everybody,

I am trying to parse CSQ and AND fields from a VCF using PERL. The line describing the fields looks like

INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format:Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">

and the line with the information looks like

CSQ=A|synonymous_variant|LOW|IL17RE|ENSG00000163701|Transcript|ENST00000295980|protein_coding|15/17||ENST00000295980.3:c.1344G>A|ENST00000295980.3:c.1344G>A(p.%3D)|1461|1344|448|P|ccG/ccA|rs455863|1||1||SNV|1|HGNC|18439|YES|||CCDS2589.1|ENSP00000295980|Q8NFR9||UPI000003E87E||||hmmpanther:PTHR15583&hmmpanther:PTHR15583:SF5&Pfam_domain:PF15037||A:0.3620|A:0.3511|A:0.4054|A:0.4625|A:0.0933|A:0.5447|A:0.3211|A:0.4151|A:0.5324|A:0.4368|A:0.461|A:0.4018|A:0.4609|A:0.08017|A:0.5853|A:0.538|A:0.4945||||||||||||

when i use a split command to build a @info_csq ad @values_csq files i get a different number of element in both arrays

Can anyone tell me how to correlate the field header and its value ? i just tried to do it with the Vcf.pm library, but it sems not possible

thanks

VCF CSQ ANN PARSING PERL • 5.4k views

ADD COMMENT • link updated 7.6 years ago by Pierre Lindenbaum 166k • written 7.6 years ago by paumarc ▴ 20

score 4 · Answer 1 · 2018-01-17

4

Entering edit mode

7.6 years ago

Pierre Lindenbaum 166k

my tools vcf2table breaks the ANN field: http://lindenb.github.io/jvarkit/VcfToTable.html

 ANN
 +-----------------------+--------+----------+----------+-----------------+-------------+-----------------+-----------------+-------------------------+----------+
 | SO                    | Allele | Impact   | GeneName | GeneId          | FeatureType | FeatureId       | BioType         | HGVsc                   | Distance |
 +-----------------------+--------+----------+----------+-----------------+-------------+-----------------+-----------------+-------------------------+----------+
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2326     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2282     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2282     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2282     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2281     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | retained_intron | n.-2285_-2281delCAAAA   | 2281     |
 +-----------------------+--------+----------+----------+-----------------+-------------+-----------------+-----------------+-------------------------+----------+

Can anyone tell me how to correlate the field header and its value ? i

you need to split by ', ' to get a prediction for for each transcript and then split by '|' to get the fields: http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

that is perfect, thanks, anyway your script do not parse the csq field, where i have the problem between the different number of fields and headaers

ADD REPLY • link 7.5 years ago by paumarc ▴ 20

1

Entering edit mode

your script do not parse the csq field

it does

VEP
 +--------------------------+------+----------------+------------+-----------------+--------+------------------+-----------------------------------------------+-------------+---------+-----------------+----------------------+
 | PolyPhen                 | EXON | SIFT           | ALLELE_NUM | Gene            | SYMBOL | Protein_position | Consequence                                   | Amino_acids | Codons  | Feature         | BIOTYPE              |
 +--------------------------+------+----------------+------------+-----------------+--------+------------------+-----------------------------------------------+-------------+---------+-----------------+----------------------+
 | probably_damaging(0.956) | 8/9  | deleterious(0) | 1          | ENSG00000102967 | DHODH  | 346/395          | missense_variant                              | R/W         | Cgg/Tgg | ENST00000219240 | protein_coding       |
 |                          | 3/4  |                | 1          | ENSG00000102967 | DHODH  |                  | non_coding_exon_variant&nc_transcript_variant |             |         | ENST00000571392 | retained_intron      |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  |                  | downstream_gene_variant                       |             |         | ENST00000572003 | retained_intron      |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  |                  | downstream_gene_variant                       |             |         | ENST00000573843 | retained_intron      |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  |                  | downstream_gene_variant                       |             |         | ENST00000573922 | processed_transcript |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  | -/193            | intron_variant                                |             |         | ENST00000574309 | protein_coding       |
 | probably_damaging(0.946) | 8/9  | deleterious(0) | 1          | ENSG00000102967 | DHODH  | 344/393          | missense_variant                              | R/W         | Cgg/Tgg | ENST00000572887 | protein_coding       |
 +--------------------------+------+----------------+------------+-----------------+--------+------------------+-----------------------------------------------+-------------+---------+-----------------+----------------------+

ADD REPLY • link 7.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

ups

i did not saw it, thanks

ADD REPLY • link 7.5 years ago by paumarc ▴ 20

0

Entering edit mode

i finally have managed with the csq field, thanks

ADD REPLY • link 7.5 years ago by paumarc ▴ 20

0

Entering edit mode

Hi, I am running into the same problem of parsing the CSQ field, I was wondering how you solved it? It'd be a huge help, thanks!

ADD REPLY • link 7.5 years ago by jiyoochang95 • 0

0

Entering edit mode

it can be done by two steeps

first, i parsed the info csq field to keep the "headers" in an array, let's call it @CSQheaders

the second steps is parsing the data field ($datafield), this field must be processed in a two nested loops (in Perl a foreach loop and a for loop, I don't know your favorite language).

the final structure will look like:

     foreach $spliẗ_by_comma(split(/,/,$datafield)) <-$datafield contain the data parsed from the csq field

            @split_by_bar =split(/\|/,$split_by_comma) #put the data into an array

           #here comes the triki part

                   for $i(0..$#CSQheaders) (for those not proraming in perl $# returns the numers of elements in a array

                                print "$CSQheaders[$i]: $split_by_bar[$i]

i hope i could explain clearly enough. If you have any question just ask

ADD REPLY • link 7.5 years ago by paumarc ▴ 20