Hello everybody,
I am trying to parse the CSQ and ANN fields from a VCF using Perl. The header line describing the fields looks like
INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format:Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">
and the line with the information looks like
CSQ=A|synonymous_variant|LOW|IL17RE|ENSG00000163701|Transcript|ENST00000295980|protein_coding|15/17||ENST00000295980.3:c.1344G>A|ENST00000295980.3:c.1344G>A(p.%3D)|1461|1344|448|P|ccG/ccA|rs455863|1||1||SNV|1|HGNC|18439|YES|||CCDS2589.1|ENSP00000295980|Q8NFR9||UPI000003E87E||||hmmpanther:PTHR15583&hmmpanther:PTHR15583:SF5&Pfam_domain:PF15037||A:0.3620|A:0.3511|A:0.4054|A:0.4625|A:0.0933|A:0.5447|A:0.3211|A:0.4151|A:0.5324|A:0.4368|A:0.461|A:0.4018|A:0.4609|A:0.08017|A:0.5853|A:0.538|A:0.4945||||||||||||
When I use split to build the @info_csq and @values_csq arrays, I get a different number of elements in each array.
Can anyone tell me how to match each field header to its value? I tried to do it with the Vcf.pm library, but it does not seem to be possible.
thanks
That is perfect, thanks. However, your script does not parse the CSQ field, which is where I have the problem with the differing numbers of fields and headers.
it does
Oops
I did not see it, thanks
I have finally managed to parse the CSQ field, thanks
Hi, I am running into the same problem of parsing the CSQ field, I was wondering how you solved it? It'd be a huge help, thanks!
It can be done in two steps.
First, I parsed the INFO CSQ header line to keep the "headers" in an array, let's call it @CSQheaders.
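As a rough illustration of this first step, here is a minimal sketch in Python (not the Perl code actually used in this thread). It assumes the field names are the pipe-separated list that follows "Format:" inside the Description, as in the header line from the question. The function name csq_headers is made up for the example, and the header line is a shortened version of the one in the question.

```python
import re

def csq_headers(header_line):
    """Return the CSQ sub-field names listed after 'Format:' in the header line."""
    # Capture everything after 'Format:' up to the closing quote or '>'.
    match = re.search(r'Format:\s*([^">]+)', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in the CSQ header line")
    return match.group(1).strip().split("|")

# Shortened version of the header line from the question (only four fields kept).
header = ('INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format:Allele|Consequence|IMPACT|SYMBOL">')
headers = csq_headers(header)
print(len(headers), headers)
```

In Perl the same idea would be a regex capture followed by split /\|/ on the captured string.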
The second step is parsing the data field ($datafield). This field must be processed in two nested loops (in Perl, a foreach loop and a for loop; I don't know your favorite language).
the final structure will look like:
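A minimal sketch of the two nested loops, written in Python rather than Perl and not necessarily the same as the code used above. It assumes the outer loop runs over the comma-separated annotation blocks inside CSQ (VEP writes one block per annotated feature) and the inner loop pairs each header with its value. The name parse_csq, the shortened header list and the second annotation block in the example are all invented for illustration.

One point worth checking in Perl: split /\|/, $datafield drops trailing empty fields, while split /\|/, $datafield, -1 keeps them. Since the CSQ value in the question ends in a run of empty fields, this is a likely cause of the differing array sizes. Python's str.split keeps trailing empty fields, so the sketch does not need any special handling.

```python
def parse_csq(csq_value, headers):
    """Turn a CSQ value into a list of {header: value} dicts, one per annotation."""
    if csq_value.startswith("CSQ="):
        csq_value = csq_value[len("CSQ="):]
    records = []
    # Outer loop: one comma-separated block per annotated feature.
    for block in csq_value.split(","):
        values = block.split("|")  # trailing empty fields are kept
        if len(values) != len(headers):
            raise ValueError(f"expected {len(headers)} fields, got {len(values)}")
        record = {}
        # Inner loop: pair the i-th header with the i-th value.
        for i in range(len(headers)):
            record[headers[i]] = values[i]
        records.append(record)
    return records

# Shortened header list and made-up data, for illustration only.
headers = ["Allele", "Consequence", "IMPACT", "SYMBOL", "LoF"]
datafield = "CSQ=A|synonymous_variant|LOW|IL17RE|,A|upstream_gene_variant|MODIFIER|GENE2|"
for record in parse_csq(datafield, headers):
    print(record["SYMBOL"], record["Consequence"], repr(record["LoF"]))
```

With the real header from the question, headers would be the full list taken from the "Format:" section, and an empty string in a record simply means that VEP did not report a value for that field.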
I hope I have explained it clearly enough. If you have any questions, just ask.