##fileformat=VCFv4.3
##fileDate=20180421
##source=PLINKv2.00
##filedate=20180410
##contig=<ID=10,length=135524727>
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|REFSEQ_MATCH|SOURCE|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE">
#CHROM POS ID REF ALT
I would like to extract only the allele frequency (AF) values from the CSQ column. However, this is difficult because all of the annotations are packed into a single column. Is there a way to do this? Thank you.
Using bioalcidaejdk: the code contains a variable 'tools', which itself contains a parser for the VEP output. Some lines are duplicated when a variant has more than one transcript.
java -jar dist/bioalcidaejdk.jar -e 'println("CHROM\tPOS\tREF\tAF");stream().forEach(V->tools.getVepPredictions(V).stream().forEach(P->{println(V.getContig()+"\t"+V.getStart()+"\t"+V.getReference().getDisplayString()+"\t"+P.getByCol("AF"));}));'
CHROM POS REF AF
21 26960070 G 0.0014
21 26960070 G 0.0014
21 26960070 G 0.0014
21 26965148 G 0.7324
21 26965148 G 0.7324
21 26965148 G 0.7324
21 26965172 T 0.0106
21 26965172 T 0.0106
21 26965172 T 0.0106
21 26965205 T 0.7324
21 26965205 T 0.7324
21 26965205 T 0.7324
21 26976144 A 0.0004
21 26976144 A 0.0004
21 26976144 A 0.0004
(...)
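The same idea can be sketched in plain Python, without the 'tools' parser: read the sub-field names from the "Format: ..." part of the CSQ header line, find the position of AF, and pull that position out of every annotation. This is only a minimal sketch. The function names `csq_field_names` and `extract_csq_field` are made up for illustration, the three-field sample header is a shortened stand-in for the real one, and identical rows are dropped so that a variant with several transcripts is reported once.

```python
import re

def csq_field_names(header_lines):
    """Read the CSQ sub-field names from the 'Format: ...' part of the header."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                return match.group(1).strip().split("|")
    raise ValueError("no CSQ INFO header with a 'Format:' description found")

def extract_csq_field(vcf_lines, key="AF"):
    """Return unique (CHROM, POS, REF, value) rows for one CSQ sub-field."""
    vcf_lines = list(vcf_lines)
    idx = csq_field_names(vcf_lines).index(key)
    rows = []
    seen = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, info = cols[0], cols[1], cols[3], cols[7]
        for entry in info.split(";"):
            if not entry.startswith("CSQ="):
                continue
            # One comma-separated annotation per transcript/allele pair.
            for annotation in entry[len("CSQ="):].split(","):
                row = (chrom, pos, ref, annotation.split("|")[idx])
                if row not in seen:  # skip per-transcript duplicates
                    seen.add(row)
                    rows.append(row)
    return rows

# Made-up three-field example; a real header lists many more sub-fields.
sample = [
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|AF">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "21\t26960070\t.\tG\tA\t.\t.\tCSQ=A|missense_variant|0.0014,A|intron_variant|0.0014",
]
for row in extract_csq_field(sample):
    print("\t".join(row))
```

Passing another key, for example `extract_csq_field(lines, key="gnomAD_AF")`, should return that sub-field instead, provided the name appears in the header's Format list.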
OP wants the information contained in the VEP INFO/CSQ field, not INFO/AF.
Sorry, I misunderstood the question; ignore my answer. I'm not sure then. Maybe generate the VEP output in tab format to avoid the clustering, and then extract the AF column.
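If the annotations are written as a tab-delimited file, picking out one column is straightforward. Here is a rough Python sketch, assuming the file has meta lines starting with "##", a single header line starting with "#", and an AF column among the headers. That layout, the function name `extract_tab_column`, and the sample rows are assumptions for illustration, so check them against your actual output.

```python
def extract_tab_column(lines, column="AF"):
    """Return the values of one named column from tab-delimited output."""
    header = None
    values = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta lines
        if line.startswith("#"):
            # Assumed: the column header is the only line with a single '#'.
            header = line[1:].split("\t")
            continue
        if header is None:
            raise ValueError("data line seen before the column header")
        values.append(line.split("\t")[header.index(column)])
    return values

# Made-up sample in the assumed layout.
sample_tab = [
    "## meta line",
    "#Uploaded_variation\tLocation\tAllele\tAF",
    "var1\t21:26960070\tA\t0.0014",
    "var2\t21:26965148\tC\t0.7324",
]
print(extract_tab_column(sample_tab))
```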