I have a VCF file with these column headers:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BS_25YES2E3 BS_G5B6AD28 BS_QCGPE1ZX
A sample record from that VCF file:
chr1 10450 . T C 27.94 VQSRTrancheSNP99.90to100.00+ AC=1;AF=0.167;AN=6;BaseQRankSum=-1.676e+00;ClippingRankSum=0.789;DP=102;ExcessHet=4.7712;FS=4.868;MLEAC=1;MLEAF=0.167;MQ=34.67;MQRankSum=-1.084e+00;PG=0,0,0;QD=1.55;ReadPosRankSum=-2.169e+00;SOR=0.707;VQSLOD=-1.050e+01;culprit=MQ;ANN=C|upstream_gene_variant|MODIFIER|**DDX11L1**|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||1560|1||SNV|HGNC|HGNC:37102||||chr1:g.10450T>C,C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|||||||||||1419|1||SNV|HGNC|HGNC:37102|YES|||chr1:g.10450T>C,C|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||3954|-1||SNV|HGNC|HGNC:38034|YES|||chr1:g.10450T>C GT:AD:DP:FT:GQ:JL:JP:PL:PP 0/0:28,0:28:lowGQ:0:1:1:0,0,663:0,0,666 0/1:13,5:18:PASS:35:1:1:34,0,342:35,0,345 0/0:44,0:44:lowGQ:0:1:1:0,0,802:0,0,805
The portion in bold is what I want (DDX11L1). I want to sort the VCF file based on this sub-field. It is the SYMBOL sub-field of the ANN entry in the INFO column. The header metadata for ANN is:
##INFO=<ID=ANN,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|SIFT|HGVS_OFFSET|HGVSg">
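To make the target concrete, here is a minimal Python sketch of how SYMBOL could be located, assuming each annotation inside ANN is comma-separated and its sub-fields are pipe-separated in the order given by the Format list in the header line above. The function names `ann_field_names` and `symbols_from_info` are hypothetical, chosen only for illustration.

```python
import re

def ann_field_names(header_line):
    """Return the sub-field names listed after 'Format: ' in the ANN header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' list found in header line")
    return match.group(1).strip().split("|")

def symbols_from_info(info, symbol_index):
    """Return the SYMBOL of every annotation in one INFO string, in file order."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            symbols = []
            # Assumption: annotations are comma-separated, sub-fields pipe-separated.
            for annotation in entry[len("ANN="):].split(","):
                fields = annotation.split("|")
                symbols.append(fields[symbol_index] if len(fields) > symbol_index else "")
            return symbols
    return []  # record has no ANN entry
```

With the header above, `ann_field_names(header).index("SYMBOL")` gives the position to pass as `symbol_index`, so the position is read from the file rather than hard-coded.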
Any help would be great. My end goal is to collapse variants by gene, so if there is a simpler way of doing that, it would be great too.
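For reference, the kind of result I am after could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the real file is tab-separated, SYMBOL is the 4th sub-field of each ANN annotation (as in the Format list above), and sorting uses only the first annotation's SYMBOL. The names `SYMBOL_INDEX`, `symbols`, `sort_by_symbol`, and `collapse_by_gene` are hypothetical.

```python
from collections import defaultdict

SYMBOL_INDEX = 3  # assumption: SYMBOL is the 4th ANN sub-field, per the header

def symbols(info):
    """Non-empty gene symbols from every ANN annotation in one INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            found = []
            for annotation in entry[len("ANN="):].split(","):
                fields = annotation.split("|")
                if len(fields) > SYMBOL_INDEX and fields[SYMBOL_INDEX]:
                    found.append(fields[SYMBOL_INDEX])
            return found
    return []

def sort_by_symbol(lines):
    """Header lines first, then records ordered by their first SYMBOL."""
    header = [l for l in lines if l.startswith("#")]
    records = [l for l in lines if l.strip() and not l.startswith("#")]

    def key(line):
        found = symbols(line.rstrip("\n").split("\t")[7])  # column 8 is INFO
        return found[0] if found else ""  # records without a symbol sort first

    return header + sorted(records, key=key)

def collapse_by_gene(lines):
    """Map each gene symbol to the (CHROM, POS, REF, ALT) of its variants."""
    genes = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A variant annotated against several genes is listed under each one.
        for symbol in sorted(set(symbols(cols[7]))):
            genes[symbol].append((cols[0], int(cols[1]), cols[3], cols[4]))
    return dict(genes)
```

Both functions take the file's lines as a list, for example from `open(path).readlines()`. Note that a file sorted this way is no longer in coordinate order, so the grouped dictionary from `collapse_by_gene` may be the more useful output for collapsing variants by gene.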