Question

Sort a sub column within a column while keeping the feature (LINUX)

3.3 years ago

yash_verma • 0

I have a VCF file with these column headers:

#CHROM  POS     ID  REF   ALT   QUAL    FILTER  INFO    FORMAT     BS_25YES2E3  BS_G5B6AD28 BS_QCGPE1ZX

A sample feature within that VCF file:

chr1    10450   .   T   C   27.94   VQSRTrancheSNP99.90to100.00+    AC=1;AF=0.167;AN=6;BaseQRankSum=-1.676e+00;ClippingRankSum=0.789;DP=102;ExcessHet=4.7712;FS=4.868;MLEAC=1;MLEAF=0.167;MQ=34.67;MQRankSum=-1.084e+00;PG=0,0,0;QD=1.55;ReadPosRankSum=-2.169e+00;SOR=0.707;VQSLOD=-1.050e+01;culprit=MQ;ANN=C|upstream_gene_variant|MODIFIER|**DDX11L1**|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||1560|1||SNV|HGNC|HGNC:37102||||chr1:g.10450T>C,C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|||||||||||1419|1||SNV|HGNC|HGNC:37102|YES|||chr1:g.10450T>C,C|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||3954|-1||SNV|HGNC|HGNC:38034|YES|||chr1:g.10450T>C GT:AD:DP:FT:GQ:JL:JP:PL:PP  0/0:28,0:28:lowGQ:0:1:1:0,0,663:0,0,666 0/1:13,5:18:PASS:35:1:1:34,0,342:35,0,345   0/0:44,0:44:lowGQ:0:1:1:0,0,802:0,0,805

The portion in bold (DDX11L1) is what I want. I want to sort the VCF file by this sub-column, which is the SYMBOL subfield of the ANN annotation in the INFO field. The metadata for that INFO field is:

##INFO=<ID=ANN,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|SIFT|HGVS_OFFSET|HGVSg">

Literally any help would be great. My end goal is to collapse variants by gene, so if you know a simpler way of doing that, it would be great too.

ADD COMMENT • link updated 3.3 years ago by GokalpC ▴ 100 • written 3.3 years ago by yash_verma • 0

score 2 · Answer 1 · 2021-10-19

3.3 years ago

Istvan Albert 102k

The task you describe is a highly specialized, narrow application, so you are unlikely to find a tool that already does it.

Your best bet would be to write a simple parser in a programming language and do it yourself.

If you know a little programming it should be fairly straightforward.
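As a rough illustration of what such a parser could look like, here is a minimal Python sketch. It assumes the file is tab-delimited, that INFO is the eighth column, and that SYMBOL is the fourth `|`-separated field of each ANN entry (as in the header shown in the question). It sorts on the SYMBOL of the first ANN entry only, which is a simplification since one variant can carry several genes. The function names are made up for this example.

```python
def ann_symbol(info, symbol_index=3):
    """Return the SYMBOL of the first ANN entry in an INFO string, or ''."""
    for field in info.split(";"):
        if field.startswith("ANN="):
            # ANN holds comma-separated entries; each entry is '|'-separated.
            first_entry = field[len("ANN="):].split(",")[0]
            parts = first_entry.split("|")
            return parts[symbol_index] if len(parts) > symbol_index else ""
    return ""


def record_symbol(line):
    """Extract the sort key (gene symbol) from one tab-delimited VCF record."""
    columns = line.rstrip("\n").split("\t")
    # INFO is assumed to be the 8th column (index 7).
    return ann_symbol(columns[7]) if len(columns) > 7 else ""


def sort_vcf_lines(lines):
    """Keep header lines on top, then records sorted by gene symbol.

    Python's sort is stable, so records that share a symbol keep
    their original relative order.
    """
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]
    records.sort(key=record_symbol)
    return header + records
```

To use it, read the lines of the VCF into a list (for example with `open(path).readlines()`), pass them to `sort_vcf_lines`, and write the result back out. Note that the output is no longer sorted by position, so tools that expect a coordinate-sorted VCF may reject it.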


score 0 · Answer 2 · 2021-10-19

3.3 years ago

GokalpC ▴ 100

You may want to use the bcftools +split-vep plugin, or something similar, to convert your VCF to a tabular format. You can then use the Linux sort command to sort on any column you want.
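Once the data is tabular, collapsing variants by gene is a short grouping step. The Python sketch below assumes a hypothetical tab-delimited layout of CHROM, POS, REF, ALT, SYMBOL (the real column order depends on how you do the conversion), and the function name is made up for this example.

```python
from collections import defaultdict


def collapse_by_gene(tsv_text, symbol_column=4):
    """Group variants by gene symbol from a tab-delimited table.

    Assumed (hypothetical) column order: CHROM, POS, REF, ALT, SYMBOL.
    Returns a dict mapping each symbol to its list of variants,
    with the symbols in sorted order.
    """
    genes = defaultdict(list)
    for line in tsv_text.splitlines():
        # Skip blank lines and any header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        row = line.split("\t")
        if len(row) <= symbol_column:
            continue  # skip rows that lack the symbol column
        chrom, pos, ref, alt = row[:4]
        genes[row[symbol_column]].append(f"{chrom}:{pos}{ref}>{alt}")
    return dict(sorted(genes.items()))
```

Rows with an empty SYMBOL end up grouped under the empty string, so you may want to filter those out depending on what you do next.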
