Question

Extracting values with certain patterns by searching all columns for each line using "awk"

4.1 years ago

ClkElf ▴ 50

Hey all,

I am a beginner in awk programming and am trying to extract only certain strings from the "INFO" column of a ClinVar VCF file. To do this, I printed only $8 (the "INFO" column) from the ClinVar VCF file and replaced the semicolons with tabs:

awk -F '\t' '{print $8}' test.vcf | sed 's/;/    /g' > trial1.vcf

And the output vcf file (trial1.vcf) looks something like this:

CLNDISDB=MedGen:CN517202    CLNDN=not_provided    CLNHGVS=NC_000001.10:g.879375C>T
CLNSIG=Conflicting_interpretations_of_pathogenicity CLNSIGCONF=Benign(1),Likely_benign(2),Uncertain_significance(1)

Now I would like to extract the elements starting with "CLNDN", "CLNSIG", "CLNSIGCONF" and "ORIGIN" by searching all the columns of each line, and write them into $1, $2, $3 and $4 of a new file with awk. This task is part of my training and I have been working on it for the past week without finding a solution, so I would really appreciate it if you could show me a way to do this.

Thank you very much!

awk bash unix • 1.2k views

updated 4.1 years ago by Pierre Lindenbaum 164k • written 4.1 years ago by ClkElf ▴ 50

score 0 · Answer 1 · 2020-12-06

4.1 years ago

Pierre Lindenbaum 164k

Awk is the wrong tool here. Use bcftools query -f '%INFO/CLNDN %INFO/CLNSIG\n' in.vcf

If you want to use awk, it would be something like:

awk -F '\t'  '/^[^#]/ {n=split($8,a,/[;]/);for(i=1;i<=n;i++) {if(a[i] ~ /^CLNDISDB=/ || a[i] ~ /^CLNHGVS=/) printf("%s ",a[i]); } printf("\n");}' in.vcf

Thank you very much for your answer, it helped me a lot!

I have one more question: how can I write each value into a specific column of the output file? For example, I want CLNDISDB in column 1 of the output file for each line, CLNHGVS in column 2, and so on.
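
One possible way to get fixed columns, as a rough sketch rather than a definitive answer: instead of printing matches in the order they appear, store every KEY=value pair of the INFO field in an array indexed by key, then print the wanted keys in a fixed order. The function name extract_info, the file demo.vcf and its two records are made up for illustration. The sketch assumes a tab-separated VCF with INFO in column 8, prints only the value (without the "KEY=" prefix), and prints "." when a key is missing on a line.

```shell
# Hypothetical test file: a header line plus two invented records.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > demo.vcf
printf '1\t879375\t1\tC\tT\t.\t.\tCLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.879375C>T;CLNSIG=Benign;ORIGIN=1\n' >> demo.vcf
printf '1\t879400\t2\tG\tA\t.\t.\tCLNHGVS=NC_000001.10:g.879400G>A;CLNDISDB=MedGen:CN517202\n' >> demo.vcf

# extract_info KEYS FILE
#   KEYS: comma-separated INFO keys, in the desired output-column order
#   FILE: tab-separated VCF file
extract_info() {
  awk -F '\t' -v keys="$1" '
    BEGIN { OFS = "\t"; nk = split(keys, want, ",") }
    /^[^#]/ {
      split("", val)                        # forget values from the previous line
      n = split($8, a, ";")                 # one array element per INFO entry
      for (i = 1; i <= n; i++) {
        eq = index(a[i], "=")               # entries without "=" (flags) are skipped
        if (eq > 0) val[substr(a[i], 1, eq - 1)] = substr(a[i], eq + 1)
      }
      line = ""
      for (k = 1; k <= nk; k++) {           # fixed column order, "." if the key is absent
        v = (want[k] in val) ? val[want[k]] : "."
        line = line (k > 1 ? OFS : "") v
      }
      print line
    }
  ' "$2"
}

# CLNDISDB in column 1, CLNHGVS in column 2
extract_info CLNDISDB,CLNHGVS demo.vcf

# The four tags from the original question, in columns 1 to 4
extract_info CLNDN,CLNSIG,CLNSIGCONF,ORIGIN demo.vcf
```

Because the lookup is by key, the order of the entries inside INFO does not matter, and every output line has the same number of columns. Redirect the output (for example with > out.tsv) to write it to a new file.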