Extracting values with certain patterns by searching all columns for each line using "awk"
4.1 years ago
ClkElf ▴ 50

Hey all,

I am a beginner in awk and am trying to extract only certain strings from the "INFO" column of a ClinVar VCF file. To do this, I printed only $8 (the "INFO" column) from the ClinVar VCF file and replaced the semicolons with tabs:

awk -F '\t' '{print $8}' test.vcf | tr ';' '\t' > trial1.vcf

And the output vcf file (trial1.vcf) looks something like this:

CLNDISDB=MedGen:CN517202    CLNDN=not_provided    CLNHGVS=NC_000001.10:g.879375C>T
CLNSIG=Conflicting_interpretations_of_pathogenicity    CLNSIGCONF=Benign(1),Likely_benign(2),Uncertain_significance(1)

Now, I would like to extract the elements starting with "CLNDN", "CLNSIG", "CLNSIGCONF" and "ORIGIN" by searching all the columns, and write them into $1, $2, $3 and $4 of a new file for each line with awk. This task is part of my training and I have been working on it for the past week, but I couldn't come up with a solution. I would really appreciate it if you could show me a way to do this.

Thank you very much!

awk bash unix • 1.2k views
4.1 years ago

Awk is the wrong tool here. Use bcftools query -f '%INFO/CLNDN %INFO/CLNSIG\n' in.vcf

If you want to use awk, it would be something like:

awk -F '\t'  '/^[^#]/ {n=split($8,a,/[;]/);for(i=1;i<=n;i++) {if(a[i] ~ /^CLNDISDB=/ || a[i] ~ /^CLNHGVS=/) printf("%s ",a[i]); } printf("\n");}' in.vcf

Thank you very much for your answer, it helped me a lot!

I have one more question: how can I write each value into a specified column of the output file? For example, I want to write CLNDISDB in column 1 of the output file for each line, CLNHGVS in column 2, etc.


Search for each INFO key with a loop.
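A minimal sketch of that loop, under a few assumptions: the INFO field is column 8 of a tab-separated VCF, entries are separated by ";" and written as KEY=value, and a missing key should be printed as "." so the columns stay aligned. The function name info_columns and its comma-separated key list are hypothetical, introduced only for this illustration.

```shell
# Hypothetical helper: print the requested INFO keys as fixed, tab-separated columns.
# Usage: info_columns "KEY1,KEY2,..." [in.vcf]   (reads stdin if no file is given)
info_columns() {
    keys=$1
    shift
    awk -F '\t' -v keys="$keys" '
        BEGIN { OFS = "\t"; nk = split(keys, want, ",") }   # want[1..nk] = requested keys, in column order
        /^[^#]/ {                                            # skip header and empty lines
            split("", val)                                   # clear values left over from the previous record
            n = split($8, item, ";")                         # break INFO into KEY=value entries
            for (i = 1; i <= n; i++) {
                eq = index(item[i], "=")
                if (eq > 0)
                    val[substr(item[i], 1, eq - 1)] = substr(item[i], eq + 1)
            }
            line = ""
            for (k = 1; k <= nk; k++) {                      # emit the keys in the requested order
                v = (want[k] in val) ? val[want[k]] : "."    # "." marks a key absent from this record
                line = line (k > 1 ? OFS : "") v
            }
            print line
        }
    ' "$@"
}
```

With this, info_columns "CLNDISDB,CLNHGVS" in.vcf > out.tsv should put CLNDISDB in column 1 and CLNHGVS in column 2 for every record, and info_columns "CLNDN,CLNSIG,CLNSIGCONF,ORIGIN" test.vcf covers the four keys from the original question. Flag-style INFO entries without "=" are ignored in this sketch.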
