Hey all,
I am a beginner in awk programming and I am trying to extract only certain strings from the "INFO" column of a ClinVar VCF file. To do this, I printed only $8 (the "INFO" column) from the ClinVar VCF file and replaced the semicolons with tabs:
awk -F '\t' '{print $8}' test.vcf | tr ';' '\t' > trial1.vcf
And the output vcf file (trial1.vcf) looks something like this:
CLNDISDB=MedGen:CN517202 CLNDN=not_provided CLNHGVS=NC_000001.10:g.879375C>T
CLNSIG=Conflicting_interpretations_of_pathogenicity CLNSIGCONF=Benign(1),Likely_benign(2),Uncertain_significance(1)
Now I would like to search all the columns of each line for the elements starting with "CLNDN", "CLNSIG", "CLNSIGCONF" and "ORIGIN", and write them with awk into $1, $2, $3 and $4 of a new file. This task is part of my training and I have been working on it for the last week, but I couldn't come up with a solution. I would really appreciate it if you could show me a way to do this.
Thank you very much!
Thank you very much for your answer, it helped me a lot!
I have one more question: how can I write each value into a specific column of the output file? For example, I want CLNDISDB in column 1 of the output file for every line, CLNHGVS in column 2, and so on.
Split the INFO field on semicolons and loop over the resulting KEY=VALUE entries, keeping the keys you want.
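A minimal sketch of that loop, not a definitive solution. It assumes INFO is column 8 of a tab-separated VCF and that header lines start with "#". The sample record is made up from the values quoted in the question (the ORIGIN=1 entry and the first seven columns are placeholders), and the output file name extracted.tsv is arbitrary. The order of the names in `keys` decides which output column each value lands in, so changing it to 'CLNDISDB,CLNHGVS' would put CLNDISDB in column 1 and CLNHGVS in column 2. A key that is missing from a line is written as "." so the columns stay aligned.

```shell
# Hypothetical one-record sample standing in for the real ClinVar file.
printf '1\t879375\t.\tC\tT\t.\t.\tCLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.879375C>T;CLNSIG=Conflicting_interpretations_of_pathogenicity;CLNSIGCONF=Benign(1),Likely_benign(2),Uncertain_significance(1);ORIGIN=1\n' > test.vcf

awk -F '\t' -v OFS='\t' -v keys='CLNDN,CLNSIG,CLNSIGCONF,ORIGIN' '
BEGIN { n = split(keys, want, ",") }    # wanted keys, in output-column order
/^#/  { next }                          # skip VCF header lines
{
    split("", val)                      # forget the values of the previous line
    m = split($8, pair, ";")            # INFO is a ";"-separated list of KEY=VALUE
    for (i = 1; i <= m; i++) {
        eq = index(pair[i], "=")        # entries without "=" (flags) are ignored
        if (eq > 0)
            val[substr(pair[i], 1, eq - 1)] = substr(pair[i], eq + 1)
    }
    line = ""
    for (j = 1; j <= n; j++) {          # one output column per wanted key
        v = (want[j] in val) ? val[want[j]] : "."
        line = line (j > 1 ? OFS : "") v
    }
    print line
}' test.vcf > extracted.tsv
```

Because the script reads $8 directly from the VCF, the intermediate trial1.vcf step is not needed. Only the value is written to each column; to keep the "KEY=" prefix as well, store pair[i] itself instead of the substr() after the equals sign.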