awk filtering annovar txt files
9.6 years ago
basalganglia ▴ 40

Hello everyone,

I have an ANNOVAR result file, a.hg19_multianno.txt, and I want to keep only the variants of the types listed below. These annotations are found in column 24. My awk code below is not working.

awk -F "\t" '{
  if (($24=="disruptive_inframe_deletion" || $24=="disruptive_inframe_insertion" || $24=="exon_loss_variant" || $24=="frameshift_variant" || $24=="frameshift_variant+start_lost" || $24=="frameshift_variant+stop_gained" || $24=="frameshift_variant+stop_lost" || $24=="inframe_deletion" || $24=="inframe_insertion" || $24=="initiator_codon_variant" || $24=="missense_variant" || $24=="splice_acceptor_variant" || $24=="splice_donor_variant" || $24=="splice_region_variant" || $24=="start_lost" || $24=="start_lost+inframe_deletion" || $24=="stop_gained" || $24=="stop_gained+disruptive_inframe_deletion" || $24=="stop_gained+disruptive_inframe_insertion" || $24=="stop_gained+inframe_insertion" || $24=="stop_lost" || $24=="stop_lost+disruptive_inframe_deletion" || $24=="stop_lost+inframe_deletion" || $24=="stop_retained_variant" || $24=="TF_binding_site_variant"))
    print
}'

Can anyone help me?

Thanks!

awk annovar • 4.3k views

Can you paste a line from the text file containing one of these variants?


It cannot work


Paste a line of your input file where you're having problems, please.

Chr     Start   End     Ref     Alt     ExAC_ALL        ExAC_AFR        ExAC_AMR        ExAC_EAS        ExAC_FIN        ExAC_NFE        ExAC_OTH        ExAC_SAS        Otherinfo
1       12783   12783   G       A       .       .       .       .       .       .       .       .       0.5     881.62  27      1       12783   .       G       A       881.62  .       ABHet=0.279;ABHom=0.689;AC=33;AF=0.786;AN=42;BaseQRankSum=2.245;DP=1005;Dels=0.00;FS=0.000;HaplotypeScore=0.1330;InbreedingCoeff=0.0782;MLEAC=33;MLEAF=0.786;MQ=5.42;MQ0=949;MQRankSum=-0.409;OND=0.293;QD=1.77;ReadPosRankSum=-0.211;ANN=A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000438504|unprocessed_pseudogene||n.*1783C>T|||||1580|,A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000541675|unprocessed_pseudogene||n.*1416C>T|||||1580|,A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000423562|unprocessed_pseudogene||n.*1669C>T|||||1580|,A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000488147|unprocessed_pseudogene||n.*1351C>T|||||1621|,A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000538476|unprocessed_pseudogene||n.*1583C>T|||||1628|,A|intron_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000456328|processed_transcript|2/2|n.468+62G>A||||||,A|intron_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000515242|transcribed_unprocessed_pseudogene|2/2|n.465+62G>A||||||,A|intron_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000518655|transcribed_unprocessed_pseudogene|2/3|n.481+62G>A||||||,A|intron_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000450305|transcribed_unprocessed_pseudogene|3/5|n.182+86G>A||||||   GT:AD:DP:GQ:PL  0/1:3,25:27:15:102,0,15

Please edit your question and add an example input line, correctly formatted, so that it can be read clearly. It also helps if you explain the problem. For example:

  • The awk code is not working because of error 'X' (and paste the error message into the question as well).
  • The awk code seems OK, but I'm not getting the expected output (in this case, explain what output you expect).

That way people can help you faster and are more likely to suggest the correct solution, rather than something else because of a misunderstanding.


So, in your example, do you expect to have that line in the output because there is a match? Which one is the match?


This is only the first line of my VCF. I know that other lines include the variants of interest.


Are you sure about $24?

 1    Chr
 2    Start
 3    End
 4    Ref
 5    Alt
 6    ExAC_ALL
 7    ExAC_AFR
 8    ExAC_AMR
 9    ExAC_EAS
10    ExAC_FIN
11    ExAC_NFE
12    ExAC_OTH
13    ExAC_SAS
14    Otherinfo
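
A quick way to check which column holds what is to let awk number the header fields itself. The sketch below is only an illustration: it builds a short, made-up tab-separated header as a stand-in for the real one and prints each field with its index. Counting the fields of the sample line posted above in the same way (assuming the gaps in the paste were tabs), field 24 looks like the whole INFO string beginning with ABHet=0.279 rather than a bare effect name, which would explain why an exact == comparison never matches.

```shell
# Hypothetical stand-in for the header line of a.hg19_multianno.txt (tab-separated).
header=$(printf 'Chr\tStart\tEnd\tRef\tAlt\tExAC_ALL\tOtherinfo')

# Print every header field with its column number; NF is the number of fields.
columns=$(printf '%s\n' "$header" | awk -F '\t' '{ for (i = 1; i <= NF; i++) print i, $i }')
printf '%s\n' "$columns"
```

On the real file, the same loop could be fed with the output of head -n 1 a.hg19_multianno.txt. Data rows may have more fields than the header has names, because everything after Otherinfo is appended without its own header.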

I have solved my problem using the following command from Jorge Amigo; michael.ante's command works too.

awk '/(disruptive_inframe_deletion|disruptive_inframe_insertion|exon_loss_variant|frameshift_variant|start_lost|stop_gained|stop_lost|inframe_deletion|inframe_insertion|initiator_codon_variant|missense_variant|splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_retained_variant|TF_binding_site_variant)/' a.hg19_multianno.txt

I am a beginner with awk, so I run into many problems :)

My next question is about keeping variants with ExAC values less than or equal to 0.02, and also keeping variants with unknown values ("."). I have written this code:

cat a.txt | awk '$6 <= "0.02"' | awk '$6 == "."' >

It does not work. How can I fix this?


When you have a new question, it's better to open a new post. If you want to keep asking about the same topic, either edit your original question or comment on it (I moved this new question to the comment section). I see you've already done so in "awk code for Exac MAF values", so it would be wise to edit or delete this comment.

PS: the answer is awk '$6<0.02' a.txt
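
For the follow-up as originally stated (ExAC_ALL at most 0.02, or unknown "."), a more explicit condition could look like the sketch below. It assumes a tab-separated file with ExAC_ALL in column 6, and the sample rows are made up for illustration. Two points it tries to show: piping one awk filter into another keeps only rows that pass both tests, so the two conditions are joined with || in a single call instead, and comparing against the quoted "0.02" makes awk compare strings rather than numbers.

```shell
# Made-up sample rows: a header plus four variants (column 6 = ExAC_ALL).
sample=$(printf 'Chr\tStart\tEnd\tRef\tAlt\tExAC_ALL\n1\t100\t100\tG\tA\t.\n1\t200\t200\tC\tT\t0.02\n1\t300\t300\tA\tG\t0.5\n1\t400\t400\tT\tC\t0.001')

# Keep the header (NR == 1), unknown frequencies ("."), and values <= 0.02.
# Adding 0 to $6 forces a numeric comparison.
filtered=$(printf '%s\n' "$sample" | awk -F '\t' 'NR == 1 || $6 == "." || ($6 + 0) <= 0.02')
printf '%s\n' "$filtered"
```

With these rows, the header and the variants at positions 100, 200 and 400 are kept, and the one at position 300 (0.5) is dropped.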

9.6 years ago

Since the information you're looking for can only appear in a single column (this time it is $24, but it could be any other if you add or remove annotations), I would suggest a simpler solution:

awk '/(disruptive_inframe_deletion|disruptive_inframe_insertion|exon_loss_variant|frameshift_variant|start_lost|stop_gained|stop_lost|inframe_deletion|inframe_insertion|initiator_codon_variant|missense_variant|splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_retained_variant|TF_binding_site_variant)/' a.hg19_multianno.txt

With this command you are looking for any occurrence of any of those strings anywhere in each line of the input file. It does not need to split each line by tabs, so you also save parsing time, which compensates for the pattern matching being applied to the entire line.

Note that since you are pattern matching, you don't need to list all the mentioned combinations, so I removed them from the pattern. I'm sure it can be condensed a little more, so go for it to save even more parsing time. Just make sure that all the desired strings are indeed covered by the pattern.
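
To illustrate the point about combinations: because each alternative is matched as a substring, a line annotated with stop_gained+inframe_insertion is already caught by the plain stop_gained alternative. The sketch below uses made-up two-column lines and a shortened pattern, purely as an illustration.

```shell
# Made-up lines standing in for annotated variants.
lines=$(printf 'var1\tstop_gained+inframe_insertion\nvar2\tintron_variant\nvar3\tmissense_variant\nvar4\tdownstream_gene_variant')

# Shortened version of the alternation; each alternative may match anywhere in the line.
hits=$(printf '%s\n' "$lines" | awk '/(stop_gained|missense_variant)/')
printf '%s\n' "$hits"
```

Here var1 and var3 are printed. The flip side of substring matching is that a term counts wherever it appears on the line, including other columns, so restricting the test to one field with $24 ~ /.../ remains an option if that matters.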


Thank you so much! It is so easy, and it works :)

9.6 years ago
michael.ante ★ 3.9k

I would replace each $24=="SOMESTRING" with match($24,/SOMESTRING/).

awk -F "\t" '{
  if (($24,/disruptive_inframe_deletion/) || ($24,/disruptive_inframe_insertion/) || ($24,/exon_loss_variant/) || ($24,/frameshift_variant/) || ($24,/frameshift_variant+start_lost/) || ($24,/frameshift_variant+stop_gained/) || ($24,/frameshift_variant+stop_lost/) || ($24,/inframe_deletion/) || ($24,/inframe_insertion/) || ($24,/initiator_codon_variant/) || ($24,/missense_variant/) || ($24,/splice_acceptor_variant/) || ($24,/splice_donor_variant/) || ($24,/splice_region_variant/) || ($24,/start_lost/) || ($24,/start_lost+inframe_deletion/) || ($24,/stop_gained/) || ($24,/stop_gained+disruptive_inframe_deletion/) || ($24,/stop_gained+disruptive_inframe_insertion/) || ($24,/stop_gained+inframe_insertion/) || ($24,/stop_lost/ || $24,/stop_lost+disruptive_inframe_deletion/) || ($24,/stop_lost+inframe_deletion/) || ($24,/stop_retained_variant/) || ($24,/TF_binding_site_variant/))
    print
}'

Is it like that?


Use

awk -F "\t" '{if (match($24,/disruptive_inframe_deletion/) || match($24,/disruptive_inframe_insertion/) || match...

You might also want to look at some AWK tutorials, since I'm not sure whether each underscore and plus character has to be escaped.
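
On the escaping question: in awk regular expressions the underscore is an ordinary character, while + is a quantifier (one or more of the preceding item), so a literal plus sign needs a backslash. A small sketch with a made-up value, also showing index() as an alternative that looks for a fixed substring:

```shell
value='frameshift_variant+stop_gained'

# Unescaped: "t+" means one or more "t", so the literal "+" in the value is never matched.
unescaped=$(printf '%s\n' "$value" | awk '{ print (($0 ~ /frameshift_variant+stop_gained/) ? "match" : "no match") }')

# Escaped: "\+" matches a literal plus sign.
escaped=$(printf '%s\n' "$value" | awk '{ print (($0 ~ /frameshift_variant\+stop_gained/) ? "match" : "no match") }')

# index() searches for a fixed substring, so nothing needs escaping.
literal=$(printf '%s\n' "$value" | awk '{ print ((index($0, "frameshift_variant+stop_gained") > 0) ? "match" : "no match") }')

printf '%s\n%s\n%s\n' "$unescaped" "$escaped" "$literal"
```

The same applies inside match($24, /.../): patterns without a plus sign, such as stop_gained, work unchanged, while the combined terms need \+ to be matched literally.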


Also, your command works too, thank you so much :)


It seems the field contains these strings but is not exactly equal to any of them, so use ~ with a regular expression, like:

awk '$24~ /stop_gained\+disruptive_inframe_deletion/' 1.t