Hi, I'm using Annovar for its gene annotation capabilities and I need some help changing how my input is set up.
The complete command I run is:
./annotate_variation.pl -geneanno -buildver hg19 MyData.avinput humandb
Here is a small example portion of my data:
1 101814 101814 G T rs1231
1 1018940 1018940 T C rs546754564
1 1020131 1020131 A - rs234324
1 1032136 1032136 - T rs21313
1 1020514 1020514 T G rs645654
1 1022394 1022394 C G rs4356354
1 9023126 9023126 TA - rs4542342
1 10270690 10270690 CTCA - rs3275676
The first two variants are simple base substitutions, the third is a deletion of a single base, and the last two are deletions of multiple bases. All of the variants get annotated except the last two, which produce an invalid_input error. This is because the start and end positions of a deletion must span the length of the DNA being deleted, in this case 2 bp and 4 bp respectively.
To fix this, the last two lines would have to read
1 9023126 9023127 TA -
1 10270690 10270693 CTCA -
so that the end position properly reflects the length of the sequence being deleted.
The problem is that my mentor gave me the data in this erroneous format, and there are far too many variants like this to fix by hand. How might I do this computationally?
I know the pseudocode for such a problem would be:
1) check whether the row is a deletion by
a) checking whether the 5th column is a minus "-" character, and then
b) if (a) is true, checking whether the 4th column in that same row is a string of letters
If both are true, then
2) count how many letters are in the 4th column and call that value "n"
3) add n-1 to the value in column 3.
How might I carry this out computationally on UNIX? I'm still kind of a novice at bioinformatics, but I'm pretty decent with OOP in my coursework. Thanks.
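One way to carry out the steps above is a short Python script. This is only a minimal sketch under a few assumptions: the file is whitespace-delimited with no header line, columns 2 and 3 are integers, and tab-separated output is acceptable. The function names fix_deletion_end and fix_file are made up for illustration. It sets column 3 to start + n - 1, which is the same as adding n-1 to column 3 when start and end are equal (as in the example data), and it leaves rows that are already correct unchanged.

```python
def fix_deletion_end(line):
    """Return one avinput row with the end position corrected for deletions.

    Hypothetical helper: assumes whitespace-delimited columns in the order
    chrom, start, end, ref, alt, followed by optional extra columns.
    """
    fields = line.split()
    # Leave blank or short rows untouched.
    if len(fields) < 5:
        return line
    ref, alt = fields[3], fields[4]
    # Step 1: a deletion has "-" in column 5 and letters in column 4.
    if alt == "-" and ref.isalpha():
        n = len(ref)                    # step 2: number of deleted bases
        start = int(fields[1])
        fields[2] = str(start + n - 1)  # step 3: end = start + n - 1
    # Assumption: tab-separated output is fine for the downstream tool.
    return "\t".join(fields)


def fix_file(in_path, out_path):
    """Rewrite every row of in_path into out_path (both paths are up to you)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for raw in src:
            dst.write(fix_deletion_end(raw.rstrip("\n")) + "\n")


if __name__ == "__main__":
    # Small demonstration on two rows from the example data.
    for row in ["1 9023126 9023126 TA - rs4542342",
                "1 1020514 1020514 T G rs645654"]:
        print(fix_deletion_end(row))
```

To use it on a real file, you could call fix_file("MyData.avinput", "MyData.fixed.avinput") (the output name is just a placeholder) and then point annotate_variation.pl at the new file. Substitutions, insertions, and single-base deletions keep their original coordinates.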
Just what I needed, thanks.