I have a pipeline written in Hail 0.1 for VCF processing:
https://github.com/macarthur-lab/hail-elasticsearch-pipelines
The pipeline can't process VCF files that have 'nan' values in fields of type 'Float'. As a workaround, I made a fake header that declares those fields as 'String' instead of 'Float' and used it to load the VCF into the pipeline:
https://discuss.hail.is/t/vds-summarize-report-error-in-hail-0-1/562/7
Now I need a way to automate this. I want a script that loads a VCF, finds any Float fields that contain 'nan' values, and changes the types of those fields to String. Are there any good tools for this kind of VCF parsing and modification? The only approach I can think of so far is to call the Hail 0.1 methods that raise the exception (such as 'summarize()') in a loop and, after each failure, use regular-expression matching to change the offending field from Float to String. That sounds too complex; I would expect an easier solution to exist.
I am wondering, what is ATT? I need to somehow check the whole file, find 'nan', check the type of the field it occurs in, and then change that type to String. Another way, maybe much simpler, would be to replace 'nan' with zero, but is that correct? Maybe it is; I just want to make sure it would not make downstream calculations wrong.
If by 'ATT' you mean any attribute name, then I need to substitute it somehow to make this general, since I do not know beforehand which attributes will contain 'nan'.
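The general check described above (scan the whole file, find which Float fields ever hold 'nan', retype only those) could be sketched in plain Python without Hail. This is a minimal sketch under assumptions, not a tested tool: the function names are hypothetical, the VCF is assumed to be uncompressed text small enough to hold as a list of lines, and header lines are assumed to follow the usual `##INFO=<ID=...,Number=...,Type=Float,...>` layout.

```python
import re

# Matches header lines such as ##INFO=<ID=AF,Number=A,Type=Float,Description="...">
HEADER_RE = re.compile(r'^##(INFO|FORMAT)=<ID=([^,]+),(.*)Type=Float\b')

def float_fields(header_lines):
    """Return the set of (kind, id) pairs declared as Type=Float."""
    found = set()
    for line in header_lines:
        m = HEADER_RE.match(line)
        if m:
            found.add((m.group(1), m.group(2)))
    return found

def has_nan(value):
    """True if any comma-separated element of the value is 'nan' (any case)."""
    return any(v.strip().lower() == "nan" for v in value.split(","))

def fields_with_nan(data_lines, candidates):
    """Return the Float fields from `candidates` that hold 'nan' in some record."""
    bad = set()
    for line in data_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue
        # INFO column: semicolon-separated KEY=VAL1,VAL2 entries
        for entry in cols[7].split(";"):
            key, sep, val = entry.partition("=")
            if sep and ("INFO", key) in candidates and has_nan(val):
                bad.add(("INFO", key))
        # FORMAT column (index 8) names the colon-separated per-sample keys
        if len(cols) > 9:
            keys = cols[8].split(":")
            for sample in cols[9:]:
                for key, val in zip(keys, sample.split(":")):
                    if ("FORMAT", key) in candidates and has_nan(val):
                        bad.add(("FORMAT", key))
    return bad

def retype_nan_fields(lines):
    """Return (patched lines, retyped fields); only the header lines change."""
    header = [l for l in lines if l.startswith("##")]
    data = [l for l in lines if not l.startswith("#")]
    bad = fields_with_nan(data, float_fields(header))
    patched = []
    for line in lines:
        m = HEADER_RE.match(line)
        if m and (m.group(1), m.group(2)) in bad:
            line = line.replace("Type=Float", "Type=String", 1)
        patched.append(line)
    return patched, bad

# Tiny made-up example; the AF values mirror the error message in this thread.
lines = [
    '##fileformat=VCFv4.2\n',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">\n',
    '##INFO=<ID=QD,Number=1,Type=Float,Description="Quality by depth">\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n',
    '1\t100\t.\tA\tC,G\t50\tPASS\tAF=1.618,nan;QD=2.5\n',
]
patched, bad = retype_nan_fields(lines)
```

Because only the header is rewritten, the patched lines could serve as the automatically generated "fake header" file, with no need to list the affected attributes beforehand.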
Still getting an error:
unable to convert [1.618, nan] (of class java.util.ArrayList) to Array[Double]
Probably need to add [...] somehow
I am thinking of just using
sed 's/nan/0/g'
but I am afraid that if 'nan' occurs as part of some name, it will be substituted by mistake.
I tried making it match the other pattern but it is not working:
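
On the worry about 'nan' inside a name: one way to avoid accidental substitution is to replace 'nan' only when it is an entire value in the INFO or per-sample columns. The following is a minimal Python sketch of that idea, not a verified fix; the function name and the default replacement of "0" (taken from the sed command above) are assumptions. Whether 0 is an appropriate stand-in is a separate question, since 0 is a real measurement while 'nan' means undefined; "." is the VCF marker for a missing value and can be passed instead.

```python
import re

# In INFO, a value follows '=' or ',' and ends at ',', ';' or end of column.
INFO_NAN = re.compile(r'(?<=[=,])nan(?=[,;]|$)', re.IGNORECASE)
# In a sample column, a value starts the column or follows ':' or ','.
SAMPLE_NAN = re.compile(r'(?<![^:,])nan(?=[:,]|$)', re.IGNORECASE)

def replace_nan_values(line, replacement="0"):
    """Replace whole-value 'nan' entries in INFO and sample columns only."""
    if line.startswith("#"):
        return line  # header lines are left untouched
    body = line.rstrip("\n")
    ending = line[len(body):]  # keep the original line ending
    cols = body.split("\t")
    if len(cols) > 7:
        cols[7] = INFO_NAN.sub(replacement, cols[7])
    # Index 8 is the FORMAT column (key names only), so it is skipped.
    for i in range(9, len(cols)):
        cols[i] = SAMPLE_NAN.sub(replacement, cols[i])
    return "\t".join(cols) + ending

# Made-up record: 'nan' appears as a value, inside names, and in the ID column.
record = ('1\t100\tnan_id\tA\tC,G\t50\tPASS\t'
          'AF=1.618,nan;NAME=banana;nanKey=1\tGT:GQ\t0/1:nan\n')
fixed = replace_nan_values(record)
```

In this example only the second AF element and the GQ sample value are changed; 'banana', 'nanKey', and the ID 'nan_id' are left alone because 'nan' is not a complete value there.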