My software group has been required to write VCF files containing, among other things, regions of LOH (Loss of Heterozygosity) and copy number regions.
I would like to know how to do this in the most standard way, if there is one.
We know that copy number variants are usually described this way:
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
But we make copy number calls that are not always integer. For example "2.5" could indicate mosaic copy number where half of the sample has CN=2 and half has CN=3. (By itself, 2.5 is ambiguous. It could actually be any mixture that averages out to 2.5 copies. But this number represents all the information we have available.)
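To make the ambiguity concrete, here is a small Python sketch (the function name and the example mixtures are made up for illustration, not taken from any tool) showing two different cell mixtures that both average to 2.5 copies:

```python
def mean_copy_number(mixture):
    """Average copy number of a sample.

    mixture: list of (cell_fraction, integer_copy_number) pairs;
    the cell fractions are assumed to sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in mixture)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("cell fractions must sum to 1")
    return sum(fraction * cn for fraction, cn in mixture)


# Two hypothetical mixtures that are indistinguishable by their average alone.
half_cn2_half_cn3 = [(0.5, 2), (0.5, 3)]
mostly_cn2_some_cn4 = [(0.75, 2), (0.25, 4)]

print(mean_copy_number(half_cn2_half_cn3))
print(mean_copy_number(mostly_cn2_some_cn4))
```

Both calls give the same average, which is why a single non-integer value is all we can report.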
Can we change the FORMAT definition for CN to allow Type=Float? Or should we create a different FORMAT for CNM (CN Mosaic)? Is there a standard FORMAT for LOH regions?
More a thought than an answer: copy number is actually an integer; what matters here is the percentage of cells in which the change occurred. CN 2.5 is ambiguous, but if you look at the B-allele frequency you can usually resolve this to some extent (unless it is a tumor sample where several CNAs occur in the same region). If you do move to developing your own format, I'd keep the following fields: cell fraction in which the variant occurred, copy number of allele 1, copy number of allele 2.
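A rough Python sketch of that idea, under a simplifying assumption that is mine and not part of the comment: every cell without the variant is normal diploid and heterozygous (one copy of each allele). The function name and parameters are hypothetical. It derives the average copy number and the expected B-allele frequency from the three suggested fields:

```python
def expected_signal(cell_fraction, cn_allele_a, cn_allele_b):
    """Return (average copy number, expected B-allele frequency).

    Assumes the remaining (1 - cell_fraction) cells carry one copy
    of allele A and one copy of allele B.
    """
    normal_fraction = 1.0 - cell_fraction
    copies_a = normal_fraction * 1 + cell_fraction * cn_allele_a
    copies_b = normal_fraction * 1 + cell_fraction * cn_allele_b
    total = copies_a + copies_b
    return total, copies_b / total


# Gain of one A copy in half of the cells: average CN 2.5, BAF shifted to 0.4.
print(expected_signal(0.5, 2, 1))

# Balanced gain of both alleles in a quarter of the cells: average CN 2.5, BAF stays at 0.5.
print(expected_signal(0.25, 2, 2))
```

The two scenarios share the same average copy number but differ in B-allele frequency, which is the extra information that can help separate them. Some mixtures still coincide in both values, so this does not resolve every case.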
Unfortunately, we do not have access to information on the cell fractions with different copy numbers. We only have the average copy number, so I'm looking for a way to encode the information that we do have in VCF format. Specifically, I'm trying to determine whether redefining the "CN" FORMAT field as "Float" rather than "Integer" is acceptable, or whether creating a different FORMAT field such as "CNM" is a better idea.
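For concreteness, here is a minimal Python sketch of what the two options would look like as header lines. The helper names are hypothetical, "CNM" and both descriptions are only my proposals, and nothing here is meant to imply that either definition is endorsed by the VCF specification:

```python
def format_header_line(field_id, number, value_type, description):
    """Build a VCF ##FORMAT meta-information line."""
    return (
        f"##FORMAT=<ID={field_id},Number={number},"
        f'Type={value_type},Description="{description}">'
    )


def sample_columns(keys, values):
    """Build the FORMAT column and one sample column of a VCF record."""
    return ":".join(keys), ":".join(str(value) for value in values)


# Option 1: keep the CN key but declare it as Float.
cn_as_float = format_header_line(
    "CN", 1, "Float", "Average copy number across the sample"
)

# Option 2: a separate, hypothetical CNM key for the mosaic average.
cnm_field = format_header_line(
    "CNM", 1, "Float", "Average copy number, possibly mosaic"
)

print(cn_as_float)
print(cnm_field)
print(sample_columns(["GT", "CNM"], ["./.", 2.5]))
```

Either way, the per-sample value written into the record would be the same "2.5"; the only difference is the key and how downstream tools that expect an integer CN react to it.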