When calling somatic variants in ONT tumour data, are there grounds for using matched short-read (Illumina) WGS from normal tissue (blood) to remove germline variants when matched long-read normal sequencing isn't available?
Leading on from this, given a wider cohort of short-read germline variant calls from many individuals, would it be sensible to build a larger panel of normals (PoN) to subtract from the long-read tumour variant calls?
The main motivation is cost: sequencing normal tissue again with long reads is expensive when matched short-read WGS (SR-WGS) is already available.
Most studies have matched tumour-normal pairs, but I have found an example where a panel of normals is constructed from a mixture of long- and short-read sequencing data.
I found another where SVs called from multiple sequencing technologies were filtered against 15 healthy genomes sequenced with PacBio.
Some tools, such as nanomonsv, include a panel-of-normals function, which makes use of 30 healthy ONT normal samples from the Human Pangenome Reference Consortium.
Then there is the question of population databases such as dbSNP and gnomAD, which, to my knowledge, are also not based on long-read data.
I like this question. Thank you for digging into this!
See also https://github.com/KolmogorovLab/Severus for another tool using tumor-normal pairs.
I guess you could use a PoN based on short reads, but it will inherently be incomplete, so you will always miss germline variants that are only detected with long reads.
What about using a healthy genome sequenced with long reads, e.g. HG002?
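
To make the subtraction idea concrete, here is a minimal Python sketch of filtering tumour calls against a matched normal plus a PoN. Everything in it is an assumption for illustration: the `Variant` tuple, `build_pon`, `subtract_germline` and the `min_samples` threshold are hypothetical names, not the method of nanomonsv, Severus or any other tool. It uses exact matching on (chrom, pos, ref, alt), which is only reasonable for small variants that have been normalised the same way in both call sets. SVs would need breakpoint-tolerant matching instead.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record (normalised SNV/indel)."""
    chrom: str
    pos: int  # 1-based position
    ref: str
    alt: str


def build_pon(normal_callsets: Iterable[Iterable[Variant]],
              min_samples: int = 2) -> Set[Variant]:
    """Keep variants seen in at least `min_samples` normal samples."""
    counts: Counter = Counter()
    for callset in normal_callsets:
        counts.update(set(callset))  # count each variant once per sample
    return {v for v, n in counts.items() if n >= min_samples}


def subtract_germline(tumour_calls: Iterable[Variant],
                      matched_normal: Iterable[Variant],
                      pon: Set[Variant]) -> List[Variant]:
    """Drop tumour calls found in the matched normal or in the PoN."""
    germline = set(matched_normal) | pon
    return [v for v in tumour_calls if v not in germline]


# Toy data, invented purely for illustration.
tumour = [
    Variant("chr1", 100, "A", "T"),
    Variant("chr1", 200, "G", "C"),
    Variant("chr2", 300, "T", "G"),
]
matched_normal = [Variant("chr1", 100, "A", "T")]
cohort = [
    [Variant("chr1", 200, "G", "C")],
    [Variant("chr1", 200, "G", "C"), Variant("chr2", 300, "T", "G")],
]

pon = build_pon(cohort, min_samples=2)
candidates = subtract_germline(tumour, matched_normal, pon)
print(candidates)
```

Note that a call surviving this filter is not necessarily somatic: anything the short-read normals could not detect stays in the list, which is exactly the incompleteness mentioned above.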