I have a set of short-read sequencing data for the 172 kb Epstein-Barr virus (EBV) genome. We successfully called our variants against a reference genome using GATK. The publication linked below, from a different population, compared its variants (also from short-read sequencing) to published, already assembled EBV genomes. Unfortunately, the raw short-read sequencing data for the majority of published EBV genomes are not available. I believe that, to get around this, the authors called variants for these published assembled genomes by comparing each of them to a single reference sequence. Is there a standard for this type of analysis? I realize we're taking those assemblies at face value, but I'm not sure what other options there are. I tried using minimap2, and many samples that I know have variants had none detected.
minimap2 -cx asm5 --cs ebv_ref_genome.fa ebv_assembled_genome.fa \
| sort -k6,6 -k8,8n \
| paftools.js call -f ebv_ref_genome.fa - > "${sample}.vcf"
Genomic and Transcriptomic Characterization of Natural Killer T Cell Lymphoma https://www.sciencedirect.com/science/article/pii/S1535610820300945?via%3Dihub
One could do a multiple genome alignment. EBV is about 170 kb, but since these are related genomes it should be feasible.
Yes, though it is a little unclear what the OP needs here; I am just going by the example shown, which uses a pairwise alignment and expects a VCF file.
When we align multiple sequences, it is not clear that the result would be a valid VCF, as there is no "reference" to speak of: the sequences are aligned relative to one another rather than to a single reference, as VCF requires.
I also found this tool, which may be of use:
https://github.com/sanger-pathogens/snp-sites
A particular sequence can be designated as "reference" in some MSA programs. MAFFT allows one to do this.
Interesting, I had not heard about this feature. It looks like something new, implemented for SARS-CoV-2 specifically. I have been running and combining multiple pairwise alignments to try to achieve the same thing.
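To make the "single reference" point above concrete, here is a minimal sketch of what reporting differences against a designated reference looks like once an alignment exists. It assumes an aligned FASTA file in which the first record is the reference and all records have the same aligned length. The function name `aln_to_snps` and the tab-separated output (sample, reference position, reference base, alternate base) are made up for illustration; this is not what snp-sites or paftools.js do internally, and it reports substitutions only (gaps and N are skipped).

```shell
# Hypothetical helper: list substitutions of every aligned sequence relative to
# the FIRST sequence of an aligned FASTA file (assumed to be the reference).
aln_to_snps() {
    awk '
    # Save the record read so far; record 1 is treated as the reference.
    function store() {
        if (name == "") return
        n++
        names[n] = name
        seqs[n] = toupper(seq)
    }
    /^>/ { store(); name = substr($1, 2); seq = ""; next }
    { seq = seq $0 }
    END {
        store()
        len = length(seqs[1])
        pos = 0
        for (i = 1; i <= len; i++) {
            r = substr(seqs[1], i, 1)
            # A gap in the reference is an insertion in some sample:
            # it has no reference coordinate, so skip the column.
            if (r == "-") continue
            pos++
            for (s = 2; s <= n; s++) {
                a = substr(seqs[s], i, 1)
                # Substitutions only; deletions and ambiguous N are ignored here.
                if (a != r && a != "-" && a != "N" && r != "N")
                    printf "%s\t%d\t%s\t%s\n", names[s], pos, r, a
            }
        }
    }' "$1"
}
```

Usage would be something like `aln_to_snps aligned.fa > snps.tsv`, where `aligned.fa` is a placeholder for a reference-anchored alignment. Positions are counted along the ungapped reference, which is the coordinate system a VCF expects. A real workflow would also need to handle indels and multi-allelic sites.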