Question

Variant calls of published already assembled genomes

2.9 years ago

ebrier • 0

I have a set of short-read sequencing data for the ~172 kb Epstein-Barr virus (EBV) genome. We successfully called variants against a reference genome using GATK. A publication from a different population (linked below) compared its variants, also from short-read sequencing, with published, already assembled EBV genomes. Unfortunately, the raw short-read data for the majority of published EBV genomes are not available. I believe that, to get around this, the authors compared the published assemblies to a single reference sequence and derived variants from that comparison. Is there a standard for this type of analysis? I realize we would be taking those assemblies at face value, but I am not sure what other options there are. I tried using minimap2, but many samples that I know have variants had none detected.

minimap2 -cx asm5 --cs ebv_ref_genome.fa ebv_assembled_genome.fa \
    | sort -k6,6 -k8,8n \
    | paftools.js call -f ebv_ref_genome.fa - > "${sample}.vcf"

Genomic and Transcriptomic Characterization of Natural Killer T Cell Lymphoma https://www.sciencedirect.com/science/article/pii/S1535610820300945?via%3Dihub

minimap2 viralgenomics EBV genomeassembly variants • 1.2k views

ADD COMMENT • link updated 2.9 years ago by Istvan Albert 102k • written 2.9 years ago by ebrier • 0

score 0 · Answer 1 · 2022-01-06

2.9 years ago

Istvan Albert 102k

You can't call SNPs from a pairwise alignment. SNP callers expect multiple measurements at each position and use various statistical quantities to produce a call. Those quantities may then be used to filter the variants.

Long story short, SNP calling is the process of inferring a variant from a large number of short measurements.

What you seem to have is a single pairwise alignment; in that case you need to transform the pairwise alignment into variants. There are different ways to go about it.

One thing you could do is simulate perfect (error-free) short reads from each of your assembled genomes, then run variant calling on those simulated reads. That way you can reuse established variant-calling pipelines.
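
As a rough illustration of the read-simulation idea, here is a minimal Python sketch that tiles an assembled sequence with error-free, fixed-length reads and formats them as FASTQ records. The function name `simulate_perfect_reads`, the read length, the step size, and the constant quality string are all assumptions made for this example, not part of any particular tool; dedicated read simulators exist and may be a better choice in practice.

```python
def simulate_perfect_reads(name, seq, read_len=150, step=10):
    """Yield FASTQ records that tile `seq` with error-free reads.

    Hypothetical helper for illustration only. Reads are taken from the
    forward strand every `step` bases; one extra read is added so that
    the last base of the sequence is always covered.
    """
    seq = seq.upper()
    last_start = max(len(seq) - read_len, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the 3' end is covered
    for i, start in enumerate(starts):
        read = seq[start:start + read_len]
        qual = "I" * len(read)  # constant Phred 40 in Sanger (offset 33) encoding
        # Read name records the read index and its 1-based start position.
        yield f"@{name}_{i}_{start + 1}\n{read}\n+\n{qual}\n"
```

With `read_len=150` and `step=10`, each interior position is covered by about 15 reads, while the sequence ends get less coverage. The reads are forward-strand only and single-end, which some callers may treat unfavorably, so this is a starting point rather than a drop-in replacement for real data.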

Another approach would be to transform a pairwise alignment into VCF. I wrote a tool to do that, since I needed similar functionality:

https://www.bioinfo.help/bio-format.html
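
To make the alignment-to-VCF idea concrete, here is a minimal Python sketch, written independently of the tool linked above and not a description of how that tool works. It assumes the input is two gapped strings of equal length (reference first), treats gaps at either end of the alignment as unaligned sequence rather than as indels, and ignores mismatches involving `N`. The names `alignment_to_variants` and `write_vcf` are hypothetical.

```python
def alignment_to_variants(ref_aln, qry_aln, gap="-"):
    """Return (pos, ref, alt) tuples; pos is 1-based on the ungapped reference."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_aln, qry_aln = ref_aln.upper(), qry_aln.upper()
    variants = []
    ref_pos = 0      # reference bases consumed so far
    anchor = None    # (pos, base) of the last column where both rows have a base
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if r != gap and q != gap:
            ref_pos += 1
            if r != q and "N" not in (r, q):
                variants.append((ref_pos, r, q))  # substitution
            anchor = (ref_pos, r)
            i += 1
            continue
        # Collect a run of gap-containing columns as a single indel event.
        j = i
        while j < n and (ref_aln[j] == gap or qry_aln[j] == gap):
            j += 1
        ref_run = ref_aln[i:j].replace(gap, "")
        qry_run = qry_aln[i:j].replace(gap, "")
        # Skip runs touching either end of the alignment (no anchor base or
        # no aligned column after them); VCF indels need a preceding base.
        if anchor is not None and j < n and ref_run != qry_run:
            pos, base = anchor
            # The reference base is used as the anchor in both REF and ALT,
            # even if that position also carries a substitution.
            variants.append((pos, base + ref_run, base + qry_run))
        ref_pos += len(ref_run)
        i = j
    return variants


def write_vcf(variants, chrom, sample, out):
    """Write a minimal single-sample VCF (no quality or depth fields) to `out`."""
    out.write("##fileformat=VCFv4.2\n")
    out.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    out.write(f"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t{sample}\n")
    for pos, ref, alt in variants:
        out.write(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\tPASS\t.\tGT\t1\n")
```

Because there is only one observation per position, the records carry no quality or depth values; every difference in the alignment is reported as-is, so errors in the assembly or the alignment pass straight through to the VCF.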

ADD COMMENT • link 2.9 years ago by Istvan Albert 102k

One could do a multiple genome alignment. EBV is about 170 kb, but since these are related genomes it should be feasible.

ADD REPLY • link 2.9 years ago by GenoMax 147k

Yes, though it is a little unclear what the OP needs. I am just going by the example shown, which takes a pairwise alignment and expects a VCF file.

When we align multiple sequences, it is not clear that the result can be expressed as a valid VCF, since there is no "reference" to speak of: the sequences are aligned relative to one another rather than to a single reference, as VCF requires.

I also found this tool that may be of use:

https://github.com/sanger-pathogens/snp-sites

ADD REPLY • link 2.9 years ago by Istvan Albert 102k

A particular sequence can be designated as "reference" in some MSA programs. MAFFT allows one to do this.
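
Once one row of a multiple alignment is designated as the reference, alignment columns can be projected onto that row's coordinates. The following Python sketch illustrates this for substitutions only; it is not tied to MAFFT or any other aligner. It assumes the alignment is already loaded as a dict of equal-length gapped strings, and the function name `msa_snps` is hypothetical. Columns where the reference has a gap (insertions relative to the reference) are skipped because they have no reference coordinate.

```python
def msa_snps(msa, ref_name, gap="-"):
    """List SNP sites in reference coordinates.

    msa: dict mapping sequence name -> aligned (gapped) sequence.
    Returns a list of (ref_pos, ref_base, {sample: alt_base}) with
    1-based positions on the ungapped reference.
    """
    ref_aln = msa[ref_name].upper()
    if any(len(s) != len(ref_aln) for s in msa.values()):
        raise ValueError("all aligned sequences must have equal length")
    sites = []
    ref_pos = 0
    for col, ref_base in enumerate(ref_aln):
        if ref_base == gap:
            continue  # insertion relative to the reference: no coordinate
        ref_pos += 1
        if ref_base == "N":
            continue  # ambiguous reference base: nothing to compare against
        alts = {}
        for name, seq in msa.items():
            base = seq[col].upper()
            # Gaps and Ns in a sample are treated as missing data, not variants.
            if name != ref_name and base not in (ref_base, gap, "N"):
                alts[name] = base
        if alts:
            sites.append((ref_pos, ref_base, alts))
    return sites
```

Each returned site maps directly onto a multi-sample VCF row: samples listed in the dict carry the alternate allele, and the rest match the reference or have missing data at that column.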

ADD REPLY • link 2.9 years ago by GenoMax 147k

Interesting, I had not heard of this feature. It looks like something new, implemented for SARS-CoV-2 specifically. I have been running and combining multiple pairwise alignments to try to achieve the same thing.

ADD REPLY • link 2.9 years ago by Istvan Albert 102k