Question

Large-Scale SARS-CoV-2 Multiple Sequence Alignment

0


3.2 years ago

s_bio ▴ 10

Dear members,

I want to align all the complete SARS-CoV-2 genome sequences available to date (444,424 in NCBI Virus). I tried command-line MUSCLE, but it cannot handle this many sequences. I would appreciate it if anyone could tell us how to align such a large number of sequences.

Multiple SARS-CoV-2 MSA Sequence Genome Alignment • 1.9k views


2


A large number of those assemblies are identical to one another or are subsequences of longer assemblies. Quite a few also include degenerate bases, which you would have to deal with somehow. My suggestion is that you first remove all redundancy from the data set by clustering.
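As a rough illustration of the first step, the sketch below collapses byte-identical genomes before any alignment is attempted. It is only a minimal example under stated assumptions: the function names `read_fasta` and `collapse_identical` are hypothetical, comparison is case-insensitive, and only exact duplicates are handled. Detecting subsequences of longer assemblies or near-identical genomes would still need a dedicated clustering tool.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(
    records: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str, List[str]]]:
    """Group records whose uppercased sequences are exactly identical.

    Returns one (representative_header, sequence, all_member_headers)
    tuple per distinct sequence, in order of first appearance.
    """
    members: Dict[str, List[str]] = {}
    representative: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        # Hash the sequence so that only digests are used as dictionary keys.
        digest = hashlib.sha1(seq.upper().encode()).hexdigest()
        members.setdefault(digest, []).append(header)
        representative.setdefault(digest, (header, seq))
    return [
        (representative[d][0], representative[d][1], members[d]) for d in members
    ]
```

Keeping the list of member headers makes it possible to expand the alignment back to the full set of genomes afterwards, since every duplicate shares the aligned row of its representative.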

3.2 years ago by 5heikki 11k

2


In addition, there may be a poly(A) tail at the end, and its length varies. For many genomes, the sequences end like this:

>OU534909.1 Severe acute respiratory syndrome coronavirus 2 genome assembly, complete genome: monopartite [29797:29896]
TAATGTGTAAAATTAATTTTAGTAGTGCTATCCNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
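
One possible way to normalise such ends before alignment is to strip the trailing run of `A` and `N` characters. The sketch below is an assumption-laden example, not an established protocol: the function name `trim_tail` and the `min_len` threshold of 10 are hypothetical choices, and the trim cannot distinguish genuine terminal genomic adenines from the poly(A) tail.

```python
def trim_tail(seq: str, min_len: int = 10) -> str:
    """Remove a trailing run of A/N characters if it is at least min_len long.

    Shorter runs are left untouched so that a genome that merely ends in a
    few adenines is not shortened.
    """
    stripped = seq.rstrip("ANan")
    if len(seq) - len(stripped) >= min_len:
        return stripped
    return seq
```

Applied to the record above, this would remove both the run of `N` characters and the poly(A) stretch, leaving a sequence that ends in `...GTGCTATCC`.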

3.2 years ago by Istvan Albert 102k

0


Thank you for your reply. We intend to do that, but only after the alignment is finished.

3.2 years ago by s_bio ▴ 10

1


Use a high BLOSUM matrix for the alignment.

3.2 years ago by cpad0112 21k

0


Thank you for the suggestion. We will certainly try a high BLOSUM matrix.

3.2 years ago by s_bio ▴ 10

score 3 · Accepted Answer · 2021-09-30

nextalign is able to handle large numbers of sequences and produces a robust result: https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextalign-cli.html

Another strategy is to align each sequence against the SARS-CoV-2 reference genome one by one, then combine the results. Check out the methods at https://github.com/roblanf/sarscov2phylo
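
To show why one-by-one alignments can be combined at all, here is a conceptual sketch. It is not the code from that repository. It assumes each genome has already been pairwise aligned to the same reference by some external aligner, and that the result is available as two equal-length gapped strings. The names `project_onto_reference` and `combine` are hypothetical. Columns where the reference has a gap (insertions in the query) are dropped, so every row ends up with exactly the reference length.

```python
from typing import Iterable, List, Tuple


def project_onto_reference(ref_aln: str, query_aln: str) -> str:
    """Keep only the alignment columns where the reference has a base."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("pairwise alignment rows must have equal length")
    return "".join(q for r, q in zip(ref_aln, query_aln) if r != "-")


def combine(
    reference: str, pairwise: Iterable[Tuple[str, str, str]]
) -> List[Tuple[str, str]]:
    """Stack projected rows into a reference-anchored multiple alignment.

    pairwise yields (name, gapped_reference, gapped_query) tuples.
    """
    rows = [("reference", reference)]
    for name, ref_aln, query_aln in pairwise:
        row = project_onto_reference(ref_aln, query_aln)
        if len(row) != len(reference):
            raise ValueError(f"{name}: projected row does not span the reference")
        rows.append((name, row))
    return rows
```

The trade-off is that insertions relative to the reference are discarded. In exchange, each genome is processed independently, so the work scales linearly with the number of sequences and parallelises trivially.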

You can also reverse-engineer the method used by pangolin, which maps genomes against the SARS-CoV-2 reference genome using minimap2, then converts the SAM-format mapping file to a multi-FASTA file using gofasta. Check out the 'align_to_reference' rule here: https://github.com/cov-lineages/pangolin/blob/master/pangolin/scripts/pangolearn.smk
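
The SAM-to-multi-FASTA step can be understood as replaying each record's CIGAR string in reference coordinates. The sketch below illustrates that idea only. It is not gofasta's implementation, and the function name `sam_record_to_row` and the choice of `N` as the padding character for uncovered flanks are assumptions made for this example.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_record_to_row(
    pos: int, cigar: str, seq: str, ref_len: int, pad: str = "N"
) -> str:
    """Convert one SAM record into a row of exactly ref_len characters.

    pos is the 1-based leftmost mapped reference position (SAM POS field).
    """
    row = [pad] * (pos - 1)  # reference bases before the mapping start
    i = 0  # current index into the read sequence
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":  # aligned bases: consume read and reference
            row.extend(seq[i:i + n])
            i += n
        elif op in "IS":  # insertion or soft clip: consume read only
            i += n
        elif op in "DN":  # deletion or skip: consume reference only
            row.extend("-" * n)
        # H (hard clip) and P (padding) consume neither read nor reference.
    row.extend(pad * (ref_len - len(row)))  # reference bases after the mapping
    return "".join(row[:ref_len])
```

For example, `sam_record_to_row(3, "2S3M1I2M2D1M", "TTACGTCAG", 12)` returns `NNACGCA--GNN`: the soft-clipped and inserted bases are dropped, the deletion becomes two gap characters, and both flanks are padded to the reference length.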