Large-scale SARS-CoV-2 multiple sequence alignment
3.2 years ago
s_bio ▴ 10

Dear members,

I want to align all the complete SARS-CoV-2 genome sequences available to date (444,424 in NCBI Virus). I have tried command-line MUSCLE, but it cannot handle such a large number of sequences. I would appreciate it if anyone could let us know how to align this many sequences.

Multiple SARS-CoV-2 MSA Sequence Genome Alignment • 1.9k views

Many of those assemblies are identical to one another or are subsequences of longer assemblies. Quite a few also include degenerate bases, which you would have to deal with somehow. My suggestion is that you first remove all redundancy from the data set by clustering.
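As a rough illustration of the first step of that idea, the sketch below collapses exactly identical genomes by hashing each sequence. It is only an assumption about how one might start: the function names `read_fasta` and `unique_records` are made up for this example, and it does not handle subsequences or degenerate bases, for which a dedicated clustering tool would be needed.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def unique_records(
    records: Iterable[Tuple[str, str]],
    members: Dict[bytes, List[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield only the first record seen for each distinct sequence.

    `members` is filled in as a side effect: it maps a sequence digest to
    the headers of all records sharing that sequence, so collapsed
    genomes can be traced back later.
    """
    for header, seq in records:
        digest = hashlib.sha256(seq.encode("ascii")).digest()
        if digest in members:
            members[digest].append(header)  # exact duplicate: record it, skip it
        else:
            members[digest] = [header]
            yield header, seq
```

Only digests and headers are kept in memory, so the input can be streamed from an open file handle, e.g. `unique_records(read_fasta(open("genomes.fasta")), members)` (the file name is hypothetical).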


In addition, there may be a poly-A tail at the end, and its length may vary. For many genomes the sequence ends like this:

>OU534909.1 Severe acute respiratory syndrome coronavirus 2 genome assembly, complete genome: monopartite [29797:29896]
TAATGTGTAAAATTAATTTTAGTAGTGCTATCCNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
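One possible way to handle such endings before alignment is to trim the trailing run of A's together with any N's directly in front of it. The sketch below is an assumption, not an established procedure: the function name `trim_polya_tail` and the `min_run` threshold of 10 are invented for this example, and the trim cannot tell genuine terminal A's of the genome from the tail.

```python
import re


def trim_polya_tail(seq: str, min_run: int = 10) -> str:
    """Remove a trailing poly-A run (at least `min_run` A's) and any N's
    that immediately precede it. Shorter A runs are left untouched."""
    # \Z anchors at the very end of the string; N* absorbs the masked
    # bases that sit between the genome proper and the poly-A tail.
    pattern = re.compile(r"N*A{%d,}\Z" % min_run, re.IGNORECASE)
    return pattern.sub("", seq, count=1)
```

Applied to the 100-base fragment shown above, this would leave only the first 33 bases, ending in `...GCTATCC`.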

Thank you for your reply. We intend to do that, but only after the alignment is done.


Use a high BLOSUM matrix for the alignment.


Thank you for the suggestion. I will certainly try a high BLOSUM matrix.

3.2 years ago
cfos4698 ★ 1.1k

nextalign is able to handle large numbers of sequences and produces a robust result: https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextalign-cli.html

Another strategy is to align each sequence against the SARS-CoV-2 reference genome one by one, then combine the results. Check out the methods at https://github.com/roblanf/sarscov2phylo
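The combining step works because every pairwise alignment shares the same reference: if columns where the reference has a gap are dropped, each query ends up in reference coordinates and the rows can simply be stacked. The sketch below illustrates that idea only; it is not necessarily how the linked repository does it, and the function names are made up for this example. Note that insertions relative to the reference are discarded.

```python
from typing import Dict, Iterable, Tuple


def project_onto_reference(aligned_ref: str, aligned_query: str) -> str:
    """Keep only the alignment columns where the reference has a base.

    Both inputs are rows of one pairwise alignment, with '-' for gaps.
    The result has exactly as many characters as the ungapped reference.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("alignment rows must have equal length")
    return "".join(q for r, q in zip(aligned_ref, aligned_query) if r != "-")


def combine_pairwise(
    alignments: Iterable[Tuple[str, str, str]]
) -> Dict[str, str]:
    """Build a reference-anchored MSA from (name, aligned_ref, aligned_query)
    triples, checking that every row has the same length."""
    rows = {
        name: project_onto_reference(ref, qry) for name, ref, qry in alignments
    }
    if len({len(row) for row in rows.values()}) > 1:
        raise ValueError("pairwise alignments used different references")
    return rows
```

Because each genome is processed independently, the pairwise step parallelises trivially and its cost grows linearly with the number of sequences.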

You can also reverse-engineer the method used by pangolin. The pangolin method maps genomes against the SARS-CoV-2 reference genome using minimap2, then converts the SAM-format mapping output to a multi-FASTA file using gofasta. Check out the 'align_to_reference' rule here: https://github.com/cov-lineages/pangolin/blob/master/pangolin/scripts/pangolearn.smk
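To make the SAM-to-multi-FASTA conversion less of a black box, here is a minimal sketch of the underlying idea: walk the CIGAR string of a mapped genome and write its bases into reference coordinates. This is not gofasta's implementation; the function name `sam_record_to_row` is hypothetical, and padding unaligned ends with `-` is an assumption (a real tool may pad or trim differently).

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_record_to_row(pos: int, cigar: str, seq: str, ref_len: int) -> str:
    """Turn one SAM record into a row of length `ref_len`.

    pos   -- 1-based leftmost mapping position (SAM POS field)
    cigar -- CIGAR string (SAM CIGAR field)
    seq   -- read/genome sequence (SAM SEQ field)
    """
    row = ["-"] * (pos - 1)  # pad up to the first aligned reference base
    i = 0                    # current index into seq
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":      # consumes query and reference: copy bases
            row.extend(seq[i:i + n])
            i += n
        elif op in "DN":     # consumes reference only: gap in the query
            row.extend("-" * n)
        elif op in "IS":     # consumes query only: insertion or soft clip, dropped
            i += n
        # H and P consume neither query nor reference
    row.extend("-" * (ref_len - len(row)))  # pad to the reference end
    return "".join(row[:ref_len])
```

Every row produced this way has the same length as the reference, so writing them out under their FASTA headers yields the multiple alignment directly.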
