4.4 years ago
amer_ghl
Hello everyone,
I am trying to align full-length coronavirus genomes. I have 1800 sequences, each about 30,000 nt (roughly 50 Mb in total). I tried the web servers for MAFFT, MUSCLE, and Clustal Omega, but they only work for small datasets (4 Mb at most); with anything larger, they crash.
I would be thankful for any recommendations or suggestions.
There aren't really any good options for large-scale multiple genome alignment; this is still something of an unsolved computational challenge.
That said, you could take a look at mugsy or LASTZ, which can handle larger data but, in my experience, make pretty crappy alignments. Several tens of kilobases is pretty much the limit for most tools.
What exactly is your end goal? There may be a simpler orthogonal way you could approach the task.
Thanks for your reply. I am trying to identify all the possible mutations in the genome of the coronavirus. I checked some similar studies, but their samples were way smaller than mine (around 300 genomes).
I just thought it would be better to explore the mutations on a bigger sample.
Any ideas?
Just map each one to the reference and find the mutations. No need for MSA.
We are talking about 1800 sequences, so how many individual alignments will you need? :)
Just choose one reference. This one is commonly used: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/ Compare every other sequence to it and call mutations. Should be a nice exercise.
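To make the "compare to the reference and call mutations" step concrete, here is a minimal sketch in Python. It assumes you already have each genome pairwise-aligned to the reference as two equal-length gapped strings (from whatever aligner you pick); the function name `call_mutations` and the tuple layout are just illustrative choices, not part of any tool mentioned here.

```python
def call_mutations(ref_aln, qry_aln):
    """List differences between a query and the reference.

    Both inputs are rows of one pairwise alignment (same length, '-' for gaps).
    Returns tuples of (reference position, kind, ref base, query base), with
    1-based reference coordinates. Insertions are reported at the reference
    position they follow. Query 'N' bases are skipped as uninformative.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")

    mutations = []
    ref_pos = 0  # number of reference bases consumed so far
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == q or q == "N":
            continue
        if r == "-":
            mutations.append((ref_pos, "ins", "-", q))
        elif q == "-":
            mutations.append((ref_pos, "del", r, "-"))
        else:
            mutations.append((ref_pos, "sub", r, q))
    return mutations


if __name__ == "__main__":
    # Toy example, not real coronavirus data.
    for m in call_mutations("ACG-TACGT", "ACGGTA-GA"):
        print(m)
```

Running this over all 1800 pairwise alignments and tallying the tuples gives a per-position mutation table without ever building an MSA.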
What Asaf said. Unless the genomes are really close, MSA will introduce alignment artefacts anyway. What you can do, relatively easily, is multiple pairwise alignment, e.g. with mummer or similar tools that others have suggested, and compare all the sequences to a particular reference.
Coronavirus genomes should be very close. Many may be redundant in sequence, so that number can be culled down to something smaller.
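Culling redundant sequences can be sketched roughly like this, assuming "redundant" means exactly identical sequence (case-insensitive) and that the records are already loaded as `(id, sequence)` pairs. The function name `collapse_identical` is made up for illustration; near-identical genomes would need a proper clustering tool instead.

```python
def collapse_identical(records):
    """Group sequence IDs by identical sequence.

    records: iterable of (seq_id, sequence) pairs.
    Returns a dict mapping each distinct upper-cased sequence to the list of
    IDs that share it, in input order.
    """
    groups = {}
    for seq_id, seq in records:
        groups.setdefault(seq.upper(), []).append(seq_id)
    return groups


if __name__ == "__main__":
    # Toy records, not real genomes.
    toy = [("a", "ACGT"), ("b", "acgt"), ("c", "ACGA")]
    groups = collapse_identical(toy)
    print(len(toy), "records collapse to", len(groups), "distinct sequences")
```

You would then align only one representative per group and copy its mutation calls to the other IDs in that group.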
You can give minimap2 a try, and also mummer.
minimap2 is not going to generate a multiple sequence alignment.
Yeah, didn't read the question through.
Not sure if it does multiple sequence alignment, but you could have a look at https://github.com/genotoul-bioinfo/dgenies
Two additional options. Not that you don't have many already.
This is designated as the RefSeq Genome in NCBI.
Another option:
Rob Lanfear has a repo up where he has already run the MSA https://github.com/roblanf/sarscov2phylo/
He uses MAFFT too in this script https://github.com/roblanf/sarscov2phylo/blob/master/scripts/global_profile_alignment.sh
Have you considered doing MSA on specific regions of interest? For example, doing MSA on each annotated protein-coding region in the reference genome across your 1800 genomes? I think the protein-coding regions would be smaller than the limit for most tools, and you can also run these MSAs in parallel.
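One way to pull out a region per genome before running the smaller MSAs, as a rough sketch: if each genome has been pairwise-aligned to the reference, the reference coordinates of a coding region can be mapped onto the query. This assumes equal-length gapped alignment strings and 1-based inclusive coordinates; `extract_region` is a hypothetical helper, and the coordinates in the example are toy values, not real annotation.

```python
def extract_region(ref_aln, qry_aln, start, end):
    """Return the query bases aligned to reference positions start..end.

    ref_aln, qry_aln: rows of one pairwise alignment ('-' for gaps).
    start, end: 1-based inclusive reference coordinates.
    Query insertions strictly inside the region are kept; insertions right
    before `start` or right after `end` are left out.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")

    out = []
    ref_pos = 0  # number of reference bases consumed so far
    for r, q in zip(ref_aln, qry_aln):
        if r != "-":
            ref_pos += 1
            inside = start <= ref_pos <= end
        else:
            inside = start <= ref_pos < end
        if inside and q != "-":
            out.append(q)
    return "".join(out)


if __name__ == "__main__":
    # Toy alignment: the query has one inserted G after reference position 3.
    print(extract_region("ACG-TACGT", "ACGGTA-GT", 3, 5))
```

The extracted sequences for one region across all genomes can then go to MAFFT or similar as a much smaller input, one job per region.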