I am trying to align large number of sequences (~ 5 million) for mutation analysis. I have a reference sequence against which I need to align. Instead of performing multiple sequence alignment, which looks nigh impossible, can I perform pairwise alignments of each sequence with the reference sequence?
I have protein sequences in fasta format. They correspond to a viral protein whose length is around 1000 amino acids. To clarify, I do not need MSA here. I have the wild type sequence and want to align all sequences wrt it. I had tried MUSCLE previously but that did not work here
none of the 'classical' aligners will be able to do this (align protein to DNA). They all work with same type of sequences.
what I think you are looking for is protein-mappers (rather in the gene prediction area than aligner area), perhaps things like genewise, genomethreader, ... might be of use.
Did you try the new clustal omega ? it should be able to take a huge number of input sequences to perform alignment with.
As an alternative, you can try to do progressive alignment, as in: take the first 100 sequences, align those, get the result and use that to align the next 100 to it.
Tools such as t-coffee, muscle, emma, ... can also take a profile (==set of already aligned sequences) as input, as well as a fasta file.
for f in $(find /place/with/5mseqs -maxdepth 1 -type f -name "*.fasta"); do
alignProgram -input "$f" -reference /place/with/reference/genome.fasta -output ref_vs_$(basename "$f")
done
Actually this is exactly what I need. Since my ref is the WT protein, there is hardly any chance there will be any gaps in the ref sequence when aligned. The remaining sequences are equal or shorter in length to the ref sequence, so after alignment, the query sequences may incorporate 1 or more gaps in the alignment. In your example, the ref is shorter than seq 1-3. However the final output is what I'm looking for.
I'm not very proficient in Linux. Could you help me devise a code for this?
The final output should be in fasta as well:
Seq1
TAAAATGGGC
Seq2
TAAATGGGGC
Seq3
TAAAATGGGGC
Actually MUSCLE by default outputs the alignments in fasta format. All I have to do is pick the second alignment (query) and paste it to the final output.
Notice that we assume here that there are no linebreaks in any sequences. If there are, this will not work. If muscle creates linebreaks, this will not work. So first you need to linearize your input sequences and then if muscle creates linebreaks you also need to linearize alignment.fna before the tail command
If muscle doesn't create linebreaks you can skip creating alignment.fna, i.e.:
Oh also here I assume that muscle doesn't change the order of input sequences. This was the case with the example data. I don't know if it's the case with your data. Edit. Actually, muscle can change input order. It used to have an option -stable which prevented this, but it doesn't currently have that. So you are probably better off creating alignment.fna and then parsing your non-reference sequence from that with e.g. awk to the finalAlignment.fna file..
technically this will work indeed, but I fear a bit it might not make biological sense in the end. Final result will greatly depend on the order in which you align them.
Perhaps first calculating a distance matrix and then align them using the distance info might lead to better final result, but still ...
I was actually doing something very similar myself recently and settled on pairwise blastp alignments with the reference sequence as the subject sequence. All 5 million of your query sequences can be in a single fasta file as input making it straight forward to run. One of the advantages of blastp over say MUSCLE or clustal is the use of the blosum matrix for aiding in alignment - only blastp gave the expected result for some harder to align areas of my proteins.
You also have a lot of control over the format of the output from blastp which can aid in the downstream analysis of the results.
Blast is NOT an alignment program/tool, it's a search tool! Moreover it is at best a local aligner while what you are looking for here is a global aligner (very different implementation than local)
And again, as said elsewhere here, doing progressive pairwise alignment will not always give you the correct final result.
Also the blosum argument is not valid as also the other aligners use such matrices to calculate their alignment.
I see your point, but in my particular use case blast was the best option and did provide mostly full-length pairwise alignments but in my case the proteins I was aligning were all closely related. You can adjust parameters to tailor the alignment results to meet your needs, but you are right to point out that Blast is a local aligner first and foremost.
what kind of sequences do you have? (long? short?) Why do you want/think you need to use MSA ? and lastly, what have you tried/looked for already?
(aka, provide more info to get more meaningful replies)
I have protein sequences in fasta format. They correspond to a viral protein whose length is around 1000 amino acids. To clarify, I do not need MSA here. I have the wild type sequence and want to align all sequences wrt it. I had tried MUSCLE previously but that did not work here
none of the 'classical' aligners will be able to do this (align protein to DNA). They all work with same type of sequences.
what I think you are looking for is protein-mappers (rather in the gene prediction area than aligner area), perhaps things like genewise, genomethreader, ... might be of use.
Actually by wild type sequence I meant protein sequence and not the DNA/RNA. Sorry for the confusion. I am indeed mapping protein to protein here.
ok, makes more sense indeed.
Did you try the new clustal omega ? it should be able to take a huge number of input sequences to perform alignment with.
As an alternative, you can try to do progressive alignment, as in: take the first 100 sequences, align those, get the result and use that to align the next 100 to it. Tools such as t-coffee, muscle, emma, ... can also take a profile (==set of already aligned sequences) as input, as well as a fasta file.