How can I create a single file (.fasta or .phylip) containing several genomes to build a phylogenetic tree?
7.4 years ago
kamel ▴ 70

Dear colleagues, I want to build a phylogenetic tree for several bacterial genomes, but alignment tools such as MUSCLE and ClustalW require a single FASTA file containing all of the genomes. How can I create a single file containing several genomes? I would also welcome any suggestion for producing a single .phylip file from multiple files (multiple genomes).

alignment blast software error Phylogeny • 5.3k views
Do you already have the genomes in separate FASTA files? If you do, or once you get them, use cat to concatenate the files.

I have already used cat to concatenate them, but it does not work, I think because each FASTA file I have contains several contigs. Do you have another suggestion for getting a PHYLIP file from these complete genomes, or another method for building the phylogenetic tree? Thank you.

7.4 years ago
Joe 21k

You'll need to show us some useful examples of your input data before we can help you very much.

If the problem is, as you say, that it fails because you're trying to make a multi-FASTA from existing multi-FASTAs, first concatenate all of the contigs within each genome:

$ cat genome1.fasta | sed '1!{/^>.*/d;}' > genome1_concatenated.fasta
$ cat genome2.fasta | sed '1!{/^>.*/d;}' > genome2_concatenated.fasta
.
.
.
$ cat genomeN.fasta | sed '1!{/^>.*/d;}' > genomeN_concatenated.fasta

(You can loop this if you have too many to handle). sed -i '1!{/^>.*/d;}' genome1.fasta will edit-in-place if you prefer to do that.

If you're interested, what this command is doing is saying:

On every line except the first (1!{...}), if the line begins with a ">" (^>), followed by any number of occurrences of any character (.*), delete that line (d). Since the first line of a FASTA file is its first header, only that header survives.

Hopefully it's obvious that this means all your sequences will now be under whatever fasta header the first sequence in that fasta had. You can edit this yourself if you want something else.
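The loop mentioned above could look something like the following sketch. It assumes the per-genome files are named *.fasta in the current directory and writes results to a concatenated/ subdirectory; that layout and the function name merge_contigs are hypothetical, not from this thread. Because it replaces the header with the file name, it uses grep -v '^>' to drop every header rather than the sed expression that keeps the first one.

```shell
#!/bin/sh
# Sketch: merge the contigs of every genome file into one record per genome.
# Assumptions (not from the thread): inputs are ./*.fasta, outputs go to
# ./concatenated/, and the file name (minus .fasta) becomes the new header.

merge_contigs() {
    # $1 = input multi-FASTA, $2 = label for the single output record
    printf '>%s\n' "$2"
    # Drop every header line so only sequence lines remain.
    grep -v '^>' "$1"
}

mkdir -p concatenated
for f in ./*.fasta; do
    [ -e "$f" ] || continue    # no matching files: skip the unexpanded glob
    name=$(basename "$f" .fasta)
    merge_contigs "$f" "$name" > "concatenated/${name}_concatenated.fasta"
done
```

Writing to a separate directory keeps the outputs from being picked up as inputs if the loop is run twice. With this layout, the final concatenation step would read from concatenated/*_concatenated.fasta.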

Then, concatenate the concatenated files:

$ cat *_concatenated.fasta > all_genomes.fasta
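Before aligning, it may be worth confirming that the combined file holds exactly one record per genome and that the lengths look plausible. A minimal sketch, assuming the combined file is called all_genomes.fasta as above (the function name fasta_lengths is made up for illustration):

```shell
#!/bin/sh
# Sketch: print each FASTA record's header and total sequence length,
# separated by a tab, one record per line.

fasta_lengths() {
    awk '
        /^>/ { if (seen) print name "\t" len
               seen = 1; name = substr($0, 2); len = 0; next }
        seen { gsub(/[ \t\r]/, ""); len += length($0) }   # ignore whitespace
        END  { if (seen) print name "\t" len }
    ' "$1"
}

# Only run the check if the combined file actually exists.
if [ -f all_genomes.fasta ]; then
    fasta_lengths all_genomes.fasta
fi
```

If a genome shows up more than once, some contig headers were probably left in place.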

And then do your alignments.

A word to the wise though, if you're trying to align whole genomes, clustal and muscle aren't up to the task.

Yes, you are right: when I run the alignment, it reports that the file is too large. It works with Mauve, but Mauve does not give me an output file that I can use for the phylogeny. Do you have a suggestion for building a phylogenetic tree in my case?

Once you have a multiple alignment, you can build a tree in many ways. RAxML and PhyML are common and robust programs, but I have no idea how they will handle a whole-genome-sized alignment. FastTree might be a decent option here if speed is an issue.

Thanks for your answer. Do you have a method to extract consensus sequences and use them for the alignment instead of the full genomes?

I'm not sure what you mean. If you've already aligned the genomes, why do you want to align to a consensus sequence?

I think you should expand your original question to tell us exactly what it is you are trying to do. Really large alignments are often poor quality, so I question your approach currently.

If you're determined to get a consensus sequence, take a look at my answer here: A: Protein Sequence Analyses

Dear healey, I am trying to build a phylogenetic tree, and I have already concatenated the FASTA files into a single FASTA file with your command, but I do not know how to build a tree from the complete genomes. That is why I thought of extracting consensus sequences, though I think I was mistaken. To use RAxML or PhyML I need a .phylip file, so now I'm looking for a simpler method to build a phylogenetic tree. Thank you for your time.

What file format is your alignment in? You need to convert it to PHYLIP which can be done in a myriad of ways.

With respect, you should read up on the process of making a phylogenetic tree if you're stuck on these steps. It's quite basic and fundamental, and the forum is not a place well suited to baby-steps through a process.

For reference, you need to:

Get sequences (concatenated) > Create an alignment > Convert that alignment to PHYLIP (if necessary) > Input the alignment file into whatever tree software you like.

And for the record, I believe FastTree supports an aligned multi-FASTA input, if that helps.
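As one illustration of the conversion step, here is a rough awk sketch that turns an aligned multi-FASTA into a relaxed sequential PHYLIP file. The function name fasta_to_phylip and the file names are hypothetical. It assumes the tree program accepts relaxed PHYLIP (names longer than 10 characters, separated from the sequence by a space), so check your tool's documentation; strict PHYLIP would need names padded or truncated to 10 characters.

```shell
#!/bin/sh
# Sketch: convert an aligned multi-FASTA to relaxed sequential PHYLIP.
# Output: a "ntaxa nsites" line, then one "name sequence" line per record.
# The name is the header text up to the first whitespace.

fasta_to_phylip() {
    awk '
        /^>/ { n++; split(substr($0, 2), f, /[ \t]+/); name[n] = f[1]; next }
        n    { gsub(/[ \t\r]/, ""); seq[n] = seq[n] $0 }   # join wrapped lines
        END  {
            if (n == 0) { print "no sequences found" > "/dev/stderr"; exit 1 }
            len = length(seq[1])
            for (i = 2; i <= n; i++)
                if (length(seq[i]) != len) {
                    print "sequences differ in length; input is not aligned" > "/dev/stderr"
                    exit 1
                }
            print n, len
            for (i = 1; i <= n; i++) print name[i], seq[i]
        }
    ' "$1"
}

# Example call (hypothetical file names):
#   fasta_to_phylip alignment.fasta > alignment.phy
```

The length check is there because a file of unaligned sequences is the most likely mistake at this step. For FastTree, the conversion can be skipped: going by its documentation, something like `FastTree -nt -gtr alignment.fasta > tree.nwk` reads an aligned nucleotide FASTA directly.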

Sir healey, I know how the process of building a phylogenetic tree works, with all of its steps. I just wanted to tell you that I'm stuck because alignment tools like MUSCLE, MAFFT, or Clustal do not work well with complete genomes (thanks for the idea of concatenating the FASTA files, but the alignment with the tools I mentioned is very slow). Thank you for your help.

I'm still not exactly sure what your question is... but as I told you, CLUSTAL and MUSCLE won't handle whole-genome assemblies well. I'm not sure about MAFFT, but I suspect it will be the same.

For large pairwise sequences, use MUMmer. For multiple sequence alignment, try LAST. I've had some luck with Kalign in the past for large sequences, though not the full size of a genome.

Do you really need to do a whole genome alignment for whatever your question is? (Which you still haven't told us). Perhaps you could just use a locus typing approach or similar instead if you don't have the computing power for this.
