Hi,
I am using a pipeline called PanX for a phylogenomic study. One of my aims is to build a core-genome phylogenomic tree. However, the output FASTA files for the core-genome gene set (~1100 gene files) are separated by gene, as below:
GeneA
>My_bacteriaA_GeneA
atgatg
>My_bacteriaB_GeneA
atgaag
>My_bacteriaC_GeneA
atgatg
GeneB
>My_bacteriaB_GeneB
atggtc
>My_bacteriaC_GeneB
atggtc
>My_bacteriaA_GeneB
atggta
And on and on until...
GeneZ
>My_bacteriaA_GeneZ
atggta
>My_bacteriaC_GeneZ
atggta
>My_bacteriaB_GeneZ
atggtg
I would like a single concatenated FASTA file that combines all the core genes for each bacterium, as below, to build a phylogenomic tree:
>My_bacteriaA
atgatgatggta...atggta
>My_bacteriaB
atgaagatggtc...atggtg
>My_bacteriaC
atgatgatggtc...atggta
How can I create the concatenated FASTA file? Please note that within each gene's FASTA file the bacteria appear in random order, not in any particular order.
Thank you!
Hi, SEDA (https://www.sing-group.org/seda) has an operation named "Concatenate sequences" (https://www.sing-group.org/seda/manual/operations.html#concatenate-sequences) that is meant to do what you ask for. However, to use it, the sequence names or identifiers must be identical across the files, so you will first need to perform a "Rename header" operation (https://www.sing-group.org/seda/manual/operations.html#rename-header) to split your sequence headers into the form "My_bacteriaA GeneA". Regards.
Hello fec2,
Should the (GeneA), (GeneB) ... labels be inserted into the concatenated sequence? I guess this would cause problems in downstream analysis.
Are the sequences always on one line, or is this multi-line FASTA?
fin swimmer
Hi, the (GeneA), (GeneB) ... labels shouldn't be inserted. I put them in my question just to show that I want to combine all the gene sequences. I have removed them to avoid confusion. Thank you.
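For a scripted route, here is a minimal Python sketch of the concatenation described in the question. It is not part of PanX or SEDA, and it rests on several assumptions: each header has the form ">strain_gene" with no underscore inside the gene name, every strain appears exactly once in every gene file, and the function names (read_fasta, concatenate_core_genes, write_fasta) are hypothetical. It accepts both single-line and multi-line FASTA, and it does not care about the order of the records inside each file. Genes are joined in the order the files are passed in, so that order should be fixed (for example, sorted).

```python
from collections import defaultdict


def read_fasta(path):
    """Yield (header, sequence) pairs from single- or multi-line FASTA."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate_core_genes(gene_files):
    """Join the sequences of each strain across gene files, in file order."""
    concatenated = defaultdict(list)
    expected = None  # strains found in the first file
    for path in gene_files:
        per_file = {}
        for header, seq in read_fasta(path):
            # Assumption: header is "<strain>_<gene>"; drop the last token.
            strain = header.rsplit("_", 1)[0]
            if strain in per_file:
                raise ValueError(f"{strain} appears twice in {path}")
            per_file[strain] = seq
        if expected is None:
            expected = set(per_file)
        elif set(per_file) != expected:
            raise ValueError(f"{path} does not contain the same strains")
        for strain, seq in per_file.items():
            concatenated[strain].append(seq)
    return {strain: "".join(parts) for strain, parts in concatenated.items()}


def write_fasta(sequences, out_path):
    """Write one record per strain, sorted by strain name."""
    with open(out_path, "w") as out:
        for strain in sorted(sequences):
            out.write(f">{strain}\n{sequences[strain]}\n")


# Example call (directory and file names are hypothetical):
#   import glob
#   files = sorted(glob.glob("core_genes/*.fasta"))
#   write_fasta(concatenate_core_genes(files), "core_concatenated.fasta")
```

The sketch only joins the sequences as they are. It does not check that the sequences within a gene file are aligned to the same length, which a tree-building program will usually expect of the concatenated file.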