Concatenate multifastas

3.5 years ago

Colaptes ▴ 100

Hello, I have a task that seems like it should be simple, but I haven't found a solution yet. I have several thousand fasta files, each containing an alignment of 30 samples. Each entry's header is the sample name, and every file contains the same 30 samples. I would like to concatenate the sequences across all of the fasta files, sample by sample, so that I end up with one fasta file containing the 30 samples. For example:

Starting data:

Gene1.fasta

>Sample1
CCCCCCCCC
>Sample2
AAAAAAAAA

Gene2.fasta

>Sample1
TTTTTTTTTTTTTTT
>Sample2
GGGGGGGGGGGGGGG

Desired output:

AllGenes.fasta

>Sample1
CCCCCCCCCTTTTTTTTTTTTTTT
>Sample2
AAAAAAAAAGGGGGGGGGGGGGGG

So far the only solution I have come up with is this:

for sample in Sample1 Sample2; do
    echo ">$sample" > "$sample".temp.fasta
    for gene in Gene1 Gene2; do
        seqkit grep -p "$sample" "$gene".fasta | grep -v ">" >> "$sample".temp.fasta
    done
done
cat *.temp.fasta > AllGenes.fasta

However, that seems terribly inefficient for thousands of genes. Is there a better way?

join fasta concatenate multifasta • 704 views

score 2 · Accepted Answer · 2021-06-01

3.5 years ago

GenoMax 147k

See answers here: Combining two fasta sequences into one

I recommend you use seqkit concat.
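For the example in the question, a rough sketch might look like the following. The seqkit line assumes seqkit is installed and that the files are named as in the question. The `concat_fasta` function is a hypothetical, dependency-free awk equivalent (it is not part of seqkit); it assumes the header lines are identical across files and that everything fits in memory.

```shell
# With seqkit installed, the recommended command for the question's example
# would be (file names taken from the question):
#   seqkit concat Gene1.fasta Gene2.fasta > AllGenes.fasta

# Hypothetical awk equivalent (not part of seqkit): append every sequence
# line to the running sequence for its header, then print the headers in
# the order in which they first appeared.
concat_fasta() {
  awk '
    /^>/ {
      id = $0                      # the full header line is used as the key
      if (!(id in seq)) {
        order[++n] = id            # remember first-seen order
        seq[id] = ""
      }
      next
    }
    { seq[id] = seq[id] $0 }       # also joins multi-line sequences
    END {
      for (i = 1; i <= n; i++) {
        print order[i]
        print seq[order[i]]
      }
    }
  ' "$@"
}

# Usage: concat_fasta Gene*.fasta > AllGenes.fasta
```

Note that the genes are concatenated in the order the files are given on the command line, so a glob such as `Gene*.fasta` yields the shell's sort order (Gene10 before Gene2).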

Thank you, that is perfect!

3.5 years ago by Colaptes ▴ 100