Hello, I have a task that seems like it should be simple, but I haven't found a solution yet. I have several thousand FASTA files, each containing an alignment of 30 samples. The header of each entry is the sample name, and every file contains the same 30 samples. I would like to concatenate the sequences across all of the FASTA files, sample by sample, so that I end up with one FASTA file containing the 30 samples. For example:
Starting data:
Gene1.fasta
>Sample1
CCCCCCCCC
>Sample2
AAAAAAAAA
Gene2.fasta
>Sample1
TTTTTTTTTTTTTTT
>Sample2
GGGGGGGGGGGGGGG
Desired output:
AllGenes.fasta
>Sample1
CCCCCCCCCTTTTTTTTTTTTTTT
>Sample2
AAAAAAAAAGGGGGGGGGGGGGGG
So far the only solution I have come up with is this:
for sample in Sample1 Sample2 ; do
    echo ">$sample" > "$sample".temp.fasta
    for gene in Gene1 Gene2 ; do
        seqkit grep -p "$sample" "$gene".fasta | grep -v ">" >> "$sample".temp.fasta
    done
done
cat *.temp.fasta > AllGenes.fasta
But that seems terribly inefficient for thousands of genes. Is there a better way?
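One possible approach is a single awk pass over all of the files, appending each sequence line to a per-sample buffer and printing everything at the end. This is only a sketch under some assumptions: the function name concat_fasta is made up for illustration, the sample name is taken to be the first word of each header, every file is assumed to contain the same samples, and all sequences must fit in memory.

```shell
# Hypothetical helper: concatenate sequences per sample across FASTA files.
# Reads the files given as arguments (or stdin if none are given).
concat_fasta() {
    awk '
        { sub(/\r$/, "") }                  # tolerate Windows line endings
        /^>/ {
            id = substr($1, 2)              # sample name = first word of header
            if (!(id in seen)) {            # remember first-seen sample order
                seen[id] = 1
                order[++n] = id
            }
            next
        }
        NF { seq[id] = seq[id] $0 }         # append sequence line to its sample
        END {
            for (i = 1; i <= n; i++)
                printf ">%s\n%s\n", order[i], seq[order[i]]
        }
    ' "$@"
}

# Example usage (file pattern assumed from the question):
#   concat_fasta Gene*.fasta > AllGenes.fasta
```

Genes are joined in the order the shell expands the glob (so Gene10 sorts before Gene2), but that order is the same for every sample, so the concatenated alignment stays consistent. Make sure the output file name does not match the input pattern. Since seqkit is already in use here, it may also be worth checking whether the installed version provides the concat subcommand, which is meant for joining sequences with the same ID from multiple files.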
Thank you, that is perfect!