Question

Combining two fasta sequences into one

2

7.9 years ago

Lille My ▴ 30

I have two fasta files, with the same headers/names for the sequences but different sequences. I would like to combine them into one file, so that each sequence has the same name but is a combination of both sequences. My preferred language is bash script, but I'm open to other suggestions. thanks.

sequence • 11k views

ADD COMMENT • link updated 7.9 years ago by Pierre Lindenbaum 164k • written 7.9 years ago by Lille My ▴ 30

0

with the same headers/names for the sequences but different sequences

uhh ?

would like to combine them into one file, s

an example is needed

ADD REPLY • link 7.9 years ago by Pierre Lindenbaum 164k

0

Like this?

File_1:

>Seq_1
ACGCTAGCTA
>Seq_2
CGCTAGCTC

File_2:

>Seq_1
GCTGAT
>Seq_2
TTACTC

File_1 + File_2 = File_3

>Seq_1
ACGCTAGCTAGCTGAT
>Seq_2
CGCTAGCTCTTACTC

ADD REPLY • link 7.9 years ago by GenoMax 148k

0

yes, your example is exactly what I need to do.

ADD REPLY • link 7.9 years ago by Lille My ▴ 30
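
For reference, the concatenation confirmed above can be sketched with plain bash and awk, without extra tools. This is a minimal sketch and not code from any poster in this thread: `concat_fasta` is a hypothetical helper name, headers are matched as whole lines, both files are assumed to fit in memory, and the output follows the record order of the first file.

```shell
#!/usr/bin/env bash
# concat_fasta FILE1 FILE2 (hypothetical helper, not from the thread):
# for each header in FILE1, print the header once, followed by FILE1's
# sequence directly joined to FILE2's sequence for the same header.
# Multi-line sequences are joined into a single line.
concat_fasta() {
    awk '
        /^>/ {
            id = $0
            if (NR == FNR) order[++n] = id   # header order comes from the first file
            else           seen[id] = 1      # header exists in the second file
            next
        }
        NR == FNR { a[id] = a[id] $0; next } # sequence lines of the first file
                  { b[id] = b[id] $0 }       # sequence lines of the second file
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                if (id in seen) {
                    print id
                    print a[id] b[id]
                } else {
                    # Assumption: a header missing from FILE2 is an error.
                    print "missing in second file: " id > "/dev/stderr"
                    missing = 1
                }
            }
            exit missing
        }
    ' "$1" "$2"
}
```

Called as `concat_fasta File_1 File_2 > File_3`, this should produce the File_3 shown in the example above. Because records are looked up by header, the two files do not need to be in the same order.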

0

Does this make biological sense?

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Sometimes it does, depends on what kind of sequences you have.

ADD REPLY • link 7.9 years ago by Lille My ▴ 30

score 5 · Accepted Answer · 2017-01-15

A solution using seqkit. (The original version of this answer also used csvtk and sed; those steps are kept below.)

Sample files (not in same order, can be multiple lines):

$ cat 1.fa
>seq1
aaa
aa
>seq2
ccc
cc
>seq3
ggg
gg

$ cat 2.fa
>seq3
TTT
TT
>seq2
GGG
GG
>seq1
CCC
CC

Just one command:

$ seqkit concat 1.fa 2.fa
>seq1
aaaaaCCCCC
>seq2
cccccGGGGG
>seq3
gggggTTTTT

Step 1. Convert each FASTA file to a tab-delimited file (3 columns; the 3rd column is blank because FASTA has no quality scores):

$ seqkit fx2tab 1.fa > 1.fa.tsv
$ seqkit fx2tab 2.fa > 2.fa.tsv
$ cat -A 1.fa.tsv
seq1^Iaaaaa^I$
seq2^Iccccc^I$
seq3^Iggggg^I$

Step 2. Merge two table files:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | cat -A
seq1^Iaaaaa^I^ICCCCC^I$
seq2^Iccccc^I^IGGGGG^I$
seq3^Iggggg^I^ITTTTT^I$

Step 3. Note that the two sequences are separated by two TABs (the empty 3rd column of the first file sits between them), so removing those TABs joins the sequences:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | sed 's/\t\t//'
seq1 aaaaaCCCCC
seq2 cccccGGGGG
seq3 gggggTTTTT

Step 4. Convert the tab-delimited file back to a FASTA file:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | sed 's/\t\t//' | seqkit tab2fx
>seq1
aaaaaCCCCC
>seq2
cccccGGGGG
>seq3
gggggTTTTT

All in one command:

$ csvtk join -H -t <(seqkit fx2tab 1.fa) <(seqkit fx2tab 2.fa) | sed 's/\t\t//' | seqkit tab2fx
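
Since the join pairs records by sequence ID, an ID that appears in only one file cannot be paired, so it may be worth comparing the IDs before joining. The following is a small sketch under stated assumptions, not part of the original answer: `check_ids` is a hypothetical helper name, and the ID is taken to be the first whitespace-delimited word after `>`.

```shell
#!/usr/bin/env bash
# check_ids FILE1 FILE2 (hypothetical helper, not from the original answer):
# succeed if both FASTA files contain the same set of sequence IDs;
# otherwise print the unmatched IDs to stderr and return 1.
# Assumption: the ID is the first whitespace-delimited word after ">".
check_ids() {
    local only
    only=$(LC_ALL=C comm -3 \
        <(awk '/^>/ { print substr($1, 2) }' "$1" | LC_ALL=C sort) \
        <(awk '/^>/ { print substr($1, 2) }' "$2" | LC_ALL=C sort))
    if [ -n "$only" ]; then
        # comm -3 prints IDs unique to FILE1 unindented and
        # IDs unique to FILE2 indented by one TAB.
        printf '%s\n' "$only" >&2
        return 1
    fi
}
```

For example, `check_ids 1.fa 2.fa && echo "IDs match"` could be run before the join; the order of the records in each file does not matter for this check.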

score 4 · Accepted Answer · 2017-01-15

4

7.9 years ago

Pierre Lindenbaum 164k

Assuming there are only two lines per sequence (title/DNA) and both files list the sequences in the same order:

paste  f1.fa f2.fa | sed -e 's/\t>.*//' -e 's/\t//'

ADD COMMENT • link 7.9 years ago by Pierre Lindenbaum 164k

0

There are more lines per sequence, but I can convert them to one-liners first. I will try this out. Thanks!

ADD REPLY • link 7.9 years ago by Lille My ▴ 30
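
The conversion to one-liners mentioned above can be combined with the `paste` approach roughly as follows. This is a sketch under assumptions rather than anyone's posted code: `linearize` and `paste_concat` are hypothetical helper names, both files must still list the sequences in the same order, and bash's `$'...'` quoting is used to hand a literal TAB to sed, since `\t` inside a sed pattern is not understood by every sed implementation.

```shell
#!/usr/bin/env bash
# linearize FILE (hypothetical helper): print every record as exactly
# two lines, the header and then the whole sequence on one line.
linearize() {
    awk '
        /^>/ { if (NR > 1) printf "\n"; print; next }  # end previous sequence, print header
             { printf "%s", $0 }                        # append sequence line without newline
        END  { if (NR > 0) printf "\n" }                # terminate the last sequence
    ' "$1"
}

# paste_concat FILE1 FILE2 (hypothetical helper): linearize both files,
# then apply the paste/sed idea from the answer above.
# Assumption: both files contain the same sequences in the same order.
paste_concat() {
    paste <(linearize "$1") <(linearize "$2") |
        sed -e $'s/\t>.*//' -e $'s/\t//'   # drop the second header, then the TAB between sequences
}
```

Usage would be `paste_concat f1.fa f2.fa > combined.fa`. Unlike an ID-based join, this pairs records purely by position, so a reordered or missing record in one file would silently produce wrong combinations.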