Combining two fasta sequences into one
7.9 years ago
Lille My ▴ 30

I have two FASTA files with the same headers/names for the sequences but different sequences. I would like to combine them into one file, so that each record keeps its name and its sequence is the concatenation of the two sequences. I would prefer a Bash script, but I'm open to other suggestions. Thanks.


with the same headers/names for the sequences but different sequences

uhh ?

would like to combine them into one file, s

an example is needed


Like this?

File_1:

>Seq_1
ACGCTAGCTA
>Seq_2
CGCTAGCTC

File_2:

>Seq_1
GCTGAT
>Seq_2
TTACTC

File_1 + File_2 = File_3

>Seq_1
ACGCTAGCTAGCTGAT
>Seq_2
CGCTAGCTCTTACTC

yes, your example is exactly what I need to do.


Does this make biological sense?


Sometimes it does; it depends on what kind of sequences you have.

7.9 years ago

A solution using seqkit, csvtk and shell sed.

Sample files (records are not in the same order, and sequences can span multiple lines):

$ cat 1.fa
>seq1
aaa
aa
>seq2
ccc
cc
>seq3
ggg
gg

$ cat 2.fa
>seq3
TTT
TT
>seq2
GGG
GG
>seq1
CCC
CC

Just one command:

$ seqkit concat 1.fa 2.fa
>seq1
aaaaaCCCCC
>seq2
cccccGGGGG
>seq3
gggggTTTTT

The same result can also be obtained step by step.

Step 1. Convert each FASTA file to a tab-delimited file (3 columns; the 3rd column is blank because FASTA has no quality values):

$ seqkit fx2tab 1.fa > 1.fa.tsv
$ seqkit fx2tab 2.fa > 2.fa.tsv

$ cat -A 1.fa.tsv 
seq1^Iaaaaa^I$
seq2^Iccccc^I$
seq3^Iggggg^I$

Step 2. Merge two table files:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | cat -A
seq1^Iaaaaa^I^ICCCCC^I$
seq2^Iccccc^I^IGGGGG^I$
seq3^Iggggg^I^ITTTTT^I$

Step 3. Note that there are two TABs between the two sequences; remove them to join the sequences:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | sed 's/\t\t//'
seq1    aaaaaCCCCC
seq2    cccccGGGGG
seq3    gggggTTTTT

Step 4. Convert tab-delimited file back to FASTA file:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | sed 's/\t\t//' | seqkit tab2fx
>seq1
aaaaaCCCCC
>seq2
cccccGGGGG
>seq3
gggggTTTTT

All in one command:

$ csvtk join -H -t <(seqkit fx2tab 1.fa) <(seqkit fx2tab 2.fa) | sed 's/\t\t//' | seqkit tab2fx
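
If seqkit and csvtk are not available, the same join-by-name idea can be sketched in plain awk. This is only an illustration, not part of either tool: the function name concat_fasta is hypothetical, the ID is assumed to be the first word of the header, output follows the record order of the first file, and IDs that occur only in the second file are silently dropped.

```shell
# Hypothetical helper (not part of seqkit/csvtk): concat_fasta FILE1 FILE2
# Assumptions: the ID is the first word after ">", FILE1 is not empty,
# and records present only in FILE2 are ignored.
concat_fasta() {
    awk '
        /^>/ {
            # Header line: take the first word, without the leading ">"
            id = substr($1, 2)
            # Remember the order in which IDs appear in the first file
            if (FNR == NR && !(id in seen)) { seen[id] = 1; order[++n] = id }
            next
        }
        # Sequence lines: append to the current record (handles wrapped FASTA)
        FNR == NR { first[id] = first[id] $0; next }
                  { second[id] = second[id] $0 }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                print ">" id
                print first[id] second[id]
            }
        }
    ' "$1" "$2"
}
```

With the sample files above, `concat_fasta 1.fa 2.fa` should give the same three records as the seqkit commands, each written on a single sequence line.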


Thanks! I will try this out.

7.9 years ago

Assuming there are only two lines per sequence (title/DNA) and the records are ordered the same way in both files:

paste f1.fa f2.fa | sed -e 's/\t>.*//' -e 's/\t//'

There are more lines, but I can convert them into one-liners. I will try this out. Thanks!
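
A minimal sketch of that one-liner conversion, assuming each file starts with a header line and contains nothing but headers and sequence lines. The function name linearize is hypothetical; it simply joins the wrapped sequence lines of every record so that the paste approach above sees exactly two lines per record.

```shell
# Hypothetical helper: linearize FILE
# Rewrites a FASTA file so that every record is exactly two lines
# (header, then the whole sequence). Assumes the first line is a header.
linearize() {
    awk '
        /^>/ {
            if (NR > 1) print seq   # flush the previous record
            print                   # print the header unchanged
            seq = ""
            next
        }
        { seq = seq $0 }            # join wrapped sequence lines
        END { if (NR > 0) print seq }
    ' "$1"
}

# Usage in bash (both files must list their records in the same order;
# \t inside sed is a GNU sed feature):
#   paste <(linearize f1.fa) <(linearize f2.fa) | sed -e 's/\t>.*//' -e 's/\t//'
```

A record with an empty sequence still produces an empty second line, so the two-lines-per-record layout that paste relies on is preserved.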
