Question

Concatenate Two .Fasta Files Into One

0

10.9 years ago

Alice ▴ 320

Hello, Biostars! I have two FASTA files for two different genes and want to combine them into one data matrix. Is there a function in R for that, for example if I have two DNAbin objects for those genes? The sequence IDs are identical in both files. This is the first file:

>sp1
aacc
>sp2
ggtt

the second file:

>sp1
ggaa
>sp2
ttgg

I want:

>sp1
aaccggaa
>sp2
ggttttgg

Python is also OK, but I'm interested in R.

fasta r • 15k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 10.9 years ago by Alice ▴ 320

Could you comment on the rationale behind what you're trying to do?

ADD REPLY • link 10.9 years ago by Biojl ★ 1.7k

In a few words: concatenated sequence matrix -> alignment -> phylogenetic tree

ADD REPLY • link 10.9 years ago by Alice ▴ 320

Is this some kind of homework question? I answered the same question 4-5 days ago. See here: C: Combining dna sequences files into one

ADD REPLY • link 10.9 years ago by Ashutosh Pandey 12k

No, it's for my lab work. Your answer is also helpful, thanks.

ADD REPLY • link 10.9 years ago by Alice ▴ 320

Ram · Answer 1 · 2013-12-27

4

10.9 years ago

Devon Ryan 104k

Just cbind(A,B) to merge the sequences for DNAbin A and DNAbin B:

A.fa:

>sp1
aacc
>sp2
ggtt

B.fa:

>sp1
ggaa
>sp2
ttgg

In R using DNAbin (as you requested):

library(ape)
A <- read.dna("A.fa", format="fasta")
B <- read.dna("B.fa", format="fasta")
C <- cbind(A,B)
write.dna(C, "C.fa", format="fasta")

C.fa:

>sp1
aaccggaa
>sp2
ggttttgg

See help(DNAbin) for more details about options for cbind(), particularly fill.with.gaps and check.names.

ADD COMMENT • link 10.9 years ago by Devon Ryan 104k
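
Since the question says Python is also OK, here is a minimal sketch of the same idea in plain Python: join two sets of sequences by ID rather than by position. It is only loosely modelled on what the check.names and fill.with.gaps options of cbind() are for, and it is not ape's implementation. The function name concat_by_id, its parameters, and the assumption that all sequences within one input have the same length are mine.

```python
def concat_by_id(a, b, fill_with_gaps=False, gap="-"):
    """Join two {id: sequence} mappings column-wise, matching rows by ID.

    Assumption: every sequence within one mapping has the same length
    (i.e. each input is already an alignment / matrix).
    """
    # Block widths, taken from the first sequence of each input.
    len_a = len(next(iter(a.values()), ""))
    len_b = len(next(iter(b.values()), ""))

    if fill_with_gaps:
        # Keep every ID seen in either input, in order of first appearance.
        ids = list(a) + [i for i in b if i not in a]
    else:
        # Keep only IDs present in both inputs, in the order of the first.
        ids = [i for i in a if i in b]

    # A missing block is replaced by a run of gap characters.
    return {i: a.get(i, gap * len_a) + b.get(i, gap * len_b) for i in ids}


# The example data from the question; B is listed in a different order on purpose.
A = {"sp1": "aacc", "sp2": "ggtt"}
B = {"sp2": "ttgg", "sp1": "ggaa"}

for name, seq in concat_by_id(A, B).items():
    print(f">{name}\n{seq}")
```

With fill_with_gaps=True, an ID present in only one input is kept and padded with gap characters on the other side; by default such IDs are dropped.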

I've already tried that. Error: the 'cbind' method for "DNAbin" accepts only matrices

ADD REPLY • link 10.9 years ago by Alice ▴ 320

How did you read in the sequences?

ADD REPLY • link 10.9 years ago by Devon Ryan 104k

read.dna("B.fa", format="fasta") - fail
read.FASTA("B.fasta") - fail

ADD REPLY • link 10.9 years ago by Alice ▴ 320

If you get an error message of "fail" or something like that, then you have bigger issues.

ADD REPLY • link 10.9 years ago by Devon Ryan 104k

By "fail" I mean the same error message in both cases: the 'cbind' method for "DNAbin" accepts only matrices

ADD REPLY • link 10.9 years ago by Alice ▴ 320

It would be helpful if you posted a reproducible example. The original examples in your question will work fine.

ADD REPLY • link 10.9 years ago by Devon Ryan 104k

I think the problem is line wrapping, i.e. one sequence is split across several lines, like:

>sp1
aattgg
aaggtt

and not

>sp1
aattggaaggtt

ADD REPLY • link 10.9 years ago by Alice ▴ 320
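
In case wrapped sequence lines are a concern for line-based tools, here is a small sketch in Python that joins all lines of a record into one sequence before any concatenation. The function name read_fasta is hypothetical, the ID is assumed to be the first word after ">", and a repeated ID silently overwrites the earlier record.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {id: sequence}, joining wrapped sequence lines."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the ID is the first whitespace-delimited word of the header.
            current = header.split()[0] if header else ""
            records[current] = []
        elif current is not None:
            # Sequence line: collect it under the most recent header.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


wrapped = [">sp1", "aattgg", "aaggtt"]
print(read_fasta(wrapped))  # {'sp1': 'aattggaaggtt'}
```

To read from disk, an open file object can be passed directly, for example read_fasta(open("A.fa")), since iterating over a file yields its lines.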

Worked for me.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by pescadordigital ▴ 10

Ram · Answer 2 · 2013-12-27

4

10.9 years ago

Haluk ▴ 190

You can do this with paste and awk:

paste A.fa B.fa | awk '{if (NR%2==0) {print $1 $2} else {print $1}}'

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 10.9 years ago by Haluk ▴ 190
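
For anyone unfamiliar with awk, the one-liner can be read like this: paste puts line N of A.fa and line N of B.fa side by side, and awk prints the two sequences glued together on even lines and only the first header on odd lines. A rough Python equivalent, assuming no line contains internal whitespace (paste_concat is a hypothetical name):

```python
def paste_concat(lines_a, lines_b):
    """Merge two FASTA line lists position by position.

    Odd-numbered lines are treated as headers (the one from the first
    file is kept); even-numbered lines are treated as sequences and
    joined. IDs are never compared.
    """
    merged = []
    # Note: zip stops at the shorter input, whereas paste would pad it.
    for n, (la, lb) in enumerate(zip(lines_a, lines_b), start=1):
        if n % 2 == 0:
            merged.append(la.strip() + lb.strip())  # sequence line
        else:
            merged.append(la.strip())  # header line from the first file
    return merged


a = [">sp1", "aacc", ">sp2", "ggtt"]
b = [">sp1", "ggaa", ">sp2", "ttgg"]
print("\n".join(paste_concat(a, b)))
```

For the two example files from the question this prints the four lines of the desired output.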

Thank you! It works. I have absolutely no experience with awk, so I have one question: does the order of IDs in A.fa have to be the same as in B.fa, or does the concatenation work by comparing IDs across the two files?

ADD REPLY • link 10.9 years ago by Alice ▴ 320

The order has to be the same, and each sequence can occupy only one line. The command pairs lines by position and never compares IDs.

ADD REPLY • link 10.9 years ago by Devon Ryan 104k
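
Because a line-by-line merge never looks at the IDs, it may be worth checking both files before running it. Here is a sketch of such a check in Python; the function name check_linewise_ready and the wording of the messages are made up for illustration:

```python
def check_linewise_ready(lines_a, lines_b):
    """Return a list of problems that would break a line-by-line merge."""
    problems = []
    if len(lines_a) != len(lines_b):
        problems.append("files have different numbers of lines")
    for n, (la, lb) in enumerate(zip(lines_a, lines_b), start=1):
        la, lb = la.strip(), lb.strip()
        if n % 2 == 1:
            # Odd lines must be headers, and the headers must match.
            if not (la.startswith(">") and lb.startswith(">")):
                problems.append(f"line {n}: expected a header in both files")
            elif la != lb:
                problems.append(f"line {n}: IDs differ ({la} vs {lb})")
        elif la.startswith(">") or lb.startswith(">"):
            # Even lines must be sequences, one line per record.
            problems.append(f"line {n}: expected a sequence line in both files")
    return problems


same_order = [">sp1", "aacc", ">sp2", "ggtt"]
other_order = [">sp2", "ttgg", ">sp1", "ggaa"]
for problem in check_linewise_ready(same_order, other_order):
    print(problem)
```

An empty list means both conditions hold. A sequence wrapped over two lines shows up as a missing header on the following odd line.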

OK, thanks, that is really important.

ADD REPLY • link 10.9 years ago by Alice ▴ 320

paste -d '\0' File_A File_B | sed 's/^\(>[^>]*\)>.*$/\1/' > File_C.fa will also do the same.

ADD REPLY • link 10.9 years ago by Ashutosh Pandey 12k