How to concatenate clustalw aligned fasta sequence?
1
0
Entering edit mode
4.8 years ago
Kumar ▴ 120

I need to concatenate clustalw aligned multifasta alignment file. I have tried many concatenating tools, which is not recognising the gaps (---) of the alignment and ultimately end up with error stating that sequence dissimilar length. Is there any way to do the same. For example I have alignment file a.fasta and b.fasta as given below,

a.fasta
>or1
ATGTCT-----TGA
>or2
-----TGATAG-----
>or3
TGATA-----TAGTT

b.fasta
>or2
ATGATGTATGATGATA
>or1
GTAGATAGATAGAG
>or3
ATGCTAGATAGATAG

The expected output is given below,

c.fasta
>or1
ATGTCT-----TGAGTAGATAGATAGAG
>or2
-----TGATAG-----ATGATGTATGATGATA
>or3
TGATA-----TAGTT ATGCTAGATAGATAG
alignment fasta sequence clustalw • 3.6k views
ADD COMMENT
1
Entering edit mode

are all those sequences (aligned ones) on a single line ?

if so then someting like this should work:

paste <(cat a.fasta | paste - - | sort) <(cat b.fasta | paste - - | sort) | awk '{ print $1,$2$4 }' | sed 's/ /\n/g'
ADD REPLY
0
Entering edit mode

Dear @ lieven.sterck, Here, I showed an example sequence. But, in reality, I have multiple lined fasta file. All the sequences are not in single line.

ADD REPLY
0
Entering edit mode

you can first linearise them if that does not hinder the downstream analysis? eg using this:

https://gist.github.com/lindenb/2c0d4e11fd8a96d4c345

but I must admit that shenwei356 solution using the seqkit concat is much neater ;)

ADD REPLY
0
Entering edit mode

Thank you @lieven.sterck for your valuable suggestion.

ADD REPLY
0
Entering edit mode

-update: I didn't realize that b.fasta was aligned. The following method only works if the second fasta file is unaligned. If you have the raw sequences for b.fasta you can use this method. Specially if you repeatedly want to align sequences with a.fasta you can have a profile alignment of a.fasta (a.hmm) and use hmmalign to align other sequences to a.fasta.

You can use hmmbuild to build a profile from a.fasta and then use hmmalign to align b.fasta to the hmm profile.

hmmbuild a.hmm a.fasta

Or even better: Convert aligned fasta to stockholm

hmmbuild a.hmm a.sto
hmmalign with -mapali option
hmmalign -o c.sto --mapali a.sto a.hmm b.fasta
hmmalign --outformat A2M -o c.fasta --mapali a.fasta a.hmm b.fasta

I think hmmalign works best with stockholm format but you can try it with aligned fasta (a.fasta) and see if it works.

ADD REPLY
0
Entering edit mode

Thank you @Fatima, However it shows following error,

ga@pls:~/Documents/data/xap95/amphora_anal/clustal_codon/gblock_withgab/gblock_gapfree/test$ hmmbuild a.hmm a.fasta 
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b2 (February 2015); http://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             a.fasta
# output HMM file:                  a.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen     W eff_nseq re/pos description
#---- -------------------- ----- ----- ----- ----- -------- ------ -----------
Alignment input parse error:
   sequence or2 has alen 16; expected 14
   while reading aligned FASTA file a.fasta
   at or near line 5
ADD REPLY
0
Entering edit mode

All sequences in an aligned fasta should have the same length. or1 is 14 but or2 is 16.

ADD REPLY
0
Entering edit mode

hi Fatima ,

can you elaborate on how (or why) this is answering the original question?

if the b.fasta alignments do not cover the same region as a.fasta, then the hmm-align will also not align them, no?

ADD REPLY
0
Entering edit mode

My bad. Because b.fasta didn't have any gaps I thought it's an unaligned fasta.

ADD REPLY
4
Entering edit mode
4.8 years ago

seqkit concat

seqkit concat a.fasta b.fasta > c.fasta
ADD COMMENT
0
Entering edit mode

Thank you @shenwei365. It works fine. But, I could not output the result in new file. It shows result in terminal only, when I output the result using this syntax > c.fasta, which show following error.

ga@pls:~/Documents/data/xap95/amphora_anal/clustal_codon/gblock_withgab/gblock_gapfree/test$ ~/Documents/tools/seqkit/seqkit concat a.fasta b.fasta > c.fasta 
[INFO] read file: a.fasta
[INFO] 3 records loaded
[INFO] read file: b.fasta
[INFO] 3 records loaded

And, when I am not specifying the output, the results showed on the terminal as follows,

ga@pls:~/Documents/data/xap95/amphora_anal/clustal_codon/gblock_withgab/gblock_gapfree/test$ ~/Documents/tools/seqkit/seqkit concat a.fasta b.fasta 
[INFO] read file: a.fasta
[INFO] 3 records loaded
[INFO] read file: b.fasta
[INFO] 3 records loaded
>or1
ATGTCT-----TGAGTAGATAGATAGAG
>or2
-----TGATAG-----ATGATGTATGATGATA
>or3
TGATA-----TAGTATGCTAGATAGATAG
ADD REPLY
2
Entering edit mode

Did you look to see if c.fasta was made? In first case that is all you should see on screen. That would be the default behavior of the program when things work.

ADD REPLY
0
Entering edit mode

Sorry for my misinterpretation. @shenwei356 suggestion works well.

ADD REPLY

Login before adding your answer.

Traffic: 1908 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6