Hi, My protein of interest is a multiple domain protein, and I would like to prepare a "structure based multiple sequence alignment" of it (around 1000 sequences).
Since the available structures of my protein are in different conformational states, I had to make multiple sequence alignments for EACH DOMAIN SEPARATELY. Now I have 6 high quality alignments, corresponding to the six domains of my protein.
The question is that: How can I merge/attach/join these files together? Notes:
- I am new to programming!
- All files are in Fasta format (also i have them in Clustal format)
- The length of ALL sequences in each file is the same (but the length of sequences in different files are different. For example in File 1, the length of all sequences is 120, and in file 2 all of them are 78)
- Sequences are arranged base on the "ID" in these files (i.e. the sequence IDs in the files are the same & are in the same order)
- I've used cat command, but it just added sequences at the end of file!
- Example files are presented below (I have shown only the first 4 sequences. there are around 1000 sequences). they are not real sequences.
I appreciate your help regarding to this matter, and please let me know if you need any more information.
Cheers, Ali
#
#
File 1:
>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII
File 2:
>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GWIAKKI-AAAA
File 3:
>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GGYYYKLIAKKI-AAAA
File 4:
>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--------HHWSGYYYKLIAKKI-AAAA--D
File 5:
>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
AEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
EEE
File 6:
>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
A
So the final file should look like this: Result file:
>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DAEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
show us a few lines of your fasta files please...
Updated my question. please let me know if you need any more information.
That looks like CLUSTAL format, or something, but is not at all fasta.
It's called FASTA multiple alignment format; similar to FASTA but gap characters ("-") are allowed - http://www.bioperl.org/wiki/FASTA_multiple_alignment_format.
do you want to merge the sequences together or files? basically, it is the same ID?
I didn't get your question but if you are trying to join files side by side instead of end to end, you can try "join" in UNIX. Check here: http://www.albany.edu/~ig4895/join.htm
Does all of your 1000 sequences have their name ending with "SecA"?