Question

Join multiple sequence lines into one.

6.5 years ago

MB ▴ 50

I have a file seq.txt which consists of multiple aligned sequences:

B_phora_cucurbitarum    -----------------------------------------------------MSHIKRD
E_aceosorus_bombacis    RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDV
A_nfragosa_RCEF_1005    ---------------------------------------------------MKYSILHLA
X_crodochium_bolleyi    -------------------------------------------MRLSNIAGQLAVGAACL
R_cillium_camemberti    --------------------------------------MRILTTGLLLWLLSLINLVSAF
[Bin                                                                               ]

B_phora_cucurbitarum    LSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
E_aceosorus_bombacis    RAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
A_nfragosa_RCEF_1005    ---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
X_crodochium_bolleyi    NDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL
R_cillium_camemberti    -------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP
[Bin                                                                               ]

I want to combine them in a single line according to their names like this:

>B_phora_cucurbitarum   -----------------------------------------------------MSHIKRDLSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
>E_aceosorus_bombacis   RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDVRAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
>A_nfragosa_RCEF_1005   ---------------------------------------------------MKYSILHLA---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
>X_crodochium_bolleyi   -------------------------------------------MRLSNIAGQLAVGAACLNDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL
>R_cillium_camemberti   --------------------------------------MRILTTGLLLWLLSLINLVSAF-------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP

Awk or sed commands are preferred, but any help would be appreciated. Thanks!

alignment sequence awk sed • 2.1k views

ADD COMMENT • link updated 6.5 years ago by cpad0112 21k • written 6.5 years ago by MB ▴ 50

Did you try anything yourself, starting from forum examples close to your question?

https://stackoverflow.com/questions/18092469/combining-columns-within-a-single-file-using-awk

https://unix.stackexchange.com/questions/224135/merging-columns-in-a-file-using-awk

https://www.unix.com/shell-programming-and-scripting/267681-merge-columns-two-files-using-awk.html

https://stackoverflow.com/questions/7068314/awk-combining-multiple-lines-conditionally

https://www.unix.com/shell-programming-and-scripting/208027-merge-multiple-lines-same-file-common-key-using-awk.html

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.0k

Yes, I tried them, but they don't fit my case.

ADD REPLY • link 6.5 years ago by MB ▴ 50

score 2 · Accepted Answer · 2018-06-04

6.5 years ago

Bastien Hervé 6.0k

As I'm a goat at awk, I'll leave you a Python solution:

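### Dictionary mapping each sequence name to its concatenated sequence
merged_dict = {}
### Open your sequence table
with open("seq_merge.txt", "r") as f:
    for line in f:
        ### Split on any whitespace (tabs or spaces); this also drops "\r" and "\n"
        fields = line.split()
        ### Skip blank lines and the "[Bin ... ]" separator lines
        if len(fields) != 2 or fields[0].startswith("["):
            continue
        id_seq, seq = fields
        ### Create the entry the first time a name is seen, otherwise append to it
        if id_seq not in merged_dict:
            merged_dict[id_seq] = seq
        else:
            merged_dict[id_seq] += seq
### Write to a new file ("w" so that re-running does not append duplicates)
with open("new_seq_merge.txt", "w") as new_seq_merge:
    ### items() works in Python 2 and 3; iteritems() exists only in Python 2
    for key, value in merged_dict.items():
        new_seq_merge.write(">" + key + "\t" + value + "\n")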

ADD COMMENT • link 6.5 years ago by Bastien Hervé 6.0k

It is giving an error:

Traceback (most recent call last):
  File "seq.py", line 8, in <module>
    seq = line.rstrip().split("\t")[1]
IndexError: list index out of range

ADD REPLY • link 6.5 years ago by MB ▴ 50

What is the delimiter in this line: B_phora_cucurbitarum -----------------------------------------------------MSHIKRD? A tab? Four spaces?

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.0k

It is a tab.

ADD REPLY • link 6.5 years ago by MB ▴ 50

It's hard to investigate without your file.

Try printing line.split("\t") just before the line id_seq = line.rstrip().split("\t")[0]
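
A related check, sketched here in Python 3, is to print the repr() of the first few raw lines: tabs show up as \t, runs of spaces stay visible, and Windows or old Mac line endings appear as \r. The helper name show_raw_lines and the toy file written in the demo are only illustrative assumptions, not part of the original answer.

```python
import os
import tempfile


def show_raw_lines(path, limit=5):
    """Return the repr() of the first `limit` lines of a text file."""
    shown = []
    # newline="" keeps "\r\n" and "\r" untranslated, so they stay visible in the repr
    with open(path, "r", newline="") as handle:
        for number, line in enumerate(handle):
            if number >= limit:
                break
            shown.append(repr(line))
    return shown


if __name__ == "__main__":
    # Toy input (made up): a tab-separated line with a Windows ending,
    # followed by a space-separated line with a Unix ending
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.txt")
        with open(demo, "wb") as handle:
            handle.write(b"name_1\t--MSH\r\nname_2    --MKY\n")
        for text in show_raw_lines(demo):
            print(text)
```

On a real file, calling show_raw_lines("seq.txt") and printing the result should make it clear whether the separator is a tab and whether the line endings are Unix-style.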

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.0k

Do these lines really exist in your file?

[Bin                                                                               ]

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.0k

Yes, they do. I found that it was a file-format problem: after converting the file to Unix format, it worked fine. Thanks a lot!
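
The thread does not say which tool was used for the conversion; dos2unix is one common choice. For anyone who wants to stay in Python, here is a minimal sketch of the same idea, assuming the file is small enough to read into memory. The function name to_unix_newlines and the file names in the demo are hypothetical.

```python
import os
import tempfile


def to_unix_newlines(src_path, dst_path):
    """Copy src_path to dst_path, rewriting CRLF and lone CR line endings as LF."""
    with open(src_path, "rb") as src:
        data = src.read()
    # Replace Windows endings first so that "\r\n" does not turn into "\n\n"
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst_path, "wb") as dst:
        dst.write(data)


if __name__ == "__main__":
    # Toy input (made up) with Windows line endings
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "seq_dos.txt")
        dst = os.path.join(tmp, "seq_unix.txt")
        with open(src, "wb") as handle:
            handle.write(b"name_1\t--MSH\r\nname_2\t--MKY\r\n")
        to_unix_newlines(src, dst)
        with open(dst, "rb") as handle:
            print(repr(handle.read()))
```

Working in binary mode avoids any implicit newline translation by Python, so the only change made to the file is the one written in the replace calls.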

ADD REPLY • link 6.5 years ago by MB ▴ 50

goat= greatest of all time...

ADD REPLY • link 6.5 years ago by cpad0112 21k

I was thinking about the animal

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.0k

probably, typing before lunch?

ADD REPLY • link 6.5 years ago by cpad0112 21k

Nah, it's an expression from my country meaning that I'm very bad at writing awk.

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.0k

score 2 · Accepted Answer · 2018-06-05

6.5 years ago

5heikki 11k

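mkdir whatever
cp inputfile whatever
cd whatever
awk 'BEGIN{FS="\t";ORS=""} NF>1 && !/^\[/{print $2 >> ($1)}' inputfile
rm inputfile
for f in *; do awk -v N="$f" 'BEGIN{OFS="\t"}{print N,$0}' "$f"; done > ../newFile

The NF>1 && !/^\[/ condition skips the blank lines and the [Bin lines, so no stray file is created for them. The result is written to ../newFile so that the * in the for loop does not pick up the output file itself.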

ADD COMMENT • link 6.5 years ago by 5heikki 11k

score 1 · Accepted Answer · 2018-06-04

6.5 years ago

kloetzl ★ 1.1k

awk '{sub(/\r$/,"")} /\[/{c=0;next} NF{c++;n[c]=$1;s[c]=s[c] $2} END{for(i=1;i in n;i++)print n[i],s[i]}' seq.txt

ADD COMMENT • link 6.5 years ago by kloetzl ★ 1.1k

Thanks but it's not working, it is just printing ---------------------------------------- in each line with some characters in between.

ADD REPLY • link 6.5 years ago by MB ▴ 50

score 1 · Accepted Answer · 2018-06-05

$ sed -n '/^$/d;/Bin/!p' test.txt| sed -e 's/\s\+/\t/g'  | sort -s -k 1,1 | datamash -g1 collapse 2 | sed 's/,//g'|  awk '{print ">"$1"\n"$2}'

>A_nfragosa_RCEF_1005
---------------------------------------------------MKYSILHLA---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
>B_phora_cucurbitarum
-----------------------------------------------------MSHIKRDLSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
>E_aceosorus_bombacis
RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDVRAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
>R_cillium_camemberti
--------------------------------------MRILTTGLLLWLLSLINLVSAF-------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP
>X_crodochium_bolleyi
-------------------------------------------MRLSNIAGQLAVGAACLNDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL

Install datamash either from here or from your distribution's repositories (on Debian-based systems: sudo apt install datamash -y; with conda: conda install datamash -y).
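
For readers who cannot install datamash, the stages of the pipeline above (drop blank and [Bin lines, group by name, concatenate the chunks, print a header line followed by the sequence) can be sketched in Python 3. This is only an illustration under the same assumptions about the input layout. The function names merge_blocks and to_fasta and the toy input in the demo are made up.

```python
from collections import defaultdict


def merge_blocks(lines):
    """Concatenate the sequence chunks of each name across alignment blocks."""
    merged = defaultdict(str)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the "[Bin ... ]" separator lines
        if len(fields) != 2 or fields[0].startswith("["):
            continue
        name, chunk = fields
        merged[name] += chunk
    return dict(merged)


def to_fasta(merged):
    """Format merged sequences as FASTA-style text, sorted by name."""
    return "".join(">{}\n{}\n".format(name, merged[name]) for name in sorted(merged))


if __name__ == "__main__":
    # Toy input (made up), two blocks separated by a [Bin line and a blank line
    sample = [
        "seqA\t--MK\n",
        "seqB\tAQAP\n",
        "[Bin        ]\n",
        "\n",
        "seqA\tLSRI\n",
        "seqB\tND--\n",
    ]
    print(to_fasta(merge_blocks(sample)), end="")
```

To run it on a real file, pass an open file object to merge_blocks, for example merge_blocks(open("test.txt")), since iterating over a file yields its lines.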