Join multiple sequence lines into one
6.5 years ago
MB ▴ 50

I have a file seq.txt which consists of multiple aligned sequences:

B_phora_cucurbitarum    -----------------------------------------------------MSHIKRD
E_aceosorus_bombacis    RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDV
A_nfragosa_RCEF_1005    ---------------------------------------------------MKYSILHLA
X_crodochium_bolleyi    -------------------------------------------MRLSNIAGQLAVGAACL
R_cillium_camemberti    --------------------------------------MRILTTGLLLWLLSLINLVSAF
[Bin                                                                               ]

B_phora_cucurbitarum    LSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
E_aceosorus_bombacis    RAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
A_nfragosa_RCEF_1005    ---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
X_crodochium_bolleyi    NDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL
R_cillium_camemberti    -------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP
[Bin                                                                               ]

I want to join the blocks so that each sequence sits on a single line under its name, like this:

>B_phora_cucurbitarum   -----------------------------------------------------MSHIKRDLSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
>E_aceosorus_bombacis   RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDVRAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
>A_nfragosa_RCEF_1005   ---------------------------------------------------MKYSILHLA---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
>X_crodochium_bolleyi   -------------------------------------------MRLSNIAGQLAVGAACLNDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL
>R_cillium_camemberti   --------------------------------------MRILTTGLLLWLLSLINLVSAF-------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP

Awk or sed commands are preferred, but any help would be appreciated. Thanks!

alignment sequence awk sed • 2.1k views
6.5 years ago

As I'm a goat at awk, here is a Python solution:

# Dictionary that will map each sequence name to its merged sequence
merged_dict = {}
# Open the sequence table
with open("seq_merge.txt", "r") as f:
    for line in f:
        # Split each line into a key (name) and a value (sequence chunk)
        id_seq = line.rstrip().split("\t")[0]
        seq = line.rstrip().split("\t")[1]
        # Create the entry if the key is new, otherwise append to it
        if id_seq not in merged_dict:
            merged_dict[id_seq] = seq
        else:
            merged_dict[id_seq] += seq
# Write the merged sequences to a new file
with open("new_seq_merge.txt", "w") as new_seq_merge:
    # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    for key, value in merged_dict.items():
        new_seq_merge.write(">" + key + "\t" + value + "\n")

It gives an error:

Traceback (most recent call last):
  File "seq.py", line 8, in <module>
    seq = line.rstrip().split("\t")[1]
IndexError: list index out of range

What is the delimiter in this line: B_phora_cucurbitarum -----------------------------------------------------MSHIKRD ? A tab? Four spaces?

It is a tab.

It's hard to investigate without your file.

Try printing line.split("\t") just before the line id_seq = line.rstrip().split("\t")[0]

Do these lines really exist in your file?

[Bin                                                                               ]

Yes, they do. I found it was a file format problem: after converting the file to Unix format, it worked fine. Thanks a lot!
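
The thread does not say exactly what the format problem was. As a minimal Python 3 sketch, assuming the culprits are line endings, blank lines, or the [Bin ...] marker rows, the same merge can be written so that none of them matters. The function name merge_alignment_blocks, the toy input, and the choice to split on any whitespace instead of only tabs are assumptions for illustration, not part of the answer above.

```python
def merge_alignment_blocks(lines):
    """Concatenate sequence chunks by name, keeping first-seen order."""
    merged = {}
    for raw in lines:
        # split() with no argument splits on any whitespace and discards
        # trailing "\n" or "\r\n", so DOS and Unix files behave the same
        fields = raw.split()
        # Skip blank lines, rows without a sequence, and "[Bin ...]" markers
        if len(fields) < 2 or fields[0].startswith("["):
            continue
        name, chunk = fields[0], "".join(fields[1:])
        merged[name] = merged.get(name, "") + chunk
    return merged


# Hypothetical toy input with Windows line endings
example = [
    "seqA\t--MK\r\n",
    "seqB\tAC-T\r\n",
    "[Bin      ]\r\n",
    "\r\n",
    "seqA\tLV--\r\n",
    "seqB\tGG-A\r\n",
]
for name, seq in merge_alignment_blocks(example).items():
    print(">" + name + "\t" + seq)
```

To run it on a real file, pass the open file object instead of the list, for example merge_alignment_blocks(open("seq.txt")).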

goat = greatest of all time...

I was thinking of the animal.

Probably typing before lunch?

Nah, it's an expression from my country meaning that I'm very bad at writing awk.

6.5 years ago
5heikki 11k
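mkdir whatever
cp inputfile whatever
cd whatever
# Append each sequence chunk to a file named after the sequence (blank lines are skipped)
awk 'BEGIN{FS="\t";ORS=""} NF {print $2 >> $1}' inputfile
rm inputfile
# Prefix each per-sequence file with its name; write to the parent directory so the glob does not pick up the output
for f in *; do awk -v N="$f" 'BEGIN{OFS="\t"}{print N,$0}' "$f"; done > ../newFile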

I ignored the [Bin lines here; they end up in a file of their own, which you can delete before the for loop.

6.5 years ago
kloetzl ★ 1.1k
awk 'BEGIN{c=1} /\[/{c=1; next} NF{n[c]=$1; s[c]=s[c] $2; c++} END{for(i=1; (i in n); i++) print n[i], s[i]}' seq.txt

Thanks, but it's not working: it just prints ---------------------------------------- on each line, with some characters in between.
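
For comparison, the idea behind that awk one-liner (key each row by its position inside a block and restart the count at every [Bin ...] marker) can be sketched in Python 3. This is only an illustration under stated assumptions: every block lists the sequences in the same order, and a marker row separates consecutive blocks. The function name merge_by_position and the toy input are hypothetical.

```python
def merge_by_position(lines, marker="["):
    """Merge blocks by row position; the position resets at each marker row."""
    names, seqs = [], []
    pos = 0
    for raw in lines:
        fields = raw.split()  # also discards "\r\n" and "\n"
        if not fields:
            continue  # blank line: must not advance the position
        if fields[0].startswith(marker):
            pos = 0  # end of block: next data row is row 0 again
            continue
        if len(fields) < 2:
            continue  # a name without a sequence chunk
        if pos == len(names):
            # First time this row position is seen: remember its name
            names.append(fields[0])
            seqs.append("")
        seqs[pos] += fields[1]
        pos += 1
    return list(zip(names, seqs))


# Hypothetical toy input: two blocks separated by a marker row
blocks = ["a X1\r\n", "b Y1\r\n", "[Bin ]\r\n", "\r\n", "a X2\r\n", "b Y2\r\n"]
for name, seq in merge_by_position(blocks):
    print(name, seq)
```

Unlike a name-keyed dictionary, this variant never compares names across blocks, so a row that is missing or reordered in one block would silently corrupt the result.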

6.5 years ago
$ sed -n '/^$/d;/Bin/!p' test.txt| sed -e 's/\s\+/\t/g'  | sort -s -k 1,1 | datamash -g1 collapse 2 | sed 's/,//g'|  awk '{print ">"$1"\n"$2}'

>A_nfragosa_RCEF_1005
---------------------------------------------------MKYSILHLA---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
>B_phora_cucurbitarum
-----------------------------------------------------MSHIKRDLSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
>E_aceosorus_bombacis
RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDVRAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
>R_cillium_camemberti
--------------------------------------MRILTTGLLLWLLSLINLVSAF-------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP
>X_crodochium_bolleyi
-------------------------------------------MRLSNIAGQLAVGAACLNDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL

Install datamash either from here or from your distro's repositories (Debian-based: sudo apt install datamash -y; conda: conda install datamash -y).
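
If datamash is not available, the same sort, group, and collapse steps can be approximated in Python 3 with the standard library. This is a rough sketch and not a drop-in replacement for the pipeline above; the function name to_fasta and the toy input are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def to_fasta(lines):
    """Sort rows by name, join each name's chunks, and return FASTA text."""
    rows = []
    for raw in lines:
        fields = raw.split()
        # Drop blank lines and the "[Bin ...]" marker rows
        if len(fields) < 2 or fields[0].startswith("["):
            continue
        rows.append((fields[0], fields[1]))
    # list.sort is stable, so chunks of one name keep their file order
    # (the role "sort -s -k 1,1" plays in the shell pipeline)
    rows.sort(key=itemgetter(0))
    records = []
    for name, group in groupby(rows, key=itemgetter(0)):
        sequence = "".join(chunk for _, chunk in group)
        records.append(">" + name + "\n" + sequence + "\n")
    return "".join(records)


# Hypothetical toy input: two blocks with rows in arbitrary order
print(to_fasta(["b Y1\n", "a X1\n", "[Bin ]\n", "\n", "b Y2\n", "a X2\n"]), end="")
```

As with the datamash version, the records come out sorted by name instead of in their original order.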
