Remove gap from MSA alignment file
0
0
Entering edit mode
7.0 years ago
skjobs1234 ▴ 40
     import sys
    import re
    t=open(sys.argv[1],'r').read()
    #reading the template sequences
    blocks=re.split(r'>.*.pdb',t)
    #Reading all the sequences
    blocks1=t.split(">")

    count=[]
    #Calculate the start gap from the template
    for fa in blocks[1:]:
        fasta=fa.split('\n')
        for i in range(len(fasta)):
            #modulate the gap size per line here
#but I want to remove one by one character not per line
            if fasta[i]=='-'*60:
                count.append(i)

    #Get the maximum lines of gap from start in all template sequences
    count=[i for i in count if count.count(i)>1]
    j=list(set(count))
    out=[]
    #Removing the lines form query+template sequence file
    for fa1 in blocks1[1:]:
        fasta1=fa1.split('\n',)
        for i in sorted(j, reverse=True): 
            del fasta1[i]
        #Remove comment for commandline output
        #print ">%s\n%s"%(fasta1[0],"\n".join(fasta1[1:]))
        out.append(">%s\n%s"%(fasta1[0],"\n".join(fasta1[1:])))
    #Output file operation
    f = open(sys.argv[1],'w')
    for k in out:
        f.write(k)
    f.close()
    print "DONE\nOutput name: "+sys.argv[1]

This script is removing line by line string. I want to remove (-) character wise from input files. Please guide how to split into string wise rather than line (\n)

alignment sequence software error • 2.2k views
ADD COMMENT
0
Entering edit mode

What is your input file format?

ADD REPLY
0
Entering edit mode
Input file will be fasta sequence  Example:

 >P1_2vt4.pdb
 ---------------------------------------------------------WEA GMSLLMALVVLLIVAGNVLVIAAIGSTQRLQTLTNLFITSLACADLVVGLLVVPFGATLV
 -----------------------------------
 >P1_3kj6.pdb
 ------------------------------------------------------------ GMGIVMSLIVLAIVFGNVLVITAI-AFERLQTVTNYFITSLACADLVMGLAVVPFGA---
 LRRSS------------------------------
 >P1_F7EQ49 LSCLYTIFLFPIGFVGNILILVVNISFREKMTIPDLYFINLAVADLILVA-DSLIEVF-N
 LHEQYYDIAVLCTFMSLFLQVNMYSSVFFLTWMSFDRYIALARAMRCSLFRTKHHARLSC
 QKTNLPALNRFCHAALKAVIPDSTEQSDVRFSSAV

And I want output like this 1) Remove the gap if found continuous gap (-) more than 20 in >.pdb file. 2) Also remove at the same position from without >.pdb file like >P1_F7EQ49.. 3) Removing position should be same in all the files. Output should look like this

> >P1_2vt4.pdb 
>          WEA  GMSLLMALVVLLIVAGNVLVIAAIGSTQRLQTLTNLFITSLACADLVVGLLVVPFGATLV
>          ----------
>      >P1_3kj6.pdb
>     
>          ---  GMGIVMSLIVLAIVFGNVLVITAI-AFERLQTVTNYFITSLACADLVMGLAVVPFGA--- LRRSS
>          >P1_F7EQ49 
>     LSC  LHEQYYDIAVLCTFMSLFLQVNMYSSVFFLTWMSFDRYIALARAMRCSLFRTKHHARLSC QKTNL
ADD REPLY
0
Entering edit mode

Is there any specific reason why gaps are not removed in one go, instead of character wise?

ADD REPLY
0
Entering edit mode

Your formatting is still very janky, so I’m not sure I understand yet. The first point is straightforward, remove any gaps >20 characters in length. I don’t really understand the last 2 requests. You want to leave stretches of gaps within the sequence intact? So just remove - characters from the ends of the sequence?

Why exactly do you want to do this? You’ll be ruining the alignment..

ADD REPLY

Login before adding your answer.

Traffic: 2184 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6