Entering edit mode
7.0 years ago
skjobs1234
▴
40
import sys
import re
t=open(sys.argv[1],'r').read()
#reading the template sequences
blocks=re.split(r'>.*.pdb',t)
#Reading all the sequences
blocks1=t.split(">")
count=[]
#Calculate the start gap from the template
for fa in blocks[1:]:
fasta=fa.split('\n')
for i in range(len(fasta)):
#modulate the gap size per line here
#but I want to remove one by one character not per line
if fasta[i]=='-'*60:
count.append(i)
#Get the maximum lines of gap from start in all template sequences
count=[i for i in count if count.count(i)>1]
j=list(set(count))
out=[]
#Removing the lines form query+template sequence file
for fa1 in blocks1[1:]:
fasta1=fa1.split('\n',)
for i in sorted(j, reverse=True):
del fasta1[i]
#Remove comment for commandline output
#print ">%s\n%s"%(fasta1[0],"\n".join(fasta1[1:]))
out.append(">%s\n%s"%(fasta1[0],"\n".join(fasta1[1:])))
#Output file operation
f = open(sys.argv[1],'w')
for k in out:
f.write(k)
f.close()
print "DONE\nOutput name: "+sys.argv[1]
This script is removing line by line string. I want to remove (-) character wise from input files. Please guide how to split into string wise rather than line (\n)
What is your input file format?
And I want output like this 1) Remove the gap if found continuous gap (-) more than 20 in >.pdb file. 2) Also remove at the same position from without >.pdb file like >P1_F7EQ49.. 3) Removing position should be same in all the files. Output should look like this
Is there any specific reason why gaps are not removed in one go, instead of character wise?
Your formatting is still very janky, so I’m not sure I understand yet. The first point is straightforward, remove any gaps >20 characters in length. I don’t really understand the last 2 requests. You want to leave stretches of gaps within the sequence intact? So just remove
-
characters from the ends of the sequence?Why exactly do you want to do this? You’ll be ruining the alignment..