Hello guys, I'm a beginner in bioinformatics and I have the following problem:
I have a input file like this:
Sequence_1,RT_RH_A17
Sequence_1,RH_A17
Sequence_1,INT_PHD
Sequence_1,PR_Ap2
Sequence_1,Gag
Sequence_1,RH_A17
Sequence_1,INT_rve
Sequence_2,INT_rve
Sequence_3,INT_rve
Sequence_4,INT_rve
Sequence_5,INT_rve
Sequence_6,RT_like
Sequence_6,RH_A17
It represents an annotation of transposon domain per sequence, as we can see, have several lines for the same sequence (e.g. sequence_1). I want to create an output file like this:
Sequence_1,1,1,1,1,1
Sequence_2,0,0,0,0,1
Sequence_3,0,0,0,0,1
Sequence_4,0,0,0,0,1
Sequence_5,0,0,0,0,1
Sequence_6,0,0,1,1,0
The output represents a model to use in iTol (a tool to annotate phylogenetic trees). So I want the order of annotation as Gag,PR,RT,RH,INT, sequences that have one of those domains receive '1' in domain position or '0' if doesn't have.
I wrote a python script to try resolve this:
# open the table
domains=open('bel_domains.csv','r')
# create a output file
output = open('output.txt','w+')
# function to detect domains in lines
def found_domain(a,b):
if a in b:
output.write('1'+',')
else:
output.write('0'+',')
for line in domains:
line_split = line.split(',')
output.write(line_split[0]+',')
found_domain('Gag',line_split[1])
found_domain('PR',line_split[1])
found_domain('RT',line_split[1])
found_domain('RH',line_split[1])
found_domain('INT',line_split[1])
output.close()
I know that I have a huge problem in logic and maybe a hugest problem in programming skills, because my output its:
Sequence_1,0,0,1,1,0,Sequence_1,0,0,0,1,0,Sequence_1,0,0,0,0,1,Sequence_1,0,1,0,0,0,Sequence_1,1,0,0,0,0,Sequence_1,0,0,0,1,0,Sequence_1,0,0,0,0,1...
Briefly, I need to detect wich lines are equal in the position of sequence name, and read&annotate domains in the following order (Gag,PR,RT,RH,INT), but I have no idea how I can implement this in my script.
Can someone help me?
Thanks!
Thanks cschu181, it's work perfectly!