Question

Help with iTol annotation python script

0

Entering edit mode

6.3 years ago

heuristic • 0

Hello guys, I'm a beginner in bioinformatics and I have the following problem:

I have a input file like this:

Sequence_1,RT_RH_A17
Sequence_1,RH_A17
Sequence_1,INT_PHD
Sequence_1,PR_Ap2
Sequence_1,Gag
Sequence_1,RH_A17
Sequence_1,INT_rve
Sequence_2,INT_rve
Sequence_3,INT_rve
Sequence_4,INT_rve
Sequence_5,INT_rve
Sequence_6,RT_like
Sequence_6,RH_A17

It represents an annotation of transposon domain per sequence, as we can see, have several lines for the same sequence (e.g. sequence_1). I want to create an output file like this:

Sequence_1,1,1,1,1,1
Sequence_2,0,0,0,0,1
Sequence_3,0,0,0,0,1
Sequence_4,0,0,0,0,1 
Sequence_5,0,0,0,0,1
Sequence_6,0,0,1,1,0

The output represents a model to use in iTol (a tool to annotate phylogenetic trees). So I want the order of annotation as Gag,PR,RT,RH,INT, sequences that have one of those domains receive '1' in domain position or '0' if doesn't have.

I wrote a python script to try resolve this:

# open the table
domains=open('bel_domains.csv','r')
# create a output file
output = open('output.txt','w+')
# function to detect domains in lines
def found_domain(a,b):
    if a in b:
        output.write('1'+',')
    else:
        output.write('0'+',')

for line in domains:
    line_split = line.split(',')
    output.write(line_split[0]+',')
    found_domain('Gag',line_split[1])
    found_domain('PR',line_split[1])
    found_domain('RT',line_split[1])
    found_domain('RH',line_split[1])
    found_domain('INT',line_split[1])


output.close()

I know that I have a huge problem in logic and maybe a hugest problem in programming skills, because my output its:

Sequence_1,0,0,1,1,0,Sequence_1,0,0,0,1,0,Sequence_1,0,0,0,0,1,Sequence_1,0,1,0,0,0,Sequence_1,1,0,0,0,0,Sequence_1,0,0,0,1,0,Sequence_1,0,0,0,0,1...

Briefly, I need to detect wich lines are equal in the position of sequence name, and read&annotate domains in the following order (Gag,PR,RT,RH,INT), but I have no idea how I can implement this in my script.

Can someone help me?

Thanks!

python itol script • 1.8k views

ADD COMMENT • link updated 6.3 years ago by cschu181 ★ 2.8k • written 6.3 years ago by heuristic • 0

score 4 · Accepted Answer · 2019-04-02

4

Entering edit mode

6.3 years ago

cschu181 ★ 2.8k

domain_order = ["Gag", "PR", "RT", "RH", "INT"]

domains = dict()
for line in open("bel_domains.csv"):
    seq, domain_info = line.strip().split(",")
    domains.setdefault(seq, set()).update(domain_info.split("_"))

with open("output.txt", "wt") as out:
    for seq in sorted(domains):
        contains_domains = [int(dom in domains[seq]) for dom in domain_order]
        print(seq, *contains_domains, sep=",", file=out)