Hi,
I have sequences in two text files. One of them contains sequences in a number of contigs and the other have positions for ORFs in same contigs. I need to compare those two checking if sequences are within ORFs. I made two dictionaries from both files with contig as a key. Then I wanted to compare positions of my sequences and my ORFs to check if sequence is within an orf. And I'm kind of stuck and not sure how to tackle it right now. Thanks in advance for any help!
I'm using Python 3
Part of my seq.txt (contig, query id, query start, query end, subject start, subject end - I concentrate on contig as a key, query id because I need those for the output, and subject start and end becuase this is the thing I am comparing with orfs positions ) :
NW_005871048.1 gi|63108247|dbj|BAD98245.1| 347 549 24077230 24077775
NW_005871048.1 gi|334705|gb|AAA47712.1| 327 504 24077224 24077706
NW_005871048.1 gi|288549222|gb|ADC52789.1| 320 479 17303856 17304323
NW_005871048.1 gi|288549222|gb|ADC52789.1| 336 500 24077257 24077706
NW_005871049.1 gi|8272468|gb|AAF74215.1| 267 412 17303904 17304350
NW_005871049.1 gi|21070180|gb|AAM34209.1| 478 622 5615586 5616002
NW_005871049.1 gi|3309126|gb|AAC31806.1| 531 661 5615616 5615993
NW_005871048.1 gi|55979255|gb|AAV68489.1| 470 600 5615616 5615993
NW_005871050.1 gi|334705|gb|AAA47712.1| 443 561 5615637 5615993
Part of my orf.txt (contig - as a key, start, end, seq len, frame, seq start, seq end -I need only last two values for my comparison):
NW_005871048.1 NNLQMNLSVLLPMQELHPWPHPRLMHFSPV KRT 223 -1 0 666
NW_005871048.1 GNSFSSSCGVTTGNGCPWLPCHSCSGVRTG VGR 103 1 135 447
NW_005871048.1 ESGGTLSTLPDLSDCLTFQPWDGCMQVQQS ESC 124 1 836 1211
NW_005871048.1 LSDLPALGWLHAGSAESPASSPHHPHNFLL LRK 113 1 877 1219
NW_005871048.1 YRQIYMPRPRRTANTMNNHGNEDDQKENEK HSL 156 1 2204 2675
NW_005871048.1 IFESKEQTEYLVTSLQRGPFQVPLVLKLSK PQL 136 1 3020 3431
NW_005871048.1 QSGSKTKLNHMLAAKIHLNYKDKVKVNGWE IVH 116 1 3923 4274
NW_005871049.1 LANGRCSEGMVNYHVCLLLDRIDDPLYICD LVL 111 -1 5510 5840
NW_005871049.1 VPSNGVKPAGPGQEQRVKAGEAVESKEGGW ISF 171 1 6367 6883
NW_005871049.1 QWSEAGWPGTGAASEGRRGSGEQRGGMGGF FWL 149 1 6375 6825
NW_005871049.1 NTVIGITKCDTDGSGHVKGEGQKGTHGLPF AVC 169 1 6931 7441
NW_005871050.1 IFHPWNIPFPRESALIPQPGPGGAAQAVGT KKF 103 -1 7103 7409
NW_005871050.1 ASAPWWPVRIIAIGHSAVQSMCILAFYYYY EPL 110 -1 8475 8802
NW_005871050.1 ATAAGFPLLLPGAPAGFPLPLPGAVAGFPS ISP 362 -1 9247 10330
my script so far:
vir_seq = open("seq.txt", "r")
orf = open("ORF.txt", "r")
results = open("output.txt", "w")
#files to dictionaries with contig as a key
dict_orf={}
with open('ORF.txt') as file1:
for line in file1:
lineSplit=line.split('\t')
dict_orf.setdefault(lineSplit[0],[]).append((lineSplit[5],lineSplit[6].split('\n')[0]))
dict_virseq = {}
with open('seq.txt') as file2:
for line in file2:
lineSplit = line.split('\t')
dict_virseq.setdefault(lineSplit[0], []).append((lineSplit[1], lineSplit[4], lineSplit[5].split('\n')[0]))
# check if seq is within an ORF : start of the seq smaller than end of orf and end of seq bigger than start of orf
def is_overlap():
for k in dict_virseq.keys():
for v in dict_virseq[k]:
[v[1]>i for i in [orf[0] for orf in dict_orf[k]]]
[v[2]<i for i in [orf[1] for orf in dict_orf[k]]]
return True
else:
return False
Ok, I will try that. Seems easy enough for me;) I don't need to use Python 3. I was just using it before while doing some tutorials and continued with it because of that. I can use Python 2.7 for that too. Thanks for your help!