Double digest with regular expressions in Python
4.0 years ago
anasjamshed ▴ 140

I have the following DNA sequence in a file named dna.txt:

ATGGCAATAACCCCCCGTTTCTACTTCTAGAGGAGAAAAGTATTGACATGAGCGCTCCCGGCACAAGGGCCAAAGAAGTCTCCAATTTCTTATTTCCGAATGACATGCGTCTCCTTGCGGGTAAATCACCGACCGCAATTCATAGAAGCCTGGGGGAACAGATAGGTCTAATTAGCTTAAGAGAGTAAATCCTGGGATCATTCAGTAGTAACCATAAACTTACGCTGGGGCTTCTTCGGCGGATTTTTACAGTTACCAACCAGGAGATTTGAAGTAAATCAGTTGAGGATTTAGCCGCGCTATCCGGTAATCTCCAAATTAAAACATACCGTTCCATGAAGGCTAGAATTACTTACCGGCCTTTTCCATGCCTGCGCTATACCCCCCCACTCTCCCGCTTATCCGTCCGAGCGGAGGCAGTGCGATCCTCCGTTAAGATATTCTTACGTGTGACGTAGCTATGTATTTTGCAGAGCTGGCGAACGCGTTGAACACTTCACAGATGGTAGGGATTCGGGTAAAGGGCGTATAATTGGGGACTAACATAGGCGTAGACTACGATGGCGCCAACTCAATCGCAGCTCGAGCGCCCTGAATAACGTACTCATCTCAACTCATTCTCGGCAATCTACCGAGCGACTCGATTATCAACGGCTGTCTAGCAGTTCTAATCTTTTGCCAGCATCGTAATAGCCTCCAAGAGATTGATGATAGCTATCGGCACAGAACTGAGACGGCGCCGATGGATAGCGGACTTTCGGTCAACCACAATTCCCCACGGGACAGGTCCTGCGGTGCGCATCACTCTGAATGTACAAGCAACCCAAGTGGGCCGAGCCTGGACTCAGCTGGTTCCTGCGTGAGCTCGAGACTCGGGATGACAGCTCTTTAAACATAGAGCGGGGGCGTCGAACGGTCGAGAAAGTCATAGTACCTCGGGTACCAACTTACTCAGGTTATTGCTTGAAGCTGTACTATTTTAGGGGGGGAGCGCTGAAGGTCTCTTCTTCTCATGACTGAACTCGCGAGGGTCGTGAAGTCGGTTCCTTCAATGGTTAAAAAACAAAGGCTTACTGTGCGCAGAGGAACGCCCATCTAGCGGCTGGCGTCTTGAATGCTCGGTCCCCTTTGTCATTCCGGATTAATCCATTTCCCTCATTCACGAGCTTGCGAAGTCTACATTGGTATATGAATGCGACCTAGAAGAGGGCGCTTAAAATTGGCAGTGGTTGATGCTCTAAACTCCATTTGGTTTACTCGTGCATCACCGCGATAGGCTGACAAAGGTTTAACATTGAATAGCAAGGCACTTCCGGTCTCAATGAACGGCCGGGAAAGGTACGCGCGCGGTATGGGAGGATCAAGGGGCCAATAGAGAGGCTCCTCTCTCACTCGCTAGGAGGCAAATGTAAAACAATGGTTACTGCATCGATACATAAAACATGTCCATCGGTTGCCCAAAGTGTTAAGTGTCTATCACCCCTAGGGCCGTTTCCCGCATATAAACGCCAGGTTGTATCCGCATTTGATGCTACCGTGGATGAGTCTGCGTCGAGCGCGCCGCACGAATGTTGCAATGTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCGTCGCTGTACTAATAGTTGTCGACAGACCGTCGAGATTAGAAAATGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATTGCGGTGTCTTCACTGTGCTGCGGCTACCCATCGCCTGAAATCCAGCTGGTGTCAAGCCATCCCCTCTCCGGGACGCCGCATGTAGTGAAACATATACGTTGCACGGGTTCACCGCGGTCCGTTCTGAGTCGACCAAGGACACAATCGAGCTCCGATCCGTACCCTCGACAAACTTGTACCCGACCCCCGGAGCTTGCCAGCTCCTCGGGTATCATGGAGCCTGTGGTTCATCGCGTCCGATATCAAACTTCGTCATGATAAAGTCCCCCCCTCGGGAGTACCAGAGAAGATGACTACT
GAGTTGTGCGAT

I want to read the DNA sequence from dna.txt and then predict the lengths of the fragments produced by digesting the sequence with these two (made-up) restriction enzymes:

  • AbcI: cutting site "ANT*AAT"
  • AbcII: cutting site "GCRW*TG"

The asterisk marks where each enzyme cuts the DNA.

Can anyone help me solve this?

Regex Resriction-Enzyme Python • 2.8k views

Did you try anything? Biostars is generally not a code-writing service.

import re

# Open the input file and read the whole sequence,
# dropping surrounding whitespace such as the trailing newline
with open("dna.txt") as infile:
    sequence = infile.read().strip()

print(sequence)

After that, I don't know how to proceed.

Did you find this Stack Overflow question: https://stackoverflow.com/questions/43365742/cut-string-within-a-specific-pattern-in-python ? It has at least three answers that I think should be useful to you.

One side note: since your made-up cut sites contain ambiguous bases, you need to handle those (or use a library that does that for you).
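
As a rough illustration of handling the ambiguous bases, the sketch below translates a cut site written with IUPAC ambiguity codes into a regular expression and records where the asterisk sits. The helper name site_to_regex and the IUPAC dictionary are made up for this example and are not taken from any library.

```python
# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_to_regex(site):
    """Hypothetical helper: return (regex, cut_offset) for a site like 'GCRW*TG'."""
    # The asterisk's index equals the number of bases to the left of the cut.
    cut_offset = site.index("*")
    # Translate every base of the site (asterisk removed) into its regex form.
    regex = "".join(IUPAC[base] for base in site.replace("*", ""))
    return regex, cut_offset


print(site_to_regex("ANT*AAT"))  # ('A[ACGT]TAAT', 3)
print(site_to_regex("GCRW*TG"))  # ('GC[AG][AT]TG', 4)
```

The returned offset can be added to the start of each regex match to get the cut position in the sequence.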

Good luck

That is something different.

Like the other user said, those answers should be useful, not applicable as-is. Try adapting them and come back to us if you have difficulties.

3.9 years ago
Dunois ★ 2.8k

Here's something to get you started:

import re

def digdigest(digseq, cutsite_orig):
    # Split the cut site at the asterisk, which marks the cutting position
    cutsite_l, cutsite_r = cutsite_orig.split("*")

    # If there are Ns in the cut site, replace them with the "." regex wildcard
    cutsite_l = cutsite_l.replace("N", ".")
    cutsite_r = cutsite_r.replace("N", ".")

    # Mark each cutting position with "__" between the left and right halves
    digseq_mod = re.sub("(" + cutsite_l + ")(" + cutsite_r + ")", r"\1__\2", digseq)

    # Split at the markers to get the fragments
    print(digseq_mod.split("__"))
    # return digseq_mod.split("__")

Pass digdigest() the sequence and the cutting site, written as in the original post, and you'll get something like this:

digdigest("ATATATATAGTAATGTGTGCATTAATATGC", "ANT*AAT")
#['ATATATATAGT', 'AATGTGTGCATT', 'AATATGC']
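
A possible variant of the same idea, shown only as a sketch: instead of inserting a "__" marker, split directly at the zero-width position between the two halves of the site using a lookbehind and a lookahead. This assumes Python 3.7 or later, where re.split accepts patterns that match an empty string. The function name digest_fragments is hypothetical, and like the function above it only handles N, not other ambiguity codes.

```python
import re


def digest_fragments(seq, site):
    """Hypothetical helper: split seq at the '*' position of every site match."""
    # Turn N into the regex wildcard, then separate the halves around the cut.
    left, right = site.replace("N", ".").split("*")
    # Zero-width pattern: preceded by the left half, followed by the right half.
    pattern = "(?<=" + left + ")(?=" + right + ")"
    return re.split(pattern, seq)


fragments = digest_fragments("ATATATATAGTAATGTGTGCATTAATATGC", "ANT*AAT")
print(fragments)                    # ['ATATATATAGT', 'AATGTGTGCATT', 'AATATGC']
print([len(f) for f in fragments])  # [11, 12, 7]
```

Because the lookarounds consume no characters, overlapping recognition sites are each still found, which a substitution-based approach can miss.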
3.8 years ago
anasjamshed ▴ 140

I have tried this script:

import re

# Read the sequence and strip the trailing newline
with open("dna.txt") as infile:
    dna = infile.read().strip("\n")
# print(len(dna))
all_cuts = [0]

# Find and append the cut positions for AbcI
# ANT*AAT: A, followed by any one base [ATGC], then TAAT
for match in re.finditer(r"A[ATGC]TAAT", dna):
    all_cuts.append(match.start() + 3)  # the match starts at A, so +3 places the cut after ANT
print(all_cuts)

# Add the cut positions for AbcII
for match in re.finditer(r"GC[AG][AT]TG", dna):
    all_cuts.append(match.start() + 4)  # GCRW*TG: the match starts at G, so +4 places the cut after GCRW
print(all_cuts)

# Add the final end position, i.e. the length of the DNA
all_cuts.append(len(dna))
# Sort the cut positions in ascending order so that each fragment length is computed correctly
sorted_cuts = sorted(all_cuts)
print(sorted_cuts)
for i in range(1, len(sorted_cuts)):
    this_cut_position = sorted_cuts[i]
    previous_cut_position = sorted_cuts[i - 1]
    fragment_size = this_cut_position - previous_cut_position
    print("one fragment size is " + str(fragment_size))

It gives me the following output:

[0, 1143, 1628]
[0, 1143, 1628, 488, 1577]
[0, 488, 1143, 1577, 1628, 2012]
one fragment size is 488
one fragment size is 655
one fragment size is 434
one fragment size is 51
one fragment size is 384
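
One quick sanity check on output like this, sketched here with the cut positions copied from the printed list: the fragment sizes are the differences between consecutive sorted cut positions, so they should add up to the sequence length (2012 here).

```python
# Cut positions copied from the output above (0 and the sequence length included).
sorted_cuts = [0, 488, 1143, 1577, 1628, 2012]

# Pair each cut position with the next one and take the difference.
fragment_sizes = [end - start for start, end in zip(sorted_cuts, sorted_cuts[1:])]
print(fragment_sizes)  # [488, 655, 434, 51, 384]

# The fragments must tile the whole sequence, so their sizes sum to its length.
assert sum(fragment_sizes) == sorted_cuts[-1]
```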