Question

Get genome position and sequence with extra bases of a DNA sequence (python)

5.7 years ago

PaSua • 0

Hi everyone.

I have a genome (NC_007779.1) and a sequence (AGAAGTGCCAGACT) that occurs in it. I am developing a tool for aligning binding sites, so I would like to get the start and end positions of the sequence in this genome, and also retrieve the sequence together with extra flanking bases at both the start and the end. I think Biopython would be the best tool for this, but I am not very familiar with the package and don't know how to approach the problem with it. Any suggestions?

Thank you in advance.

python biopython dna sequence position • 2.3k views

ADD COMMENT • link updated 5.7 years ago by Joe 21k • written 5.7 years ago by PaSua • 0

Have you had a look at any of the biopython tutorials?

ADD REPLY • link 5.7 years ago by Devon Ryan 104k

Yes, mainly the official one: http://biopython.org/DIST/docs/tutorial/Tutorial.html

ADD REPLY • link 5.7 years ago by PaSua • 0

Can you clarify what you mean by "also get the sequence with extra bases"?

Biopython will work, but it's unlikely to be very fast, especially if your genome is large. String matching in particular can take a good while.
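For illustration, here is a minimal sketch of how the search could look in Python, assuming "extra bases" means a fixed number of flanking bases on each side and that only exact matches are wanted. The function names (`reverse_complement`, `find_with_flanks`, `load_genome`) and the `flank` parameter are made up for this example; Biopython is only needed to read the FASTA file, and the search itself is plain string matching on both strands.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a plain A/C/G/T string (other characters are left unchanged)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_with_flanks(genome: str, motif: str, flank: int = 10):
    """Yield (start, end, strand, context) for every exact match.

    start/end are 1-based inclusive forward-strand coordinates of the motif
    itself; context is the motif plus up to `flank` bases on each side,
    reported in the orientation of the strand that matched.
    """
    genome = genome.upper()
    motif = motif.upper()
    queries = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # avoid reporting palindromic motifs twice
        queries.append(("-", rc))
    for strand, query in queries:
        pos = genome.find(query)
        while pos != -1:
            left = max(0, pos - flank)  # clip at the sequence ends
            right = min(len(genome), pos + len(query) + flank)
            context = genome[left:right]
            if strand == "-":
                context = reverse_complement(context)
            yield pos + 1, pos + len(query), strand, context
            pos = genome.find(query, pos + 1)  # step by 1 so overlapping hits are kept


def load_genome(fasta_path: str) -> str:
    """Read a single-record FASTA file with Biopython (must be installed)."""
    from Bio import SeqIO
    return str(SeqIO.read(fasta_path, "fasta").seq)
```

Usage would be something like `for hit in find_with_flanks(load_genome("sequence.fasta"), "AGAAGTGCCAGACT"): print(hit)`, where `sequence.fasta` is assumed to hold the NC_007779.1 record. This treats the chromosome as linear, so a match spanning the origin of a circular genome would be missed.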

ADD REPLY • link 5.7 years ago by Joe 21k

score 1 · Answer 1 · 2019-03-14

A simple, flexible, and fast solution would be to use fuzznuc from the EMBOSS tools.

E.g. to get the sequence and 10 surrounding bases on either side:

$ fuzznuc -sequence sequence.fasta -pattern "n(10)AGAAGTGCCAGACTn(10)" -auto

$ cat nc_007779.fuzznuc
########################################
# Program: fuzznuc
# Rundate: Thu 14 Mar 2019 08:59:12
# Commandline: fuzznuc
#    -sequence sequence.fasta
#    -pattern n(10)AGAAGTGCCAGACTn(10)
#    -auto
# Report_format: seqtable
# Report_file: nc_007779.fuzznuc
########################################

#=======================================
#
# Sequence: NC_007779.1     from: 1   to: 4646332
# HitCount: 1
#
# Pattern_name Mismatch Pattern
# pattern             0 n(10)AGAAGTGCCAGACTn(10)
#
# Complement: No
#
#=======================================

  Start     End  Strand Pattern                          Mismatch Sequence
3800290 3800323       + pattern:n(10)AGAAGTGCCAGACTn(10)        . GTGGTCAGTAAGAAGTGCCAGACTTTATATTCCA

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 4646332
# Reported_sequences: 1
# Reported_hitcount: 1
#---------------------------------------
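Note that the Start and End columns in the report cover the whole pattern, including the ten flanking n's on each side, so the coordinates of the motif itself have to be derived from them. A small sketch of that arithmetic in Python, using the values from the hit above (the helper name `strip_flanks` is made up for this example, and it assumes a forward-strand hit with equal flanks on both sides):

```python
FLANK = 10  # number of n's on each side of the motif in the fuzznuc pattern

# Values copied from the fuzznuc report line above.
hit_start, hit_end = 3800290, 3800323
hit_sequence = "GTGGTCAGTAAGAAGTGCCAGACTTTATATTCCA"


def strip_flanks(start: int, end: int, sequence: str, flank: int):
    """Convert a flanked hit into the coordinates and sequence of the motif alone."""
    return start + flank, end - flank, sequence[flank:len(sequence) - flank]


motif_start, motif_end, motif = strip_flanks(hit_start, hit_end, hit_sequence, FLANK)
print(motif_start, motif_end, motif)  # 3800300 3800313 AGAAGTGCCAGACT
```

Because the pattern demands exactly ten bases on each side, an occurrence closer than ten bases to either end of the sequence would not be reported by this pattern.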

score 0 · Answer 2 · 2019-03-13

5.7 years ago

Buffo ★ 2.4k

You can use finditer (from the re module) in Python; there is a good example of how to use it here
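As a rough sketch of the finditer approach, assuming the genome has already been read into a plain Python string: the function name `find_motif` and the `flank` parameter are invented for this example. The motif is wrapped in a lookahead so that overlapping occurrences are also found, and only the forward strand is searched.

```python
import re


def find_motif(genome: str, motif: str, flank: int = 10):
    """Return a list of (start, end, context) for every forward-strand match.

    start/end are 1-based inclusive positions of the motif; context is the
    motif plus up to `flank` bases on each side (clipped at the sequence ends).
    """
    # The lookahead consumes no characters, so overlapping matches are reported too.
    pattern = re.compile(f"(?=({re.escape(motif)}))", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(genome):
        start0, end0 = match.start(1), match.end(1)  # 0-based, end-exclusive
        left = max(0, start0 - flank)
        right = min(len(genome), end0 + flank)
        hits.append((start0 + 1, end0, genome[left:right]))
    return hits


if __name__ == "__main__":
    toy_genome = "CCCGTGGTCAGTAAGAAGTGCCAGACTTTATATTCCACCC"
    for start, end, context in find_motif(toy_genome, "AGAAGTGCCAGACT"):
        print(start, end, context)
```

To cover the reverse strand as well, the same function could be called a second time with the reverse complement of the motif.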
