Get genome position and sequence with extra bases of a DNA sequence (python)
5.7 years ago
PaSua • 0

Hi everyone.

So I have one genome (NC_007779.1) and a sequence (AGAAGTGCCAGACT) that belongs to it. Since I am trying to develop a tool for aligning binding sites, I would like to get the start and end positions of the sequence in this genome, and also get the sequence with extra flanking bases at both the start and the end. I think the best tool for this would be Biopython, but since I am not very familiar with the package, I don't know how to approach this problem with it. Any suggestions?

Thank you in advance.

python biopython dna sequence position • 2.3k views

Have you had a look at any of the Biopython tutorials?

Can you clarify what you mean by "also get the sequence with extra bases"?

Biopython will work, but it's unlikely to be very fast, especially if your genome is large. String matching in particular can take a good while.

5.7 years ago
Joe 21k

A simple, flexible, and fast solution would be to use fuzznuc from the EMBOSS tools.

E.g. to get the sequence and 10 surrounding bases on either side:

$ fuzznuc -sequence sequence.fasta -pattern "n(10)AGAAGTGCCAGACTn(10)" -auto

$ cat nc_007779.fuzznuc
########################################
# Program: fuzznuc
# Rundate: Thu 14 Mar 2019 08:59:12
# Commandline: fuzznuc
#    -sequence sequence.fasta
#    -pattern n(10)AGAAGTGCCAGACTn(10)
#    -auto
# Report_format: seqtable
# Report_file: nc_007779.fuzznuc
########################################

#=======================================
#
# Sequence: NC_007779.1     from: 1   to: 4646332
# HitCount: 1
#
# Pattern_name Mismatch Pattern
# pattern             0 n(10)AGAAGTGCCAGACTn(10)
#
# Complement: No
#
#=======================================

  Start     End  Strand Pattern                          Mismatch Sequence
3800290 3800323       + pattern:n(10)AGAAGTGCCAGACTn(10)        . GTGGTCAGTAAGAAGTGCCAGACTTTATATTCCA

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 4646332
# Reported_sequences: 1
# Reported_hitcount: 1
#---------------------------------------
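
Note that the Start and End columns in the report cover the whole pattern, flanks included (3800323 - 3800290 + 1 = 34 = 10 + 14 + 10). If the coordinates of the binding site itself are needed, the flanks have to be subtracted again. A minimal sketch of that arithmetic, assuming a forward-strand hit with 1-based inclusive coordinates as in the report above; the function name `core_coordinates` is made up for illustration and is not part of EMBOSS:

```python
def core_coordinates(hit_start, hit_end, hit_seq, flank=10):
    """Strip `flank` bases from both ends of a reported hit.

    hit_start and hit_end are 1-based inclusive coordinates of the
    whole hit (flanks included), as printed in the Start/End columns.
    Assumes a forward-strand ('+') hit.
    """
    core_start = hit_start + flank
    core_end = hit_end - flank
    core_seq = hit_seq[flank:len(hit_seq) - flank]
    return core_start, core_end, core_seq


# Values copied from the report line above
core = core_coordinates(3800290, 3800323, "GTGGTCAGTAAGAAGTGCCAGACTTTATATTCCA")
print(core)
```

For the hit above this places the site itself at 3800300 to 3800313.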
5.7 years ago
Buffo ★ 2.4k

You can use finditer (from the re module) in Python; a good example of how to use it is here
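
A rough sketch of what that could look like, using only the standard library. It assumes the genome has already been read into a single uppercase string (reading the FASTA file is left out), searches the forward strand only, and the function name `find_with_flanks` and the toy genome are made up for illustration:

```python
import re


def find_with_flanks(genome, motif, flank=10):
    """Yield (start, end, context) for every occurrence of motif.

    start and end are 1-based inclusive positions of the motif itself;
    context is the motif plus up to `flank` bases on each side
    (fewer if the hit is close to either end of the genome).
    """
    # A lookahead pattern makes finditer report overlapping hits too
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    for match in pattern.finditer(genome):
        start0 = match.start()           # 0-based start of the motif
        end0 = start0 + len(motif)       # 0-based, exclusive end
        left = max(0, start0 - flank)
        right = min(len(genome), end0 + flank)
        yield start0 + 1, end0, genome[left:right]


# Toy stand-in for the real genome string
genome = "TTTTGTGGTCAGTAAGAAGTGCCAGACTTTATATTCCAGGGG"
hits = list(find_with_flanks(genome, "AGAAGTGCCAGACT", flank=10))
for start, end, context in hits:
    print(start, end, context)
```

To cover the reverse strand as well, the same function could be run on the reverse complement of the motif, with the strand recorded alongside each hit.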
