Question

Obtain genomic coordinates of all occurrences of a small sequence (i.e: TCTTCTC) in a reference genome

0

Entering edit mode

17 months ago

gspirito ▴ 10

Hello everyone, here's my question:

I would like to get all the genomic coordinates relative to a very small sequence (TCTTCTC) in a reference genome. I am aware that this would result in around 200,000 coordinates or so.

I tried with blastn and with an alignent with bowtie/bwa, however the small size of the sequence does not allow for the result I need.

I tried with this biopython code:

from Bio import SeqIO

def find_coordinates(query_sequence, fasta_file):
    with open(fasta_file, "r") as file:
        for record in SeqIO.parse(file, "fasta"):
            sequence = str(record.seq)
            if query_sequence in sequence:
                start = sequence.find(query_sequence) + 1
                end = start + len(query_sequence) - 1
                print(f"{record.id}\t{start}\t{end}")


# Provide the query sequence and FASTA file name
query_sequence = "TCTTCTC"
fasta_file = "hg38.fasta"

find_coordinates(query_sequence, fasta_file)

However, I get only one coordinate for each contig.

I am very new to biopython, is there a way to obtain all coordinates relative to TCTTCTC with a similar code?

Thanks in advance for the answers,

Giovanni

biopython binding-site genomics • 1.2k views

ADD COMMENT • link updated 17 months ago by LauferVA 4.5k • written 17 months ago by gspirito ▴ 10

1

Entering edit mode

This isn't a biopython-specific issue. The string .find() function in python returns the first occurrence. There is no string-class function to find all occurrences of a substring; however the regular expression module re has a finditer to iterate over matches

import re
hits = [match.start() for match in re.finditer(query_sequence, sequence)]

will give you the starting index of every occurence of the query sequence.

ADD REPLY • link 17 months ago by LChart 4.6k

score 1 · Answer 1 · 2023-06-22

1

Entering edit mode

17 months ago

Pierre Lindenbaum 164k

https://emboss.sourceforge.net/apps/cvs/emboss/apps/fuzznuc.html

fuzznuc searches for a specified PROSITE-style pattern in nucleotide sequences. Such patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence. One or more nucleotide sequences are read from file. The output is a standard EMBOSS report file that includes data such as location and score of any matches.