Obtain genomic coordinates of all occurrences of a small sequence (i.e: TCTTCTC) in a reference genome
1
0
Entering edit mode
17 months ago
gspirito ▴ 10

Hello everyone, here's my question:

I would like to get all the genomic coordinates relative to a very small sequence (TCTTCTC) in a reference genome. I am aware that this would result in around 200,000 coordinates or so.

I tried with blastn and with an alignent with bowtie/bwa, however the small size of the sequence does not allow for the result I need.

I tried with this biopython code:

from Bio import SeqIO

def find_coordinates(query_sequence, fasta_file):
    with open(fasta_file, "r") as file:
        for record in SeqIO.parse(file, "fasta"):
            sequence = str(record.seq)
            if query_sequence in sequence:
                start = sequence.find(query_sequence) + 1
                end = start + len(query_sequence) - 1
                print(f"{record.id}\t{start}\t{end}")


# Provide the query sequence and FASTA file name
query_sequence = "TCTTCTC"
fasta_file = "hg38.fasta"

find_coordinates(query_sequence, fasta_file)

However, I get only one coordinate for each contig.

I am very new to biopython, is there a way to obtain all coordinates relative to TCTTCTC with a similar code?

Thanks in advance for the answers,

Giovanni

biopython binding-site genomics • 1.2k views
ADD COMMENT
1
Entering edit mode

This isn't a biopython-specific issue. The string .find() function in python returns the first occurrence. There is no string-class function to find all occurrences of a substring; however the regular expression module re has a finditer to iterate over matches

import re
hits = [match.start() for match in re.finditer(query_sequence, sequence)]

will give you the starting index of every occurence of the query sequence.

ADD REPLY
1
Entering edit mode
17 months ago

https://emboss.sourceforge.net/apps/cvs/emboss/apps/fuzznuc.html

fuzznuc searches for a specified PROSITE-style pattern in nucleotide sequences. Such patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence. One or more nucleotide sequences are read from file. The output is a standard EMBOSS report file that includes data such as location and score of any matches.

ADD COMMENT
0
Entering edit mode

also, use T2T-CHM13v2+ for this.

ADD REPLY

Login before adding your answer.

Traffic: 1064 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6