Hello everyone, here's my question:
I would like to get all the genomic coordinates relative to a very small sequence (TCTTCTC) in a reference genome. I am aware that this would result in around 200,000 coordinates or so.
I tried with blastn and with an alignent with bowtie/bwa, however the small size of the sequence does not allow for the result I need.
I tried with this biopython code:
from Bio import SeqIO
def find_coordinates(query_sequence, fasta_file):
with open(fasta_file, "r") as file:
for record in SeqIO.parse(file, "fasta"):
sequence = str(record.seq)
if query_sequence in sequence:
start = sequence.find(query_sequence) + 1
end = start + len(query_sequence) - 1
print(f"{record.id}\t{start}\t{end}")
# Provide the query sequence and FASTA file name
query_sequence = "TCTTCTC"
fasta_file = "hg38.fasta"
find_coordinates(query_sequence, fasta_file)
However, I get only one coordinate for each contig.
I am very new to biopython, is there a way to obtain all coordinates relative to TCTTCTC with a similar code?
Thanks in advance for the answers,
Giovanni
This isn't a biopython-specific issue. The string
.find()
function in python returns the first occurrence. There is no string-class function to find all occurrences of a substring; however the regular expression modulere
has afinditer
to iterate over matcheswill give you the starting index of every occurence of the query sequence.