Question

Python -- Compare Fasta Sequences Using 'Sliding Window' Approach

0

Entering edit mode

10.9 years ago

st.ph.n ★ 2.7k

Greetings. I am trying to compare two FASTA files (one with one record, and another with multiple), using a sliding window approach. The second FASTA is very large (88gb), the first's sequence is 14kb.

Here's an example of what I want to do in python:

first_fasta

> 1
A C T T

second_fasta

>2
C C A C A C T T C T

Using a sliding window approach, which would depend on the length of the first.

window 1 = A C T = 2, 5, 3 = 11
window 2 = AC, CT, TT = 2, 2, 1 = 5
window 3 = ACT, CTT = 1, 1 = 2
window 4 = ACTT = 1

The end results I will plot the sums of the window sizes, in a histogram, with frequency on the y axis and window size on the x. The end result hopefully showing a single count for the last window size, indicating the full first sequence is contained in the second file.

I've created two small test files, but can't seem to get it right. I am new to python and appreciate any help. Is the best approach to modify Biopython's nucleic acid dot plot procedure? (http://biopython.org/DIST/docs/tutorial/Tutorial.html) (section 18.2.3)

python fasta biopython • 4.6k views

ADD COMMENT • link updated 10.9 years ago by Peter 6.0k • written 10.9 years ago by st.ph.n ★ 2.7k

1

Entering edit mode

what is the exact output you expect for the sequences you posted?

ADD REPLY • link 10.9 years ago by brentp 24k

0

Entering edit mode

For the sequences posted, I would need the output to be the sums of the counts for each window, which I can use for the histogram. However the second file would be a large multi seq. fasta file. I've been attempting to remove all the headers and treat the entire file as a string, but it is too large.

ADD REPLY • link 10.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Seriously? The second file is an 88 gigabyte FASTA file? Wow. Are you sure this is a sensible thing to be searching with a sliding window like this?

ADD REPLY • link 10.9 years ago by Peter 6.0k

0

Entering edit mode

Do you have an alternative suggestion, Peter? I need to know what occurences of the first 14kb sequence occur in the second file, and to what length of the sequence and frequency. I thought a sliding window approach as explained above would give the most desired and accurate results.

ADD REPLY • link 10.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

My appologies, the FASTA file is 8gb.

ADD REPLY • link 10.9 years ago by st.ph.n ★ 2.7k

score 0 · Answer 1 · 2014-02-16

The nucleotide dot plot example in the Biopython Tutorial was never intended for such massive datasets - it is relatively slow, and was intended as an illustrative example.

Have a look online for dedicated tools for this, perhaps something from the mummer suite? http://mummer.sourceforge.net/examples/