Question

How to extract short sequence from FASTA file with certain step size?

0

Entering edit mode

9.8 years ago

allyson1115ar ▴ 30

The code below extract short sequence in every sequence with the window size 100. The window will shift by step size one and extract the sequence. I would like to extract the short sequence with every step size 50. Can anyone help me?

from Bio import SeqIO

with open("B.fasta","w") as f:
    for seq_record in SeqIO.parse("A.fasta", "fasta"):
        for I in range(len(seq_record.seq) - 99) :
            f.write(str(">"+seq_record.id) + "\n")
            f.write(str(seq_record.seq[i:i+100]) + "\n")

Example of fasta file:

>hg17_ct_ER_ER_142    CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGGG

Example output:

>hg17_ct_ER_ER_142    CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGG
>hg17_ct_ER_ER_142    TAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGG
>hg17_ct_ER_ER_142    AAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGGG

Expected output:

>hg17_ct_ER_ER_142
CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACA
>hg17_ct_ER_ER_142
AGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGG

fasta alignment biopython sequence • 3.5k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 9.8 years ago by allyson1115ar ▴ 30

2

Entering edit mode

You can actually specify in the step size in the range function. The full range function takes in: start, end, step arguments.

For example:

for i in range(0, len(seq_record.seq) - 99, 50)

ADD REPLY • link 9.8 years ago by Damian Kao 16k

1

Entering edit mode

Hi !

Have you considered using a "while" loop instead of your second "for" loop ? Just like :

i = 0
while (i < len(seq_record.seq)) :
  f.write(">" + str(seq_record.id) + "\n")
  f.write(str(seq_record.seq[i:i+50]) + "\n")
  i += 50

ADD REPLY • link 9.8 years ago by Thibault D. ▴ 700

0

Entering edit mode

It solved perfectly. Thanks for reply. But there's one issue there are few base pairs left out in this case.

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 9.8 years ago by allyson1115ar ▴ 30

0

Entering edit mode

My fault, I've edited my last post.

ADD REPLY • link 9.8 years ago by Thibault D. ▴ 700

0

Entering edit mode

the codes are not equivalent - the original code generates overlapping window shifted by one base, this code creates non-overlapping windows that are shifted by a window size.

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 9.8 years ago by Istvan Albert 102k

0

Entering edit mode

I agree. However, I tried to fit the expected output.

ADD REPLY • link 9.8 years ago by Thibault D. ▴ 700

Ram · Answer 1 · 2015-06-12

The BBMap package (latest version, 35.07) also has a tool for this, Shred:

shred.sh in=file.fa out=shredded.fa length=100 overlap=50 minlength=51

Those settings will give you 100bp sequences, every 50bp (length-overlap=50), and the last sequence will be as short as 51bp (assuming the main sequence is not divisible by 50), ensuring that every base is covered at least once.