Split read based on conserved sequence repeat
3.1 years ago
PolDE • 0

I have reads that contain repeats of a known, conserved 10 nt sequence. I want to split each read into subunits, using the 10 nt sequence as a marker for where to split.

For example (the conserved sequence is cccgggttta):

>
acagtacccgggtttaatcgatcgatcgtacccgggtttagtacgtacgatcgtcccgggtttatgctgtcgtc

To get:

>
acagtacccgggttta
>
atcgatcgatcgtacccgggttta
>
gtacgtacgatcgtcccgggttta
>
tgctgtcgtc

Any help is appreciated, thank you.

conserved repeat Split-Read • 607 views
3.1 years ago

I would write a short Python program along these lines (this one uses Biopython):

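from Bio import SeqIO

# Conserved 10 nt marker; it must match the case used in the reads
patt = "cccgggttta"

stream = SeqIO.parse("input.fa", format="fasta")

for rec in stream:
    # split() drops the pattern, so it is added back to each piece below
    pieces = rec.seq.split(patt)

    for piece in pieces[:-1]:
        print(">piece")
        print(piece + patt)

    # The last piece has no trailing pattern; skip it if the read ended on the pattern
    if len(pieces[-1]) > 0:
        print(">piece")
        print(pieces[-1])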

When run on the example read, it produces:

>piece
acagtacccgggttta
>piece
atcgatcgatcgtacccgggttta
>piece
gtacgtacgatcgtcccgggttta
>piece
tgctgtcgtc
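
If installing Biopython is not an option, the same idea can be sketched with plain Python strings. This is only a minimal illustration: the function name `split_after_marker` is made up for this example, the read is hard-coded rather than parsed from a FASTA file, and it assumes the marker and the read use the same case.

```python
def split_after_marker(seq, marker):
    """Split seq into pieces that each end with marker.

    The final piece is whatever follows the last marker; it is omitted
    when the read ends exactly on a marker.
    """
    if not marker:
        raise ValueError("marker must be a non-empty string")
    pieces = []
    start = 0
    while True:
        # Look for the next marker at or after the current position
        hit = seq.find(marker, start)
        if hit == -1:
            break
        end = hit + len(marker)
        pieces.append(seq[start:end])  # keep the marker at the end of the piece
        start = end
    # Remainder after the last marker, if any
    if start < len(seq):
        pieces.append(seq[start:])
    return pieces


read = "acagtacccgggtttaatcgatcgatcgtacccgggtttagtacgtacgatcgtcccgggtttatgctgtcgtc"
for piece in split_after_marker(read, "cccgggttta"):
    print(">piece")
    print(piece)
```

Joining the returned pieces gives back the original read, which is a quick sanity check that nothing was lost in the split.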