Question

Is There Any Tool For Converting Fasta To Reads Of Certain Length?

0

Entering edit mode

11.6 years ago

Biomonika (Noolean) 3.2k

Hi there!

Is there any tool for converting fasta to reads of certain length? Possibly with simulation of random fragmenting and unequal coverage (=based on random choice). The best would be if reads kept identification of sequence they came from like so that

>my_header
ACTGTTTACGTGGT

would lead into:

>my_header_1
TGTT
>my_header_2
GTGG
>my_header_3
TTTA

Thanks a lot :-)

fasta reads • 2.3k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 11.6 years ago by Biomonika (Noolean) 3.2k

1

Entering edit mode

duplicate of many questions related to samtools/wgsim ... wgsim

ADD REPLY • link 11.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Well I did not know wgsim and wasn't successful with my google search of "fasta to reads", "sequence to reads convert" and other combinations.. thanks.

ADD REPLY • link 11.6 years ago by Biomonika (Noolean) 3.2k

1

Entering edit mode

Search for "reads simulator". There are plenty of programs like that (hard to recommend a single one), simulating not only lengths, but also errors (depending on the sequencing platform you want to emulate). Some also allow for specifying coverage model.

ADD REPLY • link 11.6 years ago by Pawel Szczesny 3.2k

score 1 · Answer 1 · 2013-05-12

You can try this bit of code that I wrote and modify it to suit your needs:

#!/usr/bin/python

"""Digest sequences from a fasta file in sequences of a specified length.

Usage:
    %program <input_file> <fragment_length> <move> <output_file>

input_file: a fasta file
fragment_length: length of digested fragments
move: number of non-overlapping nucleotides between two consecutive fragments"""


import sys

try:
    from Bio import SeqIO
except:
    print "This program requires the Biopython library"
    sys.exit(0)

try:
    in_file = open(sys.argv[1], "rU")
    nb_nuc = int(sys.argv[2])
    move = int(sys.argv[3])
    out_file = sys.argv[4]
except:
    print __doc__
    sys.exit(0)

sequences = ([seq.id, seq.seq.tostring()] 
             for seq in SeqIO.parse(in_file, 'fasta'))

def digest(seq, nb_nuc):
    fragments = []
    n = seq[0]
    n_counter = 0
    s = seq[1]
    start = nb_nuc/2
    while len(s) > nb_nuc/2:
        frag = s[:nb_nuc - start]
        start -= move
        if start < 0:
            start = 0
            s = s[move:]
        n_counter +=1
        fragments.append([n + "_" + str("%06i" % n_counter), frag])
    return fragments

with open(out_file, "w") as out_file:
    for seq in sequences:
        fragments = digest(seq, nb_nuc)
        for frag in fragments:
            out_file.write(">" + frag[0] + "\n" + frag[1] + "\n")