How to split a long DNA sequence into fixed-length parts with Perl/Python?
12.1 years ago
quge856 ▴ 70

Hi there,

I have a long DNA sequence (in FASTA format). Could you show me how to split it into fixed-length fragments (100 nt) with a 20 nt overlap between consecutive fragments? For example:

Input:

>E.coli  
ACTG*****************************

Output:

>E.coli(1-100)  
ACTG***********************  
>E.coli(81-180)  
*******************************  
>E.coli(161-260)  
*******************************

Thank you in advance!

split perl • 12k views

Could you tell us whether you have tried to do this yourself, or whether you don't know where to start with the problem?


I was looking for a script, like the one in JC's answer. Thank you as well.

12.1 years ago

This can be done with Biopieces (www.biopieces.org) like this:

read_fasta -i data_in.fna | split_seq -w 100 -s 20 | write_fasta -o data_out.fna -x

Thank you, martinahansen. After reading the Biopieces introduction, I realized it's a very powerful tool!

12.1 years ago
JC 13k

Perl option:

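#!/usr/bin/perl

use strict;
use warnings;

my $len  = 100;   # fragment length (nt)
my $over = 20;    # overlap between consecutive fragments (nt)
my ($seq_id, $seq) = ('', '');

# Read a single-record FASTA file: keep the header, join the sequence lines.
while (<>) {
    chomp;
    if (m/^>/) { $seq_id = $_; } else { $seq .= $_; }
}

# Slide a window of $len nt along the sequence in steps of $len - $over.
# Coordinates in the output header are 1-based and inclusive.
for (my $i = 1; $i <= length $seq; $i += ($len - $over)) {
    my $s = substr ($seq, $i - 1, $len);
    print "$seq_id ($i-", $i + (length $s) - 1, ")\n$s\n";
}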
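
Since the question also asks about Python, here is a minimal sketch of the same sliding-window idea as the Perl script above. It is not from any of the answers in this thread: the function names `read_single_fasta` and `split_with_overlap` are made up for illustration, and it assumes the input holds a single FASTA record, as in the question.

```python
def read_single_fasta(lines):
    """Return (header, sequence) from an iterable of FASTA lines.

    Assumes a single record; if several headers are present, the last
    header wins and all sequence lines are joined together.
    """
    header, chunks = "", []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line
        else:
            chunks.append(line)
    return header, "".join(chunks)


def split_with_overlap(seq, length=100, overlap=20):
    """Yield (start, end, fragment) using 1-based, inclusive coordinates."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap  # distance between consecutive fragment starts
    for offset in range(0, len(seq), step):
        fragment = seq[offset:offset + length]
        yield offset + 1, offset + len(fragment), fragment


# Toy 260 nt record standing in for a real FASTA file.
demo_lines = [">E.coli", "ACGT" * 50, "ACGT" * 15]
header, seq = read_single_fasta(demo_lines)
for start, end, fragment in split_with_overlap(seq):
    print(f"{header} ({start}-{end})")
    print(fragment)
```

For a real file, pass an open file handle instead of `demo_lines`, for example `with open("ecoli.fasta") as fh: header, seq = read_single_fasta(fh)`. Like the Perl loop, this keeps stepping until the start position runs past the end of the sequence, so the last fragment can be shorter than 100 nt.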

Thank you, JC. This is exactly the script I wanted, and it works very well. Thanks again.

12.1 years ago
SES 8.6k

This can easily be done with genometools as:

gt shredder -minlength 100 -maxlength 100 -overlap 20 ecoli.fasta > ecoli_shredded.fasta

Note that there are also -coverage and -sample options for shredder that will allow you to control how your fragments are generated. Another good option is dwgsim, which is capable of doing sampling with various kinds of mutations, but this may be more than what you need. The Biopieces (mentioned by martinahansen) or genometools solutions are probably more appropriate based on your question.


Thanks for your input. It also looks like a useful tool, alongside Biopieces.

12.1 years ago
brentp 24k

You can use pyfasta to do this:

pyfasta split -k 100 -o 20 input.fasta -n 1

Thank you! I'm learning a lot of new skills from you guys, lol.
