Question

Split Large Genome

4.0 years ago

lorenzinip • 0

Hi all, I would like to split a very large genome (2.4 GB) into sequences of 1 kb length, possibly with an overlap of about 100 bp. What is the most efficient way of handling this? Biopython?

genome • 905 views

score 2 · Accepted Answer · 2020-11-17

4.0 years ago

shenwei356 8.7k

Try seqkit sliding, e.g.,

$ echo -ne ">seq\nACTGGTCA\n" | seqkit sliding -s 3 -W 4 -g
>seq_sliding:1-4
ACTG
>seq_sliding:4-7
GGTC
>seq_sliding:7-8
CA
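To make the window and step arithmetic explicit, here is a minimal pure-Python sketch that reproduces the toy example above. It is not seqkit's implementation: the function name `sliding_windows` and its parameters are hypothetical, and edge-case behaviour may differ from seqkit. The overlap between consecutive windows is the window size minus the step (here 4 - 3 = 1 base).

```python
def sliding_windows(seq, window, step, greedy=False):
    """Yield (start, end, subsequence) using 1-based inclusive coordinates.

    Hypothetical helper for illustration only; it mimics the idea behind
    `seqkit sliding -W window -s step [-g]`, not its exact behaviour.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    n = len(seq)
    for start in range(0, n, step):
        end = start + window
        if end > n:
            # The remaining stretch is shorter than a full window:
            # emit it once if greedy is set, then stop.
            if greedy:
                yield start + 1, n, seq[start:n]
            break
        yield start + 1, end, seq[start:end]


# Same input as the seqkit toy example: window 4, step 3, greedy.
for start, end, sub in sliding_windows("ACTGGTCA", window=4, step=3, greedy=True):
    print(f">seq_sliding:{start}-{end}")
    print(sub)
```

For a 2.4 GB genome, a streaming tool such as seqkit is likely to be much faster than a plain Python loop; the sketch is only meant to show how the coordinates are derived.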

For your case (1 kb sequences with a 100 bp overlap), set the window size to 1,000 with a step of 900:

seqkit sliding -s 900 -W 1000 genome.fa.gz -o genome.s.fa.gz

You can further split the output into files containing a single sequence each:

seqkit sliding -s 900 -W 1000 genome.fa.gz | seqkit split2 -s 1 -O outdir

Thank you, it works very fast

4.0 years ago by lorenzinip • 0