Hi all, I would like to split a very large genome (2.4 GB) into sequences of 1 kb length, possibly with an overlap of about 100 bp. What is the most efficient way of handling this? Biopython?
Try seqkit sliding, e.g.,
$ echo -ne ">seq\nACTGGTCA\n" | seqkit sliding -s 3 -W 4 -g
>seq_sliding:1-4
ACTG
>seq_sliding:4-7
GGTC
>seq_sliding:7-8
CA
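To make the window arithmetic in that example explicit, here is a minimal Python sketch that reproduces the same coordinates. The function name `sliding_windows` and its `greedy` parameter are made up for illustration; this mimics the output shown above and is not seqkit's actual implementation.

```python
def sliding_windows(seq, window, step, greedy=False):
    """Yield (start, end, subsequence) with 1-based, inclusive coordinates.

    With greedy=True, trailing windows shorter than `window` are kept,
    which is the behaviour the -g example above shows.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # Without greedy, stop at the last start that still fits a full window.
    last_start = len(seq) if greedy else len(seq) - window + 1
    for start in range(0, last_start, step):
        end = min(start + window, len(seq))
        yield start + 1, end, seq[start:end]


for start, end, sub in sliding_windows("ACTGGTCA", window=4, step=3, greedy=True):
    print(f">seq_sliding:{start}-{end}")
    print(sub)
```

The overlap between consecutive windows is simply window minus step, so a window of 4 with a step of 3 shares 1 base with its neighbour.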
For your case (1 kb sequences with a 100 bp overlap), set the window size to 1,000 with a step of 900:
seqkit sliding -s 900 -W 1000 genome.fa.gz -o genome.s.fa.gz
You can further split the output into files containing a single sequence each:
seqkit sliding -s 900 -W 1000 genome.fa.gz | seqkit split2 -s 1 -O outdir
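Since the question mentions Python, here is a rough pure-Python sketch of the same idea, using only the standard library. The function names `read_fasta` and `write_windows` are hypothetical, the record naming just imitates the seqkit example, and trailing windows shorter than the window size are always kept (like `-g`). It holds one sequence in memory at a time and will most likely be much slower than seqkit on a 2.4 GB genome.

```python
import gzip


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    name, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the ID, i.e. the header text up to the first space.
                name = line[1:].split(None, 1)[0] if len(line) > 1 else ""
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_windows(in_path, out_path, window=1000, step=900):
    """Write each window as its own FASTA record; return the record count."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    count = 0
    with open(out_path, "w") as out:
        for name, seq in read_fasta(in_path):
            for start in range(0, len(seq), step):
                end = min(start + window, len(seq))
                out.write(f">{name}_sliding:{start + 1}-{end}\n")
                out.write(seq[start:end] + "\n")
                count += 1
    return count
```

A call such as `write_windows("genome.fa.gz", "genome.windows.fa", window=1000, step=900)` would produce 1 kb windows with a 100 bp overlap, as asked in the question; the output file name here is only an example, and the output is written uncompressed.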
Thank you, it works very fast