Split Large Genome
4.0 years ago
lorenzinip • 0

Hi all, I would like to split a very large genome (2.4 GB) into sequences of 1 kb, possibly with an overlap of about 100 bp. What is the most efficient way to handle this? Biopython?

genome • 906 views
4.0 years ago

Try seqkit sliding, e.g.,

$ echo -ne ">seq\nACTGGTCA\n" | seqkit sliding -s 3 -W 4 -g
>seq_sliding:1-4
ACTG
>seq_sliding:4-7
GGTC
>seq_sliding:7-8
CA
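
To make the window arithmetic in the toy example explicit, here is a minimal Python sketch. It is not seqkit's code: the function name `sliding_windows` and its parameters are made up for illustration, and seqkit's handling of edge cases may differ. The `greedy` argument plays the role of `-g`, keeping the last window even when it is shorter than the window size.

```python
from typing import Iterator, Tuple


def sliding_windows(
    seq: str, window: int, step: int, greedy: bool = False
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, subsequence) using 1-based inclusive coordinates.

    Hypothetical helper for illustration only; not part of seqkit or Biopython.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for offset in range(0, len(seq), step):
        piece = seq[offset:offset + window]
        if len(piece) < window and not greedy:
            break  # drop the trailing partial window unless greedy is set
        yield offset + 1, offset + len(piece), piece


# Same input and settings as the toy command: window 4, step 3, greedy.
toy = list(sliding_windows("ACTGGTCA", window=4, step=3, greedy=True))
for start, end, piece in toy:
    print(f">seq_sliding:{start}-{end}")
    print(piece)
```

With a window of 4 and a step of 3, consecutive windows share 4 - 3 = 1 base, so in general the overlap is the window size minus the step.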

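For your case (1 kb sequences with a 100 bp overlap), set the window size to 1,000 and the step to 900. Add -g if you also want to keep the final window of each sequence when it is shorter than 1 kb.

seqkit sliding -s 900 -W 1000 genome.fa.gz -o genome.s.fa.gz

You can then split the output further into files that each contain a single sequence.

seqkit sliding -s 900 -W 1000 genome.fa.gz | seqkit split2 -s 1 -O outdir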
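
Since the question mentions Python, here is a rough standard-library sketch of the same idea, assuming 1 kb windows with a 100 bp overlap. The names `read_fasta` and `write_windows` are hypothetical, the header format only imitates the seqkit output shown above, and each record is held in memory as one string, so expect it to be much slower than seqkit on a 2.4 GB genome.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from a FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_windows(src: TextIO, dst: TextIO, window: int = 1000, overlap: int = 100) -> int:
    """Write overlapping windows of every record to dst; return how many were written."""
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    count = 0
    for name, seq in read_fasta(src):
        for offset in range(0, len(seq), step):
            piece = seq[offset:offset + window]
            # 1-based inclusive coordinates in the header.
            dst.write(f">{name}_sliding:{offset + 1}-{offset + len(piece)}\n{piece}\n")
            count += 1
            if offset + window >= len(seq):
                break  # this window already reached the end of the record
    return count


# Small in-memory demo: one 2,000 bp record.
demo_in = io.StringIO(">chr_demo\n" + "ACGT" * 500 + "\n")
demo_out = io.StringIO()
n_windows = write_windows(demo_in, demo_out)
print(n_windows)  # prints 3 (windows 1-1000, 901-1900, 1801-2000)

# For real files (paths are placeholders), something like:
#   import gzip
#   with gzip.open("genome.fa.gz", "rt") as src, gzip.open("genome.s.fa.gz", "wt") as dst:
#       write_windows(src, dst)
```

Unlike the plain seqkit command, this sketch always keeps the last, shorter window of each record, which corresponds to running seqkit sliding with -g.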

Thank you, it works and it is very fast.

