Is there a tool that can count k-mers, including those with invalid characters? For example, given the sequence "ANTTA" with K=3, there are three k-mers: "ANT", "NTT", and "TTA".
Is this homework? Which do you prefer, a script or an existing tool?
Why do you want to do this?
If you'd like to write a script, just slide over the sequences with a window of K and a step of 1, and check whether each window (k-mer) contains non-ACGT characters.
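That sliding-window check could be sketched in Python like this (a minimal illustration; the function name `count_invalid_kmers` is made up for this example, not from any existing tool):

```python
import re

def count_invalid_kmers(seq, k):
    """Slide a window of size k with step 1 over seq and count
    the windows (k-mers) that contain a non-ACGT character."""
    invalid = re.compile('[^ACGTacgt]')
    return sum(1 for i in range(len(seq) - k + 1)
               if invalid.search(seq[i:i + k]))

# In the "ANTTA", K=3 example, two of the three k-mers
# ("ANT" and "NTT") contain an N:
print(count_invalid_kmers("ANTTA", 3))
```

This generates every window, so it is simple but slow on genome-scale input.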
Thanks for your kind reply. It's not homework: I want to count k-mers containing invalid characters in a large reference library that may contain hundreds of species, with a total size of 200 gigabytes or more; K might be 33 or larger. I have used Jellyfish, but it drops k-mers with invalid characters. Is SeqKit efficient enough for this?
5.5 minutes on my laptop for human genome chr1 (~250 Mb, 248,956,422 bp). At that rate, 200 Gb of data would take about 67 hours.
The bottleneck is that there are just too many k-mers :-P
$ time seqkit head -n 1 dataset_B.fa \
| seqkit sliding -s 1 -W 33 \
| seqkit seq -s -w 0 \
| grep -c '[^ACGTacgt]'
18479833
seqkit head -n 1 dataset_B.fa 0.74s user 0.42s system 96% cpu 1.204 total
seqkit sliding -s 1 -W 33 329.17s user 6.58s system 101% cpu 5:31.36 total
seqkit seq -s -w 0 137.60s user 11.08s system 44% cpu 5:31.30 total
grep -c '[^ACGTacgt]' 83.15s user 4.60s system 26% cpu 5:31.29 total
A much faster way, especially for very large input sequences (100 Mb+): pre-retrieve the regions containing non-ACGT characters and count on those (SeqKit v0.5.2-dev or later is needed).
The result is slightly different (an estimate): 18,472,858 < 18,479,833.
But it takes only 47 s for chr1 now! That would be about 10 hours for your 200 Gb of data.
$ time seqkit head -n 1 dataset_B.fa \
| seqkit locate -P -p '[^ACGTacgt]+' -G \
| sed 1d | cut -f 1,7 | seqkit tab2fx \
| seqkit sliding -s 1 -W 33 \
| seqkit seq -s -w 0 \
| grep -c '[^ACGTacgt]'
18472858
seqkit head -n 1 dataset_B.fa 0.70s user 0.44s system 94% cpu 1.202 total
seqkit locate -P -p '[^ACGTacgt]+' -G 17.58s user 0.35s system 96% cpu 18.671 total
sed 1d 0.01s user 0.04s system 0% cpu 18.629 total
cut -f 1,7 0.06s user 0.02s system 0% cpu 18.626 total
seqkit tab2fx 0.09s user 0.07s system 0% cpu 46.613 total
seqkit sliding -s 1 -W 33 27.77s user 0.58s system 60% cpu 46.788 total
seqkit seq -s -w 0 10.72s user 0.93s system 24% cpu 46.778 total
grep -c '[^ACGTacgt]' 11.90s user 0.35s system 26% cpu 46.776 total
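The same idea can be taken one step further: the exact count can be derived from the positions of the non-ACGT characters alone, without generating any k-mers at all, since a window starting at position i contains the invalid character at position p iff max(0, p-k+1) <= i <= min(p, len-k). The answer is then the size of the union of those start-position intervals. A hypothetical Python sketch (not part of SeqKit; the function name is made up here):

```python
import re

def count_invalid_kmers_by_positions(seq, k):
    """Count length-k windows containing at least one non-ACGT
    character, using only the positions of those characters.
    Equivalent to sliding all windows, but touches each invalid
    character once instead of generating every k-mer."""
    last_start = len(seq) - k          # last valid window start (0-based)
    if last_start < 0:
        return 0
    count, prev_end = 0, -1            # prev_end: last start already counted
    for m in re.finditer('[^ACGTacgt]', seq):
        p = m.start()
        lo = max(0, p - k + 1, prev_end + 1)   # avoid double-counting
        hi = min(p, last_start)
        if hi >= lo:
            count += hi - lo + 1
            prev_end = hi
    return count
```

Unlike the region-extraction pipeline above, which misses windows that extend beyond an invalid run into flanking ACGT bases (hence the slightly smaller number), this counts exactly the same k-mers as the slow sliding pipeline.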
shenwei, thanks for your help. I will try SeqKit.
Would you please upvote/accept my answer if it helps and works? :)