Is there a tool that can count k-mers, including those with invalid characters? For example, given the sequence "ANTTA" with K=3, there are three k-mers: "ANT", "NTT", and "TTA".
Is this homework? Which do you prefer, a script or an existing tool?
Why do you want to do this?
If you'd like to write a script, just slide over the sequences with a window of K and a step of 1, and check whether each window (k-mer) contains non-ACGT characters.
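That sliding-window check could be sketched in Python like this (a minimal illustration; the function name `count_invalid_kmers` is made up for this example, not from any existing tool):

```python
import re

def count_invalid_kmers(seq, k):
    """Slide a window of size k with step 1 over seq and count
    the windows (k-mers) that contain a non-ACGT character."""
    invalid = re.compile('[^ACGTacgt]')
    return sum(1 for i in range(len(seq) - k + 1)
               if invalid.search(seq[i:i + k]))

# In the "ANTTA", K=3 example, two of the three k-mers
# ("ANT" and "NTT") contain an N:
print(count_invalid_kmers("ANTTA", 3))
```

This generates every window, so it is simple but slow on genome-scale input.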
Thanks for your kind reply. It's not homework: I want to count k-mers containing invalid characters in a large reference library that may contain hundreds of species, with a total size of 200 gigabytes or more; K might be 33 or larger. I have used Jellyfish, but it drops k-mers with invalid characters. Is SeqKit efficient enough for this?
5.5 minutes on my laptop for human genome chr1 (~250 Mb, 248,956,422 bp). At that rate, 200 Gb of data would take about 67 hours.
The bottleneck is that there are just too many k-mers :-P
$ time seqkit head -n 1 dataset_B.fa \
| seqkit sliding -s 1 -W 33 \
| seqkit seq -s -w 0 \
| grep -c '[^ACGTacgt]'
18479833
seqkit head -n 1 dataset_B.fa 0.74s user 0.42s system 96% cpu 1.204 total
seqkit sliding -s 1 -W 33 329.17s user 6.58s system 101% cpu 5:31.36 total
seqkit seq -s -w 0 137.60s user 11.08s system 44% cpu 5:31.30 total
grep -c '[^ACGTacgt]' 83.15s user 4.60s system 26% cpu 5:31.29 total
A much faster way, especially for very large input sequences (100 Mb+): pre-retrieve the regions containing non-ACGT characters and count on those (SeqKit v0.5.2-dev or later is needed).
The result is slightly different (an estimate): 18,472,858 < 18,479,833.
But it takes only 47 s for chr1 now! That would be about 10 hours for your 200 Gb of data.
$ time seqkit head -n 1 dataset_B.fa \
| seqkit locate -P -p '[^ACGTacgt]+' -G \
| sed 1d | cut -f 1,7 | seqkit tab2fx \
| seqkit sliding -s 1 -W 33 \
| seqkit seq -s -w 0 \
| grep -c '[^ACGTacgt]'
18472858
seqkit head -n 1 dataset_B.fa 0.70s user 0.44s system 94% cpu 1.202 total
seqkit locate -P -p '[^ACGTacgt]+' -G 17.58s user 0.35s system 96% cpu 18.671 total
sed 1d 0.01s user 0.04s system 0% cpu 18.629 total
cut -f 1,7 0.06s user 0.02s system 0% cpu 18.626 total
seqkit tab2fx 0.09s user 0.07s system 0% cpu 46.613 total
seqkit sliding -s 1 -W 33 27.77s user 0.58s system 60% cpu 46.788 total
seqkit seq -s -w 0 10.72s user 0.93s system 24% cpu 46.778 total
grep -c '[^ACGTacgt]' 11.90s user 0.35s system 26% cpu 46.776 total
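The same idea can be taken one step further: the exact count can be derived from the positions of the non-ACGT characters alone, without generating any k-mers at all, since a window starting at position i contains the invalid character at position p iff max(0, p-k+1) <= i <= min(p, len-k). The answer is then the size of the union of those start-position intervals. A hypothetical Python sketch (not part of SeqKit; the function name is made up here):

```python
import re

def count_invalid_kmers_by_positions(seq, k):
    """Count length-k windows containing at least one non-ACGT
    character, using only the positions of those characters.
    Equivalent to sliding all windows, but touches each invalid
    character once instead of generating every k-mer."""
    last_start = len(seq) - k          # last valid window start (0-based)
    if last_start < 0:
        return 0
    count, prev_end = 0, -1            # prev_end: last start already counted
    for m in re.finditer('[^ACGTacgt]', seq):
        p = m.start()
        lo = max(0, p - k + 1, prev_end + 1)   # avoid double-counting
        hi = min(p, last_start)
        if hi >= lo:
            count += hi - lo + 1
            prev_end = hi
    return count
```

Unlike the region-extraction pipeline above, which misses windows that extend beyond an invalid run into flanking ACGT bases (hence the slightly smaller number), this counts exactly the same k-mers as the slow sliding pipeline.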
shenwei, thanks for your help. I will try SeqKit.
Would you please upvote/accept my answer if it helps and works? :)