statistic tools of Kmer including invalid character
1
1
Entering edit mode
7.7 years ago
Derek Guan ▴ 20

Is there a tool that could be helpful to statistic kmers including invalid characters? For example, given a sequence "ANTTA", if K=3, then there will be three kmers which is "ANT","NTT" and "TTA".

Kmer sequence • 1.8k views
ADD COMMENT
0
Entering edit mode

Is this a homework? Which way do you prefer, script or existed tools?

Why do you want do this?

If you'd like to write script, just sliding the sequences with window of K and step of 1, and check every window (kmer) whether it contains non-ACGT characters.

ADD REPLY
0
Entering edit mode

Thanks for your kind reply, It's not a homework, I am gonna statistic Kmers including invalid characters of a large reference library which may contain hundreds of species, and its size could be 200 gigabytes or even larger. The K might be 33 or larger. I have used Jellyfish, however it drops Kmers with invalid characters. Is SeqKit efficient to do it?

ADD REPLY
0
Entering edit mode

5.5 minutes on my laptop for human genome chr1 (~250 Mb, 248,956,422). It will take 67 hours for 200Gb data~. The bottle neck is that there are too many kmers :-P

$ time seqkit head -n 1 dataset_B.fa \
     | seqkit sliding -s 1 -W 33 \
     | seqkit seq -s -w 0 \
     | grep -c '[^ACGTacgt]'
18479833

seqkit head -n 1 dataset_B.fa  0.74s user 0.42s system 96% cpu 1.204 total
seqkit sliding -s 1 -W 33  329.17s user 6.58s system 101% cpu 5:31.36 total
seqkit seq -s -w 0  137.60s user 11.08s system 44% cpu 5:31.30 total
grep -c '[^ACGTacgt]'  83.15s user 4.60s system 26% cpu 5:31.29 total
ADD REPLY
0
Entering edit mode

A much much faster way by pre-retrieving regions containing non-ACGT chars. (SeqKit v0.5.2-dev or later version needed).

The result is a little different: 18,472,858 < 18,479,833.

But only 47s for genome chr1 now!!!! It will take only 10 hours for you 200Gb data.

$ time seqkit head -n 1 dataset_B.fa \
        | seqkit locate -P -p '[^ACGTacgt]+' -G \
        | sed 1d | cut -f 1,7 | seqkit tab2fx \
        | seqkit sliding -s 1 -W 33 \
        | seqkit seq -s -w 0 \
        | grep -c '[^ACGTacgt]'
18472858

seqkit head -n 1 dataset_B.fa  0.70s user 0.44s system 94% cpu 1.202 total                   
seqkit locate -P -p '[^ACGTacgt]+' -G  17.58s user 0.35s system 96% cpu 18.671 total         
sed 1d  0.01s user 0.04s system 0% cpu 18.629 total                                          
cut -f 1,7  0.06s user 0.02s system 0% cpu 18.626 total                                      
seqkit tab2fx  0.09s user 0.07s system 0% cpu 46.613 total                                   
seqkit sliding -s 1 -W 33  27.77s user 0.58s system 60% cpu 46.788 total
seqkit seq -s -w 0  10.72s user 0.93s system 24% cpu 46.778 total
grep -c '[^ACGTacgt]'  11.90s user 0.35s system 26% cpu 46.776 total
ADD REPLY
0
Entering edit mode

shenwei, Thanks for your help, I will try seqkit.

ADD REPLY
0
Entering edit mode

Would you please upvote/accept my answer if it helps and works. :)

ADD REPLY
0
Entering edit mode
7.7 years ago

One way using seqkit, download:

$ echo -e ">seq\nANTTA" \
    | seqkit sliding -s 1 -W 3 \
    | seqkit grep -s -r -i -p '[^ACGT]' \
    | grep -c '^>'
2
  1. data source: FASTA format
  2. sliding windows (compute K-mers), window size: K, step: 1
  3. searching FASTA sequences contain non-ACGT charactors
  4. counting sequences

[Update 1] Faster way:

$ echo -e ">seq\nANTTA" | seqkit sliding -s 1 -W 3 \
    | seqkit seq -s -w 0 \
    | grep -c '[^ACGTacgt]'
2

[Update 2] But it's still slow when the input sequences are very large (100Mb+), a much faster way is pre-retrieving sequence regions containing non-ACGT charactors and estimate using method above. (SeqKit v0.5.2-dev or later version needed).

$ cat seqs.fasta \
    | seqkit locate -P -p '[^ACGTacgt]+' -G \
    | sed 1d | cut -f 1,7 | seqkit tab2fx \
    | seqkit sliding -s 1 -W 3 \
    | seqkit seq -s -w 0 \
    | grep -c '[^ACGTacgt]'
ADD COMMENT

Login before adding your answer.

Traffic: 992 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6