Question

counting motifs in dna sequence

2

Entering edit mode

8.2 years ago

aguilar.juanpablo ▴ 20

Hi,

I have these sequences:

GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATGA AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATAA GCAGGAATAGTTGGAACCGCCCTTAGCTTATTAATTCGAGCAGAACTCAGCCAACCTGGTGCCTTATTAGGGGATGA GCTGGCATAGTAGGAACTGCCCTTAGCCTTTTAATTCGAGCAGAGCTCAGTCAACCCGGAGCCCTGCTCGGAGATGA GCAGGAATAGTTGGAACTGCACTAAGCCTTTTAATTCGAGCTGAACTAAGCCAACCCGGAGCATTACTTGGAGACGA

They would be actually longer, but right not it is not important.

I would like to estimate a given value for the sequences given the number of motif/s in. I would like to count a (number of) motif/s like "ATCGCGCGCGCTTTAAA" in a given sequence, and then use that number to estimate a value for that sequence. I know that you can use a logical question to ask whether a given sequence has a motif but I would like to count them.

Thanks

sequence motif counting • 4.4k views

ADD COMMENT • link updated 8.2 years ago by shenwei356 8.7k • written 8.2 years ago by aguilar.juanpablo ▴ 20

0

Entering edit mode

Do you have actual motif or just a string, ATCGCGCGCGCTTTAAA ?

ADD REPLY • link 8.2 years ago by GouthamAtla 12k

0

Entering edit mode

Biostrings package in R provides a beautiful function for you purpose:gregexpr2

ADD REPLY • link 8.2 years ago by Macherki M E ▴ 120

score 4 · Answer 1 · 2016-12-06

The sequences you pasted seemed aligned sequences:

$ cat seqs.txt 
GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATGA
AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATAA
GCAGGAATAGTTGGAACCGCCCTTAGCTTATTAATTCGAGCAGAACTCAGCCAACCTGGTGCCTTATTAGGGGATGA
GCTGGCATAGTAGGAACTGCCCTTAGCCTTTTAATTCGAGCAGAGCTCAGTCAACCCGGAGCCCTGCTCGGAGATGA
GCAGGAATAGTTGGAACTGCACTAAGCCTTTTAATTCGAGCTGAACTAAGCCAACCCGGAGCATTACTTGGAGACGA

Let's do this using seqkit locate (usage) and csvtk freq (usage) in one-line command:

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -p ATC | csvtk -t freq -k -f 1
seqID   frequency
1       1
2       1
3       1
4       1

Both SeqKit and csvtk provide executable binary files for Linux/Windows/Mac OS, so you can just download and run without any compilation and configuration.

Let me explain this step by step:

Step 1: converting your input sequences to FASTA format and remove gaps (-):

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g 
>1
GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGG
GCTCTGCTCGGAGATGA
>2
AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGA
GTTTTATTAGATAA

Step 2: searching and locating your sequence motif (e.g., ATC):

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -p ATC | csvtk -t pretty
seqID   patternName   pattern   strand   start   end   matched
1       ATC           ATC       -        73      75    ATC
2       ATC           ATC       -        70      72    ATC
3       ATC           ATC       -        73      75    ATC
4       ATC           ATC       -        73      75    ATC

Notes:

Flag -d/--degenerate is on, so you can search motif containing degenerate base, e.g., ACTGNNNNCCC.
Command csvtk -t pretty is used to align table data.

If you have lots of sequence motifs, you can save them to file in FASTA format, and use this:

$ cat motifs.fa
>motif1
ATC
>motif2
actgn


$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -f motifs.fa | csvtk -t pretty
seqID   patternName   pattern   strand   start   end   matched
1       motif2        actgn     +        16      20    ACTGC
1       motif1        ATC       -        73      75    ATC
2       motif2        actgn     +        16      20    ACTGG
2       motif2        actgn     +        52      56    ACTGT
2       motif1        ATC       -        70      72    ATC
3       motif1        ATC       -        73      75    ATC
4       motif1        ATC       -        73      75    ATC
4       motif2        actgn     +        16      20    ACTGC
4       motif2        actgn     -        47      51    ACTGA
5       motif2        actgn     +        16      20    ACTGC

Step 3: counting the number of motifs in every sequences with csvtk freq

$  cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -p ATC | csvtk -t freq -k -f 1
seqID   frequency
1       1
2       1
3       1
4       1

For case of multiple motifs:

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -f motifs.fa | csvtk -t freq -k -f 1,2 | csvtk -t pretty
seqID   patternName   frequency
1       motif1        1
1       motif2        1
2       motif1        1
2       motif2        2
3       motif1        1
4       motif1        1
4       motif2        2
5       motif2        1

score 1 · Answer 2 · 2016-12-05

1

Entering edit mode

8.2 years ago

Istvan Albert 102k

You may use the dreg emboss tool or seqkit grep and probably many others. Even grep will work once you remove the new lines

will put the whole genome on one line, so don't do it for large genomes:

 cat fasta.fa | tr -d '\n' | grep -o "ATGC"

ADD COMMENT • link 8.2 years ago by Istvan Albert 102k

0

Entering edit mode

Hi, Istvan, it should be seqkit locate not seqkit grep

ADD REPLY • link 8.2 years ago by shenwei356 8.7k

0

Entering edit mode

Ah ok, thanks for the correction.

ADD REPLY • link 8.2 years ago by Istvan Albert 102k