The sequences you pasted seemed aligned sequences:
$ cat seqs.txt
GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATGA
AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATAA
GCAGGAATAGTTGGAACCGCCCTTAGCTTATTAATTCGAGCAGAACTCAGCCAACCTGGTGCCTTATTAGGGGATGA
GCTGGCATAGTAGGAACTGCCCTTAGCCTTTTAATTCGAGCAGAGCTCAGTCAACCCGGAGCCCTGCTCGGAGATGA
GCAGGAATAGTTGGAACTGCACTAAGCCTTTTAATTCGAGCTGAACTAAGCCAACCCGGAGCATTACTTGGAGACGA
Let's do this using seqkit locate
(usage)
and csvtk freq
(usage) in one-line command:
$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -p ATC | csvtk -t freq -k -f 1
seqID frequency
1 1
2 1
3 1
4 1
Both SeqKit
and csvtk
provide executable binary files for Linux/Windows/Mac OS,
so you can just download and run without any compilation and configuration.
Let me explain this step by step:
Step 1: converting your input sequences to FASTA format and remove gaps (-
):
$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g
>1
GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGG
GCTCTGCTCGGAGATGA
>2
AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGA
GTTTTATTAGATAA
Step 2: searching and locating your sequence motif (e.g., ATC
):
$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -p ATC | csvtk -t pretty
seqID patternName pattern strand start end matched
1 ATC ATC - 73 75 ATC
2 ATC ATC - 70 72 ATC
3 ATC ATC - 73 75 ATC
4 ATC ATC - 73 75 ATC
Notes:
- Flag
-d/--degenerate
is on, so you can search motif containing degenerate base, e.g., ACTGNNNNCCC
.
- Command
csvtk -t pretty
is used to align table data.
If you have lots of sequence motifs, you can save them to file in FASTA format,
and use this:
$ cat motifs.fa
>motif1
ATC
>motif2
actgn
$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -f motifs.fa | csvtk -t pretty
seqID patternName pattern strand start end matched
1 motif2 actgn + 16 20 ACTGC
1 motif1 ATC - 73 75 ATC
2 motif2 actgn + 16 20 ACTGG
2 motif2 actgn + 52 56 ACTGT
2 motif1 ATC - 70 72 ATC
3 motif1 ATC - 73 75 ATC
4 motif1 ATC - 73 75 ATC
4 motif2 actgn + 16 20 ACTGC
4 motif2 actgn - 47 51 ACTGA
5 motif2 actgn + 16 20 ACTGC
Step 3: counting the number of motifs in every sequences with csvtk freq
$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -p ATC | csvtk -t freq -k -f 1
seqID frequency
1 1
2 1
3 1
4 1
For case of multiple motifs:
$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -f motifs.fa | csvtk -t freq -k -f 1,2 | csvtk -t pretty
seqID patternName frequency
1 motif1 1
1 motif2 1
2 motif1 1
2 motif2 2
3 motif1 1
4 motif1 1
4 motif2 2
5 motif2 1
Do you have actual motif or just a string,
ATCGCGCGCGCTTTAAA
?Biostrings package in R provides a beautiful function for you purpose:gregexpr2