Here's a solution using seqkit.
Prepare the pattern file in FASTA format, replacing * with N:
$ more patterns.fa
>Pattern_A
ACAGTANNTATA
>Pattern_B
ATACGACGTANNCG
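As a side note, degenerate matching boils down to translating IUPAC ambiguity codes into regex character classes (roughly what --degenerate does under the hood; the sed one-liner below is an illustrative sketch, not seqkit's actual code, and only handles N):

```shell
# Sketch: expand the degenerate base N into a regex character class,
# which is the form a regex engine can actually match against
echo 'ACAGTANNTATA' | sed -e 's/N/[ACGT]/g'
# -> ACAGTA[ACGT][ACGT]TATA
```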
Search a FASTQ file with 9M SE100 reads:
$ seqkit locate --degenerate --pattern-file patterns.fa dataset_C.fq > result.tsv
elapsed time: 1m:22s
peak rss: 15.05 MB
The computational bottleneck is finding all matches with regular expressions, which is not efficient in Go. Run time scales linearly with the number of patterns:
1 pattern    1m:01s
2 patterns   1m:21s
3 patterns   1m:44s
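Since the matching reduces to regex search, you can sanity-check a converted pattern against a known sequence with plain grep -E before a long run (here [ACGT]{2} stands in for NN; the toy sequences are made up):

```shell
# Count lines matching the degenerate pattern ACAGTANNTATA in regex form;
# only the first toy sequence should match
printf 'ACAGTATCTATA\nACAGTA--TATA\n' | grep -Ec 'ACAGTA[ACGT]{2}TATA'
# -> 1
```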
Result:
seqID patternName pattern strand start end matched
K00137:236:H7NLVBBXX:6:2103:25702:14344 Pattern_A ACAGTANNTATA + 35 46 ACAGTATCTATA
K00137:236:H7NLVBBXX:6:1104:15412:4462 Pattern_A ACAGTANNTATA - 32 43 ACAGTACCTATA
K00137:236:H7NLVBBXX:6:1107:31081:13429 Pattern_A ACAGTANNTATA - 9 20 ACAGTAAATATA
K00137:236:H7NLVBBXX:6:1117:29894:16295 Pattern_B ATACGACGTANNCG - 58 71 ATACGACGTAAACG
K00137:236:H7NLVBBXX:6:1226:5152:25931 Pattern_A ACAGTANNTATA - 26 37 ACAGTATCTATA
Counting needs a little help from the shell:
$ seqkit fx2tab patterns.fa | cut -f 2 | while read pattern; do echo -e "$pattern\t$(grep -c "$pattern" result.tsv)"; done
ACAGTANNTATA 1241
ATACGACGTANNCG 24
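An alternative that avoids one grep pass per pattern is a single awk pass over the pattern column of result.tsv (assuming the seven-column output shown above, where column 3 holds the pattern):

```shell
# Tally hits per pattern in one pass; skip the header line (NR > 1)
awk -F'\t' 'NR > 1 {c[$3]++} END {for (p in c) print p "\t" c[p]}' result.tsv
```

Note that awk's for-in iteration order is unspecified, so pipe through sort if you need a stable order.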
See @Brian's answer below.
I assume you have generated the 150 bp reads referred to in the other thread. You can use
seal.sh
to try and demultiplex them as shown below. See this thread: searching reads with a certain sequence in fastq file. Adding the -c option to grep should get you counts.
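As a minimal illustration of the grep -c approach on a FASTQ file — restrict grep to the sequence lines first, or quality strings can produce spurious matches (the file name reads.fq and the motif ACAGTA are illustrative; degenerate bases would need grep -E with character classes):

```shell
# In FASTQ, the sequence is every 4th line starting at line 2;
# keep only those lines, then count reads containing the motif
awk 'NR % 4 == 2' reads.fq | grep -c 'ACAGTA'
```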