Thanks for the question - it has kept me busy this Sunday morning / afternoon. As implied by others, this poses a computational challenge but is not insurmountable.
For motif searching generally, I usually use AWK. My approach here was to:
- generate all possible k-mers of the chosen size (run once only)
- separately, tabulate k-mers of the chosen size in the hg38 sequence
- get the difference between #1 and #2
I should also note that I have only done this for 12-mers and for just chr22. For 16-mers, I'd likely have to use a cluster environment. Over the entire genome, I would continue to break the analysis up by chromosome.
Step 1, generate all possible k-mers of the chosen size
# generate all possible 12-mer
# 'expected' must be set as maximum possible (for 12-mer, expected = 4^12)
awk -v atgc=ATGC -v expected=16777216 -v range=4 'BEGIN{
srand(565447);
do {
seq = substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)
if (!(seq in motifs)) {
print seq
motifs[seq] = 1
count++
}
} while (count < expected)
}' > all.12mer.out
NB - each random base is produced by substr(atgc,int(1+rand()*range),1)
. In this case, for a 12-mer, there are 12 of these assigned to seq
.
For 16-mers, you need 16:
awk -v atgc=ATGC -v expected=4294967296 -v range=4 'BEGIN{
srand(565447);
do {
seq = substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)
if (!(seq in motifs)) {
print seq
motifs[seq] = 1
count++
}
} while (count < expected)
}' > all.16mer.out
all.12mer.out looks like:
head all.12mer.out
CACCCTGATCCT
ACAATGTTCGTT
GTGACGGTCGCT
TTAGGCACCGAT
GCTCATATTTAT
GCGGGTTAGATG
GTGGTTGCGCAA
GATCTAAAGTGG
There are 16,777,216 possible 12-mers. This step takes a while (20 minutes) and could be improved in efficiency. The output file is ~220MB. The good thing is that this step only needs to be run once only.
Step 2, separately, tabulate k-mers of the chosen size in the hg38 sequence (just chr22)
In this step, the entire target sequence is first flattened and converted to upper case. Then, each found 12-mer is added to an indexed AWK array, with the entry's value incremented when the same 12-mer is found again. It moves in a sliding window of 1bp:
ATGCATGCATGCATGCATGCATGC
ATGCATGCAT
|_________|
TGCATGCATG
|_________|
GCATGCATGC
|_________|
samtools faidx hg38.fasta chr22 | \
awk '{if (NR>1) printf toupper($0)}' | \
awk '{
for (i=1; i<=NF; i+=1)
if (i+11<=NF) {
m12[$(i)$(i+1)$(i+2)$(i+3)$(i+4)$(i+5)$(i+6)$(i+7)$(i+8)$(i+9)$(i+10)$(i+11)]+=1
}
} END {
for (motif in m12)
print motif","m12[motif]
}' FS='' > hg38.12mer.out
hg38.12mer.out contains each identified 12-mer and its frequency
head hg38.12mer.out
GCTCTCCCGCAG,2
TCGCTATATTGT,1
GAAACACGCCCT,2
AACCCTGGGAGT,2
GGCAGTCAGGAT,7
TCCCACTCGGCA,2
GTGAATTCTTCT,5
TCCCACTCGGCC,5
For 16-mer, you would need to modify to:
samtools faidx hg38.fasta chr22 | \
awk '{if (NR>1) printf toupper($0)}' | \
awk '{
for (i=1; i<=NF; i+=1)
if (i+15<=NF) {
m16[$(i)$(i+1)$(i+2)$(i+3)$(i+4)$(i+5)$(i+6)$(i+7)$(i+8)$(i+9)$(i+10)$(i+11)$(i+12)$(i+13)$(i+14)$(i+15)]+=1
}
} END {
for (motif in m16)
print motif","m16[motif]
}' FS='' > hg38.16mer.out
Step 3, get the difference between #1 and #2
# identify difference between all possible 12-mers and those found in the target sequence
awk 'NR==FNR {hgmers[$1]; next} !($0 in hgmers)' FS="," hg38.12mer.out all.12mer.out
With that, there are, apparently, 7,228,791 12-mers that are not present in chr22 in hg38:
awk 'NR==FNR {hgmers[$1]; next} !($0 in hgmers)' FS="," hg38.12mer.out all.12mer.out | head
ACAATGTTCGTT
GTGACGGTCGCT
GCGGGTTAGATG
GTGGTTGCGCAA
GCACGACATTGA
GCACGAGTCAAG
GTCCGCGGGATT
GCCGTGTAAAAG
CGCTTGTCACAC
GGTACCGTTTGA
I originally checked for 8-mers but all 8-mers are accounted for in chr22 in hg38.
If I manage to do 16-mer, I'll post the solution...
Kevin
Any success with the proposed approaches?
I found out 14-mers on my 48GB RAM PC, you can easily extend them to 16-mers. A: How to generate a short sequence that does not align to the RefSeq? .Email me if you need. shenwei356 at gmail