Hi Adrian,
I actually have a couple of programs that can be used for that general purpose, depending on the specifics.
First off, there's BBMask [bbmask.sh], which can mask low-entropy areas in a genome - for example, ATATATATATAT....etc. You can adjust the entropy window size and entropy level; the default settings mask approximately 1% of the human genome, basically covering all of the areas that are low-complexity enough that the human genome shares them exactly with plants and fungi. BBMask can also accept a sam file mapped to a genome and mask everywhere the sam reads hit. You could, for example, shred a genome into 100bp pieces, map them back to the genome itself, make a sam file of only the multi-mapping reads, and then mask everywhere they hit.
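To give a feel for what "low-entropy" means here, below is a rough Python sketch that computes the Shannon entropy of the kmers inside a sliding window and reports the windows that fall under a cutoff. This is only an illustration of the general idea, not BBMask's actual implementation; the function names, window size, kmer length, and cutoff are all made up for the example.

```python
import math
from collections import Counter

def window_entropy(seq, k=3):
    """Shannon entropy (in bits) of the k-mer distribution within seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def low_entropy_windows(seq, window=20, k=3, cutoff=1.5):
    """Yield (start, end, entropy) for every window whose entropy is below cutoff.

    Coordinates are 0-based, half-open. The defaults are arbitrary
    illustration values, not BBMask's defaults.
    """
    seq = seq.upper()
    for start in range(len(seq) - window + 1):
        h = window_entropy(seq[start:start + window], k)
        if h < cutoff:
            yield start, start + window, h
```

A simple repeat like ATATAT... only ever contains two distinct 3-mers, so its entropy stays near 1 bit, whereas a window of mostly distinct kmers scores much higher.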
But if you want a kmer-based approach, you can use kmercountexact.sh to generate a fasta file containing all kmers that exist at least 2 times in the genome, then mask those with BBDuk, like this:
kmercountexact.sh in=ref.fa out=kmers.fa mincount=2 k=31
bbduk.sh in=ref.fa ref=kmers.fa out=masked.fa ktrim=N k=31 mm=f
Please note that for the purposes of calling variations, I highly recommend mapping to the unmasked genome. You can then ignore variations occurring in regions that would have been masked... but if you map to the masked genome, you can end up with a read that came from the masked portion (say, a gene with 2 identical copies) mapping to a homologous-but-not-identical region, causing false-positives. Typically, if I am interested in calling high-quality variants, I throw away ambiguously-mapped (multi-mapped) reads rather than masking the genome.
I love your last approach :) Trying it right now.
Odd, got an error:
Maybe it's because scaffold %non-ACGTN=0.03
ok. I think it was a couple of "?" chars in my genome... No idea how they got there. Removed them and no error.
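For anyone hitting the same thing, here is a minimal Python sketch for locating stray symbols in a fasta before running the tools. It assumes that only A, C, G, T, and N (either case) are expected, so legitimate IUPAC ambiguity codes such as R or Y would also be reported; the function name is just something made up for this example.

```python
from collections import Counter

# Assumption: anything outside ACGTN (either case) counts as unexpected.
ALLOWED = set("ACGTNacgtn")

def unexpected_symbols(fasta_lines):
    """Return {sequence_name: Counter of unexpected characters} for a fasta.

    fasta_lines is any iterable of lines, for example an open file handle.
    Sequences without unexpected characters are left out of the result.
    """
    found = {}
    name = None
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            continue
        if name is None:
            continue  # ignore anything before the first header
        bad = [c for c in line if c not in ALLOWED]
        if bad:
            found.setdefault(name, Counter()).update(bad)
    return found
```

Calling it on an open file handle for the genome returns the scaffolds that contain something odd, together with a count of each offending character.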
I see what you mean by mapping to an unmasked genome. Indeed, mapping to a masked genome generates a lot of false positives.
Is there any way to generate a .bed file with information as to what regions were masked? Then it becomes easy to filter variants called in these regions.
None of my programs generate bed files, but I will plan to add that capability. Perhaps there is an existing tool somewhere that can create a bed file indicating the locations of all the Ns in a genome?
I was thinking about that too. If there are 20-50 Ns in a row, there might be no point in masking it, as it might just be a coincidence, but if we get regions of significant length masked (and I am not sure what those would be, maybe over 100bp?) then we can choose to exclude those. I have long looked for .bed generators of NNNN regions; they would be useful for several tools (scaffolders do not produce .bed files, and some tools request .bed locations of gaps in the assembly). I will look into this some more.
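In the meantime, a short script can do this. Below is a minimal Python sketch, not an existing tool: it scans a fasta and yields the coordinates of every run of Ns of at least a chosen length, in BED convention (0-based start, exclusive end). The function name and the min_len parameter are made up for this example, and it holds one whole sequence in memory at a time, so treat it as a starting point.

```python
import re

def n_runs_to_bed(fasta_lines, min_len=1):
    """Yield (name, start, end) for each run of N/n that is at least min_len long.

    Coordinates follow BED convention: 0-based start, exclusive end.
    fasta_lines is any iterable of lines, for example an open file handle.
    """
    def runs(name, chunks):
        seq = "".join(chunks)  # one full sequence in memory at a time
        for m in re.finditer(r"[Nn]+", seq):
            if m.end() - m.start() >= min_len:
                yield name, m.start(), m.end()

    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield from runs(name, chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield from runs(name, chunks)
```

Writing each tuple as a tab-separated line gives a usable .bed file, and min_len=100 would keep only the longer regions. Note that running this on the masked fasta also reports Ns that were already in the original assembly (gaps); to get only the newly masked regions, run it on both the original and the masked fasta and subtract one set of intervals from the other.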