I would like to find a short sequence (25 bp) that does not match anywhere within Human Reference Genome (hg38/hg19) through R scripts. I am not sure if there is a quick solution to do what I am looking for. So, I guess it would be solve, if I can create a list of random 25 bp fragments and then do an alignment. Ultimately, the lowest score output could be the supposed answer! If it might be a correct approach to do, could you please guide me how it can be done by R scripting. Thank you.
Just thinking aloud, but you could screen for low abundance kmers in the Human genome, and then create a random sequence from any low abundance/absent kmers.
That should reduce the likelihood of any matching, rather than a 'guess and check' approach.
Keep in mind also that to ask for a sequence that does not align you need to decide what makes an alignment valid (e.g, maximum number of mismatches or minimum percent identity, E-value, etc). If you don't define any constraint, then any sequence can be aligned to any other sequence by allowing enough edits. (Presumably you leave this decision to your aligner by executing it with default settings but maybe worth to think about it...)
I’ll leave it to OP to edit, as I don’t like changing peoples wording if I can avoid it, but i suggest RefSeq in the thread title be changed to hg38 or whatever. It sounds like trying to create a sequence which doesn’t align anywhere in RefSeq at the moment which is probably nigh on impossible.
Use this for some inspiration: https://bioinformatics.stackexchange.com/questions/2853/least-present-k-mers-in-the-human-genome