I'm a graduate student with experience in wet lab research, but I've recently come across a problem for which I am making a foray into bioinformatics. More specifically, my goal is to identify exact sequences that are repeated within one region of the genome but do not appear anywhere else in the genome. For example, I may want to find all sequences longer than 5 base pairs that are repeated more than 10 times on chromosome 2 but are not present on any other chromosome.
I imagine that others have addressed this problem, but I have no idea where to start looking for preexisting code, or even what language I should be starting out with. I have some experience programming, have taken intro bioinformatics classes, and am not afraid to learn much more, but as a beginner it's difficult for me to judge what areas I should be focusing on.
My questions are:
- Where should I look for existing resources to solve my problem?
- What language is most suited to my question?
- Should I be starting from scratch, writing my own algorithm, or should I start with something more user friendly, like Galaxy?
Start by searching on Google and you will end up at papers like this one (PubMed).
Thanks for the link! Because I'm unfamiliar with the jargon, it's been difficult for me to find relevant papers, but the one you reference looks like a good place for me to begin.
Actually, that's very specific, and not at all easy. I can't imagine why someone else would have wanted to solve this problem. Why do you want to solve it? Also, your definition is vague. Do you mean tandem repeats, or repeats in general?
I mean repeats in general. My long-term goal is to design a single sgRNA for CRISPR that will cut at multiple positions within a defined region of a chromosome. I know that sgRNA design and avoiding off-target effects are very complicated, but I first wanted to determine whether there are repeats that are long enough, and unique to a region, to merit attempting to design the sgRNAs.
In the human genome, there are probably no 6bp sequences that are present more than 10 times on one chromosome and absent from every other chromosome. A given 6bp sequence has a 1/4^6 (or 1 in 4096) chance of occurring at any particular position in random sequence, so a 100Mbp chromosome would be expected to contain it roughly 24,000 times purely by chance. It's unrealistic to expect it to be missing from a chromosome of that size.
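To make that arithmetic concrete, here is a minimal sketch of the calculation. It assumes a uniform, independent base composition (25% each of A, C, G, T), which real genomes do not follow, so treat the numbers as rough guides only. The function names are made up for this illustration.

```python
def expected_occurrences(k, length):
    """Expected count of one specific k-mer in random sequence of the given length.

    Assumes each base is independent and equally likely (a simplification).
    """
    positions = length - k + 1          # number of places a k-mer can start
    return positions / 4 ** k           # each start matches with probability 1/4^k


def min_k_for_rarity(length):
    """Smallest k at which a specific k-mer is expected less than once by chance."""
    k = 1
    while expected_occurrences(k, length) >= 1:
        k += 1
    return k


chrom_length = 100_000_000              # the 100Mbp chromosome from the answer above
print(round(expected_occurrences(6, chrom_length)))   # about 24,000 copies of any 6-mer
print(min_k_for_rarity(chrom_length))                  # k at which chance matches become rare
```

Under these assumptions a sequence needs to be around 14bp before it stops turning up by chance in a chromosome of this size, which is why much longer k-mers are the sensible place to look.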
You can use KmerCountExact in the BBMap package to count the occurrences of specific kmers in a genome. For example,
This will give you the counts of all 17-mers in the human genome, so you can find the ones that occur only once.
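If you want to see the counting idea itself, here is a rough pure-Python sketch. It is not how KmerCountExact works internally, and it will not scale to a whole human genome in memory. It assumes the sequences are already loaded into a dict of name to string, looks at the forward strand only (for sgRNA design you would also need reverse complements), and skips k-mers containing anything other than A, C, G, T. The function names and the toy sequences are invented for the example.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(seq, k):
    """Count every overlapping k-mer on the forward strand, skipping ambiguous bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:    # ignore k-mers with N or other symbols
            counts[kmer] += 1
    return counts


def region_specific_repeats(seqs, target, k, min_count):
    """Return k-mers seen at least min_count times in `target` and nowhere else."""
    target_counts = kmer_counts(seqs[target], k)
    elsewhere = set()
    for name, seq in seqs.items():
        if name != target:
            elsewhere.update(kmer_counts(seq, k))
    return {
        kmer: n
        for kmer, n in target_counts.items()
        if n >= min_count and kmer not in elsewhere
    }


# Toy data only; real input would come from a FASTA file.
toy = {
    "chr1": "ACGTACGTTTGACA",
    "chr2": "GGATCCGGATCCAAGGATCC",
}
hits = region_specific_repeats(toy, "chr2", k=6, min_count=3)
print(hits)
```

For real genomes, a dedicated k-mer counter run once per region and once on the rest of the genome, followed by a comparison of the two outputs, would follow the same logic with far less memory.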