How would you feel about a round of code golf? Give us your prettiest, shortest, quickest pieces of code! Languages that nobody else has posted yet are especially welcome...
Objective: create a BED file of all homopolymers (runs of the same nucleotide) of length at least N in the (human or other) reference genome.
The input is a FASTA file, for example the human genome.
Expected output example:
chr1 11540 11546
chr1 14908 14913
chr1 15468 15473
chr1 16318 16323
chr1 16505 16511
chr1 19735 19741
chr1 20316 20321
A useful benchmark set would be human chromosome 22. My Python code (see below), searching for homopolymers of length 5 or longer, finds 244503 hits. The first 10 lines are:
chr22 16050521 16050526
chr22 16050548 16050553
chr22 16050570 16050575
chr22 16050578 16050583
chr22 16050679 16050684
chr22 16050835 16050840
chr22 16050932 16050937
chr22 16051192 16051198
chr22 16051303 16051310
chr22 16051311 16051317
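For anyone who wants a starting point, here is a minimal, un-golfed sketch of the task in Python. It is not the code referred to above, and it rests on a few assumptions: the function names `read_fasta` and `homopolymers` are made up for illustration, coordinates are reported as 0-based half-open intervals (the usual BED convention), soft-masked lowercase bases are treated like uppercase ones, and only A, C, G and T count, so runs of N are skipped. It also holds one whole sequence in memory at a time.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def homopolymers(lines, min_len=5):
    """Yield (chrom, start, end) for each run of one base of at least min_len.

    Assumption: 0-based, half-open coordinates; runs of N are ignored.
    """
    # One base followed by at least min_len - 1 copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    for name, seq in read_fasta(lines):
        # Upper-case so that soft-masked (lowercase) runs are found too.
        for match in pattern.finditer(seq.upper()):
            yield name, match.start(), match.end()
```

To write BED lines, loop over `homopolymers(handle, N)` with an open FASTA file handle and print the three fields separated by tabs. Whether the hit count matches the benchmark above depends on the coordinate and masking assumptions listed here.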
DUPLICATE ! DUPLICATE ! DUPLICATE ! How to extract all the simple repeats from the hg19 reference genome
:-D
We're waiting for your HTML solution, Pierre.