Hi there!
I would like some advice on comparing the composition of a real genome against a randomized genome.
The question is about the randomization (as a background or null model). When I randomize (shuffle) a genome to compare k-mer composition, is it important to preserve the base frequencies or the dinucleotide frequencies?
I did read some papers, but they used an expected value instead!
I am looking for a count-based method, similar to the k-mer count method.
Any tip or paper using a similar approach would be appreciated!
Paulo
For transcription factor sequences or CpG islands, say, you cannot treat nucleotide frequencies as independent or shuffle bases. You might investigate hidden Markov model (HMM) approaches for generating simulated sequence, based upon categories of background regions.
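To make the difference between the two kinds of background concrete, here is a minimal Python sketch. It is not a full HMM: as a simpler stand-in it fits a first-order Markov chain to the input, so dinucleotide frequencies are preserved only in expectation, while the plain shuffle preserves base frequencies exactly and nothing else. The function names (`count_kmers`, `shuffle_bases`, `markov1_sequence`) and the toy sequence are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def count_kmers(seq, k):
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffle_bases(seq, rng):
    """Mononucleotide shuffle: keeps base counts exactly, destroys everything else."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def markov1_sequence(seq, rng):
    """Sample a same-length sequence from a first-order Markov chain fitted to seq.

    Assumption: this approximates a dinucleotide-preserving background;
    dinucleotide frequencies match the input only in expectation.
    """
    transitions = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1
    base_counts = Counter(seq)

    out = [seq[0]]
    while len(out) < len(seq):
        successors = transitions[out[-1]]
        # Fall back to overall base counts if this base was never followed by anything.
        weights = successors if successors else base_counts
        out.append(rng.choices(list(weights), weights=list(weights.values()))[0])
    return "".join(out)


# Toy input (hypothetical); replace with a real genome sequence.
genome = "ACGCGCGTATATACGCGCGATATCGCGCGTTAACGCG" * 20
rng = random.Random(0)  # fixed seed so runs are repeatable
k = 4

real = count_kmers(genome, k)
mono = count_kmers(shuffle_bases(genome, rng), k)
di = count_kmers(markov1_sequence(genome, rng), k)

# Compare the observed count of each k-mer with its count in both backgrounds.
for kmer, n in real.most_common(5):
    print(kmer, n, mono[kmer], di[kmer])
```

In practice you would generate many randomized sequences per model and compare each observed k-mer count against the resulting distribution of counts, rather than against a single shuffle.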
Got it. Thank you @Alex Reynolds
Why not compare actual genomes? DNA sequence in genomes is not random. What kind of significance do you expect to gain from unnatural comparisons?
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058038 To do a study similar to this one...