I've written a paper about DNA sequence analysis. This paper attempts to use Bayesian modelling for a set of DNA sequences. It will probably end up either in a statistics journal, or, more likely, in a bioinformatics journal. My concern is that biologists may take exception to some of the language in the introduction.
I'm attempting to make a connection between de novo motif discovery and classification of the sequences. Maybe it is a bit of a stretch. E.g. I use language like "analyzing a set of DNA sequences with biological significance solely by focusing on the motifs contained within them potentially discards valuable information, for example, possible long-range correlations between nucleotide positions in the sequences." Also, "An alternative, and possibly complementary approach, is to consider a sequence as a single unit, and try to do direct statistical analysis on it... This approach is used in this paper, which does not use Markovian techniques. Instead, it tries to model correlation structure across the sequence."
So, the question is whether it is better to try to make an explicit connection, at the risk of saying things that are incorrect and generally over-stretching, or just to say (which seems a little lame) that this sequence classification problem is related to the de novo motif discovery problem and leave it at that. Comments?
I include the first few paragraphs of the introduction below. This includes all the relevant language.
I'm willing to send my current draft to anyone who is interested in knowing more about the context. I don't want to post a public link to it, though.
"DNA sequence motifs are nucleotide sequence patterns that are conjectured to have a biological significance. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF). Others are involved in important processes at the RNA level, including ribosome binding, mRNA processing (splicing, editing, polyadenylation) and transcription termination. Motif discovery is a very active area of research interest. So-called “De novo computational discovery” is perhaps the most popular, where given only a set of DNA sequences, an algorithm is used to identify candidate shared motifs. This can be thought of as the task of finding a set of non- overlapping, approximately matching substrings given a starting set of strings. This is a very difficult problem.
From a more general perspective, DNA sequence analysis is often done using DNA sequence motifs. It is reasonable to ask the question: what makes a sequence a motif? From a biological perspective, a motif is simply the smallest identifiable sequence subcomponent of something larger. This subcomponent can be thought of as the smallest identifiable piece of functionality related to the underlying biology. Therefore, sequence analysis often focuses on identifying these motifs. However, these motifs are typically very short, so analyzing a set of DNA sequences with biological significance solely by focusing on the motifs contained within them potentially discards valuable information, for example, possible long-range correlations between nucleotide positions in the sequences. Note also that the statistical methods used to identify motifs are typically Markovian, like Hidden Markov Models (HMM), which are naturally tailored towards looking at small sequences.
An alternative, and possibly complementary approach, is to consider a sequence as a single unit, and try to do direct statistical analysis on it. This approach is less often used. One reason is that such sequences can quickly grow too large, and are not well suited to Markovian approaches. This approach is used in this paper, which does not use Markovian techniques. Instead, it tries to model correlation structure across the sequence.
We do this by fitting a suitable Bayesian model to that set using Bayesian model selection. As noted above, our major rationale for this model is the assumption that the nucleotide locations of this set are correlated among themselves. With this assumption in mind, we construct a family of probability distributions to capture this correlation information, described in Subsection 2.1."
Saying things that you know are incorrect is never a good idea in a scientific paper. The fact that the alternative sounds lame is no reason to invent things. I would also say that your approach is not an alternative but complementary, because short sequence motifs are the most important things one can learn about a sequence, and nobody would dismiss this information. But saying that de novo discovery as we do it now might miss things is not incorrect. The limitations are obvious, but maybe you can even come up with a reference for this, and make the introduction less wordy. Did you get any biological insight from your method?
Hi Ido, Thanks for the comments. I don't know whether what I am saying is incorrect. I said there is "a risk" that what I was saying is incorrect. I think that what I'm saying is at least plausible and reasonable (otherwise I would not have put it in), but (a) the kind of statements I'm making are intrinsically fuzzy, and (b) it is hitting the limits of my understanding about things like motifs (which, I frankly admit, I don't understand too well). In fact, I posted a question here asking what motifs exactly are, and also on biology.sx. The consensus from everyone who answered was that motifs are not too well-defined. :-) Basically, what I'm asking is whether there is something in the statements I've made that someone like a referee could take exception to, and if so, what. I'm looking for a bit of a heads up. As for coming up with a reference on the limitations of motifs, that's a good idea. Any specific suggestions about which directions to look in? Also, you ask whether I got any biological insight with my method. The answer is that I don't know. The method predicts RSS very well, so well that it must be picking up some kind of information, but I'm honestly not sure what to make of it. I don't use any biological information in the paper per se, just frequency information, so I conclude that long-range correlation is important and a good predictor, but I haven't drawn any further conclusions. I'm happy to send you a copy of the paper if you want to take a closer look.
I'm a biologist, and I don't understand many aspects of your introduction. (1) Why refer to de novo computational discovery as "so-called"? Your phrasing suggests the term is somewhat disingenuous or misleading. Do you mean to cast doubt on it, in favor of some "real" definition of de novo computational discovery? Or are you simply referring to the technique? (2) You might mention a definition of long-range correlation in sequence analysis, or offer a reference. I'm familiar with motif analysis but I have no idea what long-range correlation is in sequence analysis, and while all my friends could spout a definition of motif analysis off the top of their heads, I bet none of them could tell me what long-range correlation analysis is. (3) Your use of the term "discards information" may be a little contentious. Motif analysis within a set of sequences is simply what it says it is: looking for enrichment of a given pattern under a certain kind of definition. One might say it is supposed to "discard" information. However, it may be better to describe the technique as "ignoring" potentially valuable information.
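On point (2), one possible way to make "long-range correlation" concrete for a set of equal-length sequences is the mutual information between two distant positions: it is zero when the nucleotide at one position tells you nothing about the other, and grows as they become more dependent. The sketch below is only an illustration under that assumption; it is not necessarily the correlation structure the paper's Bayesian model uses. The function name `column_mutual_information` and the toy sequences are made up for the example.

```python
from collections import Counter
from math import log2


def column_mutual_information(seqs, i, j):
    """Mutual information (in bits) between positions i and j,
    estimated from raw frequencies across equal-length sequences.
    Hypothetical helper for illustration only."""
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    i_counts = Counter(s[i] for s in seqs)
    j_counts = Counter(s[j] for s in seqs)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        mi += p_ab * log2(p_ab * n * n / (i_counts[a] * j_counts[b]))
    return mi


# Made-up toy data: the first and last positions always agree,
# while the second position is constant.
toy = ["ACGTA", "ACGTA", "TCGTT", "TCGTT"]

print(column_mutual_information(toy, 0, 4))  # dependent positions
print(column_mutual_information(toy, 0, 1))  # constant column carries no information
```

A motif-only analysis that looks at short windows would never compare positions 0 and 4 here, which is the sense in which it ignores this kind of signal. With real data the plain frequency estimate is biased for small sample sizes, so some correction or a null comparison would be needed.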