You are correct. Statements about dimensionality make sense in the context of vector spaces, where dimensionality can be defined as the maximum number of linearly independent vectors. Therefore, to make any statement about the dimensionality of a set of strings, which sequencing data ultimately are (NGS or not), the strings need to be transformed into a vector space, possibly as in this paper.
If I understood it correctly, they select N prototypes and map each string to the vector space by computing its edit distance to each prototype, yielding a real vector of N elements.
Formally, if we denote a set of strings (over an alphabet $A$) by $X \subseteq A^*$
and a set of prototypes by $P = \{p_1, \ldots, p_n\} \subseteq X$, the transformation
$t_n^P : X \to \mathbb{R}^n$ is defined as a (not necessarily injective) function, where
$t_n^P(x) = (d(x, p_1), \ldots, d(x, p_n))$ and $d(x, p_i)$ is the edit distance
between the strings $x$ and $p_i$. Obviously, the dimension of the vector
space equals the number of prototypes.
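To make the mapping concrete, here is a minimal sketch in Python. It assumes that "edit distance" means plain Levenshtein distance (unit cost for insertion, deletion and substitution); the paper may use a different cost scheme. The prototypes and reads are made-up toy strings, and the names `edit_distance` and `embed` are mine, not the paper's.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def embed(x: str, prototypes: list[str]) -> list[int]:
    """Map a string to the vector of its edit distances to each prototype."""
    return [edit_distance(x, p) for p in prototypes]


# Hypothetical toy data: three prototypes give three-dimensional vectors.
prototypes = ["ACGT", "AAAA", "TTGA"]
reads = ["ACGT", "ACGA", "TTGA"]
vectors = [embed(r, prototypes) for r in reads]
```

Each read ends up as a vector with exactly `len(prototypes)` components, which is the sense in which the number of prototypes fixes the dimension. Two different strings can land on the same vector, so the map is not injective in general.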
I am not sure, but there seems to be no rigorous proof that the result really is a vector space (I doubt that this holds for arbitrary prototypes). However, if we assume that it is, then we could say that we need a large number of prototypes to fully represent sequencing data. If we take this position, we can conclude that the dimensionality of NGS data should be less than or equal to that of the set of all possible substrings of the genome (possibly of a certain length) (edit: is this true, by the way?), because all NGS reads are substrings of the genome (plus some errors). That in turn would indicate that there is nothing special about the dimensionality of NGS data if one only wishes to be able to represent every possible outcome, and that whoever made this statement wanted to point out that the data is simply "large" or "high-volume", ignoring the meaning of the term in the context of linear algebra.
Another concept to consider in the context of machine learning on strings is string kernels, which also have some applications in bioinformatics (e.g. Leslie et al.).
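As an illustration of what a string kernel looks like, here is a minimal sketch of a k-mer spectrum kernel in Python: each string is implicitly represented by its k-mer counts, and the kernel value is the inner product of those count vectors. Whether this is exactly the variant used in the cited work is an assumption on my part, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both strings contribute to the sum.
    return sum(cs[m] * ct[m] for m in cs.keys() & ct.keys())
```

The feature space here has one dimension per possible k-mer (4^k for a DNA alphabet), but the kernel never builds that vector explicitly.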
I don't think that your answer is quite correct. In principle, every variable is a dimension: in your RNA-Seq example, for instance, every identified transcript represents a variable, with the associated read count being its value.
You are right. I actually had it the wrong way around. I've fixed my post.