I have my genome as well as some predictors (eg chromosome, GC content, etc) for a response variable, for each window of the genome. I'm working in TensorFlow.
I also want to use the k-mers or maybe k-mer frequency as a predictor. My issue is dimensionality. For example, if I want to one-hot encode all 5-mers, then that is 5! = 120 columns [EDIT: actually 5^5 = 3125], which is still feasible but only captures short-range data. If I want to encode all 10-mers, that is 10^10 columns. This does not seem like the best way to go about it.
I could also think about using a 1D-CNN (not something I've done before) but in this case my understanding is that I would be effectively just feeding in the sequence. I don't see how I could both feed in sequence data as well as features like chromosome, GC content and more.
What is the best way to go about including k-mers of a genomic window as a predictor alongside some of the other features I have mentioned?
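If using a CNN, you'd one-hot encode your nucleotides. There are a few ways to go about including the "extra" information: 1) include it as an extra "nucleotide", i.e. additional input channels alongside the four one-hot channels, with window-level values such as GC content repeated at every position, or 2) pass it in as a second input and merge it with the convolutional features using "Concatenate" (in Keras).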
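Option 2 could look roughly like the sketch below. The window length, number of extra features, layer sizes and the `one_hot` helper are all placeholders I made up for illustration, not tuned values, and a categorical feature like chromosome would need its own encoding (e.g. one-hot) before going into the extra-feature vector.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

WINDOW_LEN = 200  # assumed window length in bp (hypothetical)
N_EXTRA = 3       # assumed number of per-window features, e.g. GC content (hypothetical)

def one_hot(seq):
    """One-hot encode a DNA string as a (len, 4) array; non-ACGT bases stay all-zero."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    arr = np.zeros((len(seq), 4), dtype="float32")
    for i, base in enumerate(seq.upper()):
        if base in index:
            arr[i, index[base]] = 1.0
    return arr

# Branch 1: the raw sequence goes through a 1D convolution.
seq_in = tf.keras.Input(shape=(WINDOW_LEN, 4), name="sequence")
x = layers.Conv1D(32, kernel_size=8, activation="relu")(seq_in)
x = layers.GlobalMaxPooling1D()(x)

# Branch 2: the per-window features enter as a plain vector.
extra_in = tf.keras.Input(shape=(N_EXTRA,), name="extra_features")

# Merge both branches, then predict the response (a single regression output here).
merged = layers.Concatenate()([x, extra_in])
hidden = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(1)(hidden)

model = tf.keras.Model(inputs=[seq_in, extra_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
```

You would then train with something like `model.fit([X_seq, X_extra], y)`, where `X_seq` has shape `(n_windows, WINDOW_LEN, 4)` and `X_extra` has shape `(n_windows, N_EXTRA)`. Swap the last layer and the loss if your response is a class rather than a number.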
If you want to use k-mer content to predict whether your organism is bacteria or human, then just use k-mer frequencies rather than a CNN.
If you want to predict cis-regulatory profiles (e.g. scan an input sequence and say: ah hah, here's where chromatin will be accessible), then you use a CNN.
It all depends on your prediction task :)
Your math is a bit off. Ignoring reverse-complements, there are 4^5 = 1024 5-mers and 4^10 = 1048576 10-mers. Collapsing reverse-complements gives you 512 5-mers and 524800 10-mers (slightly more than half for 10-mers, because the 1024 10-mers that are their own reverse complement have no partner to merge with). However, you don't one-hot encode these; that's for raw sequence. Instead, you use the abundance (fraction) of each k-mer as the input for its column.
Overall, the choice of k-mer length depends on the length of the features you are interested in and how much data you have for training.
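To make the counts and the "abundance per column" idea concrete, here is a minimal pure-Python sketch. The function names are mine, and skipping windows that contain non-ACGT characters (e.g. N) is an assumption about how you'd want to handle them.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, revcomp)

def n_canonical(k):
    """Number of k-mers after collapsing reverse complements."""
    if k % 2:  # odd k: no k-mer is its own reverse complement
        return 4**k // 2
    return (4**k + 4**(k // 2)) // 2  # even k: 4^(k/2) self-complementary k-mers

def canonical_kmers(k):
    """Sorted list of canonical k-mers; these are the feature columns."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})

def kmer_frequencies(seq, columns):
    """Fraction of each canonical k-mer in seq, in the order given by columns."""
    k = len(columns[0])
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip windows containing N etc.
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in columns]

columns = canonical_kmers(5)                       # 512 feature columns
features = kmer_frequencies("ACGTTGCAAGGCTTAACG", columns)
```

Each genomic window then becomes one row of `features`, which you can put next to GC content and your other predictors in an ordinary feature matrix.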