I want to use a machine learning algorithm to predict whether a particular gene is expressed based on its binding to multiple histones/proteins (likely using ChIP-seq data).
There would be a matrix sorted by region (like a BED file) containing data such as whether the region has a called peak (from ChIP-seq data), whether the gene is expressed (from RNA-seq data), and any other NGS data that could be integrated.
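As a minimal sketch of the kind of matrix described above: the regions, peak coordinates, mark names, and expression labels below are all made up for illustration, and the overlap test assumes BED-style 0-based, half-open intervals. It is not meant as a replacement for bedtools, only as a picture of the target layout.

```python
# Hypothetical toy data; every coordinate and label here is invented.
# Intervals are (chrom, start, end), 0-based and half-open like BED.
regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
peaks = {
    "H3K4me3": [("chr1", 150, 250), ("chr2", 400, 500)],
    "H3K27ac": [("chr1", 550, 580), ("chr2", 100, 120)],
}
expressed = {
    ("chr1", 100, 200): 1,
    ("chr1", 500, 600): 0,
    ("chr2", 50, 150): 1,
}


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def build_matrix(regions, peaks, expressed):
    """Return (header, rows): one row per region, one 0/1 column per mark."""
    marks = sorted(peaks)
    header = ["chrom", "start", "end"] + marks + ["expressed"]
    rows = []
    for region in sorted(regions):
        flags = [int(any(overlaps(region, p) for p in peaks[m])) for m in marks]
        rows.append(list(region) + flags + [expressed[region]])
    return header, rows


header, rows = build_matrix(regions, peaks, expressed)
for row in [header] + rows:
    print("\t".join(str(value) for value in row))
```

Each row is one region, each peak column is a binary feature, and the last column is the label a classifier would be trained to predict.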
However, I am having some issues:
I’m having trouble integrating the RNA-seq and ChIP-seq data. I’m trying to use the intersect command from bedtools, but I am not getting any results:
bedtools intersect -a ref.bed -b fileA.bed fileB.bed > output.bed
Is there another/better way to see the overlap?
Ideally, I would like to use multiple cell types so that the model generalizes. However, this would require adding a third dimension to my data, and all of the tools I am familiar with only take two-dimensional data. How best would I incorporate this extra dimension into my dataset?
Data with more than two dimensions are generally called tensors in the machine learning and data mining communities. There are several ways forward, depending on your data. You could try tensor regression or support tensor regression, use kernels on tensors so that you can fall back on standard kernel methods, or use tensor factorization to project your data into a latent feature space where standard 2D methods apply. If you're into the current deep learning fashion, you could also use a neural network to extract features for use with a more standard machine learning method.
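For illustration only: the most basic way to get from a 3D array back to the 2D input that standard tools expect is to unfold (matricize) the tensor. This is a simpler stand-in for, not an implementation of, the regression, kernel, or factorization approaches mentioned above. The cell types, regions, marks, and values are hypothetical, and the tensor is assumed to be indexed as `tensor[cell_type][region][feature]`.

```python
# Hypothetical toy tensor, indexed as tensor[cell_type][region][feature].
cell_types = ["K562", "GM12878"]
regions = ["region1", "region2", "region3"]
features = ["H3K4me3", "H3K27ac"]
tensor = [
    [[1, 0], [0, 1], [1, 1]],  # K562
    [[0, 0], [1, 1], [1, 0]],  # GM12878
]


def unfold_by_sample(tensor, cell_types, regions):
    """Long layout: one row per (cell type, region) pair, one column per feature."""
    labels, rows = [], []
    for c, cell in enumerate(cell_types):
        for r, region in enumerate(regions):
            labels.append((cell, region))
            rows.append(list(tensor[c][r]))
    return labels, rows


def unfold_by_region(tensor, cell_types, regions):
    """Wide layout: one row per region, features of all cell types side by side."""
    rows = []
    for r in range(len(regions)):
        row = []
        for c in range(len(cell_types)):
            row.extend(tensor[c][r])
        rows.append(row)
    return rows


labels, long_matrix = unfold_by_sample(tensor, cell_types, regions)
wide_matrix = unfold_by_region(tensor, cell_types, regions)
print(len(long_matrix), "x", len(long_matrix[0]))
print(len(wide_matrix), "x", len(wide_matrix[0]))
```

The long layout treats every (cell type, region) pair as a separate sample, which suits a model meant to generalize across cell types. The wide layout keeps one sample per region and discards the grouping of columns by cell type, which is the structure that tensor methods are designed to preserve.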
Out of interest, Jean-Karim, if you are working in this area, which programs / resources are you using?
I assume you mean tensors rather than deep learning. For this, I am using R with the rTensor package as the base for my own functions (e.g. tensor ridge regression). There's also the nnTensor package for non-negative factorizations.
As for bedtools, try:
-b fileA.bed,fileB.bed