My lab has RNA secondary structure data for a number of viral RNA segments, generated with a method called SHAPE-MaP, which assigns a reactivity value to each nucleotide in a sequence. We would like to make a 2D plot that clusters segments with similar reactivity profiles together. However, each segment has a different number of nucleotides, so we can't figure out how to use standard tools like PCA or t-SNE, because each segment's profile has a different number of dimensions. Is there a way to perform dimensionality reduction on a dataset where each point has a different number of dimensions?
I see what you mean that this method isn’t a perfect way to compare segments, but I’ve had a very hard time finding other similarity metrics to use. Any idea what might be a better way to quantify or visualize how similar our segments' profiles are?
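You could try a DNA/RNA language model and extract embeddings for your sequences. Once pooled over positions (for example, by averaging the per-nucleotide vectors), the embeddings have the same length for every segment regardless of how many nucleotides it has, so they can presumably be used for dimensionality reduction with PCA or t-SNE.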
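To make the idea concrete, here is a minimal sketch of the pooling and projection steps, not a definitive pipeline. It assumes you already have one array of per-nucleotide embeddings for each segment, of shape (number of nucleotides, embedding dimension). Random arrays stand in for the model output here, and the segment lengths, the embedding dimension, and the helper names `mean_pool` and `pca_2d` are all made up for illustration; no particular language model or its API is assumed.

```python
import numpy as np


def mean_pool(per_nt_embeddings):
    """Average an (n_nucleotides, d) array over positions, giving a (d,) vector."""
    return np.asarray(per_nt_embeddings, dtype=float).mean(axis=0)


def pca_2d(X):
    """Project the rows of X onto their first two principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n_segments, 2)


# Stand-in data: random arrays in place of real language-model output.
# Segment lengths and embedding size are hypothetical.
rng = np.random.default_rng(0)
segment_lengths = [890, 1027, 1413, 1565, 2233]
embed_dim = 64
per_segment = [rng.normal(size=(n, embed_dim)) for n in segment_lengths]

# Each segment becomes one fixed-length vector, whatever its length.
X = np.stack([mean_pool(e) for e in per_segment])   # shape (5, 64)

# 2D coordinates, one row per segment, ready for a scatter plot.
coords = pca_2d(X)                                   # shape (5, 2)
print(coords.shape)
```

Keep in mind that embeddings computed from sequence alone do not see your SHAPE-MaP reactivities, so the resulting plot would group segments by sequence-derived features rather than by the measured reactivity profiles.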