Question

Dimensionality reduction on data sets with variable dimensions

0

21 months ago

rtrende ▴ 80

My lab has some RNA secondary structure data on a number of virus RNA segments from a method called SHAPE-MaP, which gives each nucleotide in a sequence a reactivity value. We would like to make a 2D plot that clusters segments with similar reactivity value profiles together. However, each segment has a different number of nucleotides, and so we can't figure out how to use standard tools like PCA or t-SNE because each dataset has a different number of dimensions. Is there a way to perform dimensionality reduction on a data set where each point has a different number of dimensions?

PCA dimensionality_reduction RNA_secondary_structure • 906 views

updated 20 months ago by Mensur Dlakic ★ 29k • written 21 months ago by rtrende ▴ 80

score 1 · Answer 1 · 2023-12-02

1

21 months ago

Mensur Dlakic ★ 29k

I'm not sure whether what you want to do will be informative. That aside, UMAP works with sparse data: pad the shorter sequences with missing values so that every profile has the same length as the longest sequence.
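A minimal sketch of the padding step, in case it helps. It assumes the "missing" positions are filled with 0.0, because UMAP's input validation typically rejects NaN; the fill value, the function names `pad_profiles` and `embed_2d`, and the reactivity numbers are all made up for illustration. The UMAP call assumes the `umap-learn` and `numpy` packages are installed.

```python
from typing import List, Sequence


def pad_profiles(profiles: Sequence[Sequence[float]], fill: float = 0.0) -> List[List[float]]:
    """Right-pad every reactivity profile to the length of the longest one."""
    if not profiles:
        return []
    width = max(len(p) for p in profiles)
    return [list(p) + [fill] * (width - len(p)) for p in profiles]


def embed_2d(matrix: List[List[float]], random_state: int = 0):
    """Project the padded matrix to 2D with UMAP (needs numpy and umap-learn)."""
    import numpy as np
    import umap  # provided by the umap-learn package

    data = np.asarray(matrix, dtype=np.float32)
    # Keep n_neighbors below the number of segments (assumed heuristic).
    n_neighbors = max(2, min(15, data.shape[0] - 1))
    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors, random_state=random_state)
    return reducer.fit_transform(data)


# Made-up reactivity values for three segments of different lengths.
profiles = [
    [0.1, 0.8, 0.3],
    [0.5, 0.2, 0.9, 0.4, 0.7],
    [0.6, 0.1],
]
padded = pad_profiles(profiles)  # three rows, each of length 5
```

With real data you would then call `embed_2d(padded)` and plot the two returned columns. Note that right-padding implicitly aligns all segments at their 5' end, and that UMAP can be unstable with only a handful of points, so treat the resulting plot with caution.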

0

I see what you mean: this method isn’t a perfect way to compare segments. But I’ve had a very hard time finding other similarity metrics to use. Any idea what might be a better way to quantify or visualize how similar our data sets are?

Reply • 20 months ago by rtrende ▴ 80

1

You can try DNA/RNA language models and extract embeddings for your sequences. The embeddings should all have the same length, so presumably they can be used for dimensionality reduction.
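To illustrate why embeddings end up the same length: many models return one vector per nucleotide, and averaging those vectors over the sequence (mean pooling, which is an assumption here and only one of several options) gives a fixed-size vector regardless of sequence length. The sketch below uses a toy one-hot lookup as a stand-in for a real language model; `TOY_TOKEN_VECTORS`, `token_embeddings`, and `sequence_embedding` are hypothetical names, not part of any library.

```python
from typing import Dict, List

# Toy stand-in for a language model: one fixed vector per nucleotide.
# A real model would return context-dependent vectors of a larger dimension.
TOY_TOKEN_VECTORS: Dict[str, List[float]] = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "U": [0.0, 0.0, 0.0, 1.0],
}


def token_embeddings(sequence: str) -> List[List[float]]:
    """Return one vector per nucleotide (length varies with the sequence)."""
    return [TOY_TOKEN_VECTORS[base] for base in sequence.upper()]


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors into a single fixed-length vector."""
    if not vectors:
        raise ValueError("cannot pool an empty sequence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sequence_embedding(sequence: str) -> List[float]:
    """Fixed-length embedding for a sequence of any length."""
    return mean_pool(token_embeddings(sequence))


short = sequence_embedding("ACGU")        # 4 nucleotides
long = sequence_embedding("AAAACCGGUU")   # 10 nucleotides
# Both results have 4 entries, so they can be stacked into one matrix.
```

The pooled vectors can be stacked row by row and passed to PCA, t-SNE, or UMAP. Keep in mind that sequence embeddings describe the sequence, not the measured SHAPE-MaP reactivities.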

ADD REPLY • link 20 months ago by Mensur Dlakic ★ 29k