Hi,
I am analyzing scRNA-seq data from breast cancer. I want to classify the cells into E+ and E- groups based on the expression level of gene E (cells with low expression of gene E are E-, and cells with high expression are E+). However, gene E has several members (isoforms), and I have to combine them to classify the cells. I have two plans:
1. Simply sum the normalized expression values of these members and classify the cells based on the sum.
2. Use a clustering algorithm: cells that express a specific combinatorial pattern of these members are defined as E+, and the rest are defined as E-.
Do you have any suggestions on my plans, or any other approaches?
Thanks
Can you clarify what you mean by isotype? To me this is a term applied to antibodies.
Thanks,
I have edited my post. There are many members of gene E in the human genome. Maybe I can call them isoforms. However, I don't mean that these members are generated by alternative splicing of the same pre-mRNA.
Those would be paralogs if they arose from duplication events. Summing up across paralogs only makes sense if you know/believe that all the paralogs contribute to the same function/biological outcome of interest.
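If you do decide that summing is justified, a minimal sketch could look like the following. It assumes a cells x genes matrix of normalized expression; the function name, the column indices standing in for the paralogs, and the threshold are all hypothetical and would have to come from your own data.

```python
import numpy as np

def classify_by_summed_expression(expr, paralog_idx, threshold):
    """Return (score, is_e_pos) from summed paralog expression.

    expr        : cells x genes array of normalized expression (assumed).
    paralog_idx : column indices of the paralogs to combine (hypothetical).
    threshold   : cutoff on the summed score, chosen by the analyst.
    """
    expr = np.asarray(expr, dtype=float)
    # Per-cell score: sum over the selected paralog columns only.
    score = expr[:, paralog_idx].sum(axis=1)
    return score, score > threshold

# Toy data: 4 cells x 3 genes; columns 0 and 1 stand in for two active paralogs.
expr = np.array([
    [0.0, 0.1, 2.0],
    [1.5, 2.0, 0.0],
    [0.2, 0.0, 1.0],
    [3.0, 0.5, 0.3],
])
score, is_e_pos = classify_by_summed_expression(expr, [0, 1], threshold=1.0)
labels = np.where(is_e_pos, "E+", "E-")
print(labels)
```

Note that if the values are log-transformed, their sum is not the same as the log of the summed counts, so it may be worth deciding which scale you want the combined score on. The threshold here is arbitrary; looking at the distribution of the score across cells is one way to pick it.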
Clustering, being unsupervised, doesn't guarantee that all E+ (or E-) cells will fall into one cluster, although you're free to label all members of a cluster as E+ or E- based on some information. However, if you can evaluate and label clusters, then you probably have information that you could use for a more directly supervised approach like logistic regression.
Thank you for the helpful reply.
There are about 15 paralogs of gene E, and 8 of them are active. Although the enzyme activities of these 8 paralogs may not be the same, summing across the active paralogs could be the best approach for me.
I don't expect that cells with high expression of gene E will fall into the same cluster. I just want to find cells with a specific combinatorial pattern of gene E paralogs and then look at the patterns and differences between cells with different expression patterns. Here I call cells with a specific combinatorial pattern of gene E paralogs E+ cells.
I am very interested in an approach like the logistic regression you mentioned. Do you have a more detailed reference? I don't know how to do it based on what you said above.
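As a rough illustration of the supervised idea, here is a minimal sketch using scikit-learn's `LogisticRegression`. Everything about the data is an assumption: the expression matrix is simulated, the E+/E- labels for a subset of cells are made up, and in practice those labels would have to come from independent information (for example a marker or an annotation you trust). The features are the per-cell expression values of the 8 active paralogs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated stand-in for normalized expression: 200 cells x 8 paralogs.
n_cells, n_paralogs = 200, 8
X = rng.normal(loc=1.0, scale=0.5, size=(n_cells, n_paralogs))

# Hypothetical ground truth: the first 100 cells are E+ (1) and have
# higher expression of paralogs 0-2; the remaining cells are E- (0).
y = np.zeros(n_cells, dtype=int)
y[:100] = 1
X[:100, :3] += 2.0

# Pretend that only 60 cells carry a trusted label.
labeled = rng.choice(n_cells, size=60, replace=False)
unlabeled = np.setdiff1d(np.arange(n_cells), labeled)

# Fit on the labeled cells, using paralog expression as features.
model = LogisticRegression(max_iter=1000)
model.fit(X[labeled], y[labeled])

# Predicted probability of E+ for the remaining cells.
prob_e_pos = model.predict_proba(X[unlabeled])[:, 1]
pred = (prob_e_pos > 0.5).astype(int)

# Only possible here because the simulated truth is known.
accuracy = (pred == y[unlabeled]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
print("per-paralog coefficients:", model.coef_.round(2))
```

The fitted coefficients give one weight per paralog, so unlike a plain sum the model does not force every paralog to contribute equally. Whether this is appropriate depends entirely on having reliable labels for some cells.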