Hi
I have been given a large RNA-seq data set; one sample looks like this:
ENSG00000258486.2 1151554 1151554 597 79153.32269 78738.12898
ENSG00000265150.1 1089307 1089307 297 150505.7244 149716.2562
ENSG00000202198.1 996127 996128 331 123494.0095 122846.3529
I also have a case ID for each sample,
but I don't know what these IDs are, or which samples are normal and which are tumor, and there is no one to ask.
I have to reduce the features in the RNA-seq data and extract the most informative genes for integration with proteomics. In such cases people usually do differential expression analysis, but I don't know the class of each sample, so I can't see how to use DESeq2 or edgeR.
So, if you were me, how would you deal with this data? How would you extract the most informative features? Is it possible to do this at all without knowing the sample identities?
Thank you for any ideas.
I'd reject the data.
Agree with russhh, on principle.
I am asking myself the following:
If, genuinely, nobody knows the sample groups, then do the PCA bi-plot, as implied by Genomax, and send that back to whoever it is with whom you are working. If you want, also check the component loadings along PC1 and PC2 so that you can see which genes are the main source of variation along these [principal components]. Through this process, you may actually infer the sample groupings.
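For what it's worth, the PCA-plus-loadings idea could be sketched roughly like this. It is only a minimal NumPy sketch under stated assumptions, not anyone's actual pipeline: the function name `pca_with_loadings`, the CPM/log2 transform, the top-variance filter and the simulated counts (made-up gene names, two hidden groups of six samples) are all invented for illustration. On real data you would normally use a proper normalisation, such as a variance-stabilising transform, before the PCA.

```python
import numpy as np


def pca_with_loadings(counts, gene_ids, n_top_var=500, n_components=2, n_top_genes=10):
    """PCA on a genes x samples count matrix, plus the top-loading genes per PC."""
    counts = np.asarray(counts, dtype=float)

    # Crude library-size normalisation (counts per million), then log2.
    cpm = counts / counts.sum(axis=0, keepdims=True) * 1e6
    logged = np.log2(cpm + 1.0)

    # Keep only the most variable genes across samples.
    variances = logged.var(axis=1)
    keep = np.argsort(variances)[::-1][:n_top_var]
    sub = logged[keep]

    # Centre each gene, then put samples in rows for the SVD.
    X = (sub - sub.mean(axis=1, keepdims=True)).T
    U, S, Vt = np.linalg.svd(X, full_matrices=False)

    scores = U[:, :n_components] * S[:n_components]   # sample coordinates
    explained = (S ** 2) / np.sum(S ** 2)              # variance fraction per PC
    loadings = Vt[:n_components]                       # gene weights per PC

    top = {}
    for k in range(n_components):
        order = np.argsort(np.abs(loadings[k]))[::-1][:n_top_genes]
        top[f"PC{k + 1}"] = [(gene_ids[keep[i]], float(loadings[k, i])) for i in order]
    return scores, explained[:n_components], top


# Simulated data (an assumption, not the poster's data): 300 genes, 12 samples,
# where the first 15 genes are 8-fold higher in the last 6 samples.
rng = np.random.default_rng(0)
n_genes, n_samples = 300, 12
gene_ids = [f"GENE{i:04d}" for i in range(n_genes)]  # hypothetical gene names
base = rng.integers(50, 500, size=n_genes).astype(float)
lam = np.repeat(base[:, None], n_samples, axis=1)
lam[:15, 6:] *= 8.0
counts = rng.poisson(lam)

scores, explained, top = pca_with_loadings(counts, gene_ids, n_top_var=100)
print("Variance explained by PC1, PC2:", np.round(explained, 3))
print("PC1 scores per sample:", np.round(scores[:, 0], 2))
print("Top PC1 genes:", [gene for gene, _ in top["PC1"]])
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the usual PCA scatter. If the samples fall into clear clusters, the genes listed in `top["PC1"]` are the candidates driving that split, which may hint at the groupings, although a cluster alone does not tell you which group is tumor and which is normal.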
The problem is that the collaborator (the data owner) replies very slowly; I have already been waiting a month for an answer. That is why I should either extract informative features from this unlabelled RNA-seq data or find another RNA-seq data set on the internet that provides differentially expressed genes between carcinoma and matched normal samples.
So, the collaborator is the one who is disorganised and who messed up.
Em, can't you ask the person who gave you the data what the IDs are?