Question

How to extract meaningful information from FPKM of a subset of genes

10.2 years ago

GR ▴ 400

Hi All,

I am completely new to this kind of analysis. I have FPKM results for 2000 genes across 50 different individuals. A phylogenetic tree has already been built for these individuals.

I first created a heat map of FPKM for the 2000 genes, but the dataset is so large that I could not extract any meaningful information from it. I then tried to cluster the data to see whether the expression values of these individuals cluster together. Since I already have the phylogeny, I know which individuals are more closely related to each other. Should I expect more closely related individuals to show similar expression patterns? Which clustering method should I use (I use R for heatmaps)? I am new to clustering in R, so a piece of code would be very helpful.

I am not sure what other meaningful information I can extract for my genes from this data. Can anyone help me?

Thanks, RT

Phylogeny FPKM RNA-Seq • 2.5k views

updated 3.0 years ago by Ram 45k • written 10.2 years ago by GR ▴ 400

In general, you need to be careful not to oversimplify biological phenomena. We can only wish that gene expression similarities were as simple as comparing DNA at the genome level. When it comes to gene expression, there are networks and pathways: small changes may be magnified and manifest in radically different ways.

updated 3.0 years ago by Ram 45k • written 10.2 years ago by Istvan Albert 102k

Hi Istvan, I totally agree with you on this.

What if a few genes are expressed on one node and not on the other? I guess that would be meaningful information, but I don't know how to extract it from a 2000x50 matrix. Any ideas are welcome!

updated 3.0 years ago by Ram 45k • written 10.2 years ago by GR ▴ 400

Ram · Answer 1 · 2015-02-05

Clustering is not a useful method for "proving" hypotheses, only for generating them. That said, there are some practical issues. First, unless the 2000 genes were chosen carefully because they vary between samples, a good place to start with unsupervised clustering is to use only the top X% most variable genes (say, the top 5%). These genes vary the most across samples and are therefore the most informative for separating them. Second, you can choose among various distance metrics (see help(dist) in R) as well as linkage methods (see help(hclust) in R). You WILL get different results depending on your choice of genes, distance metric, and linkage method. It is up to you to determine the best way to define a test of your hypothesis that more distantly related members of your phylogenetic tree have more distinct expression patterns.
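The steps above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions, not a recommended pipeline: the matrix `fpkm` is simulated here as a stand-in for the real 2000 x 50 table, the log2(FPKM + 1) transform and the choice of k = 4 groups are assumptions added for the example, and the object names are hypothetical.

```r
# Simulated stand-in for the real data: 2000 genes (rows) x 50 individuals (columns).
# Replace this with your own FPKM matrix.
set.seed(1)
fpkm <- matrix(rexp(2000 * 50, rate = 0.1), nrow = 2000,
               dimnames = list(paste0("gene", 1:2000), paste0("ind", 1:50)))

# Assumption: log-transform with a pseudocount of 1 so that a handful of
# very highly expressed genes do not dominate the distances.
log_fpkm <- log2(fpkm + 1)

# Keep only the top 5% most variable genes.
top_fraction <- 0.05
gene_var <- apply(log_fpkm, 1, var)
n_top <- max(1, round(top_fraction * nrow(log_fpkm)))
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]
top <- log_fpkm[top_genes, , drop = FALSE]

# dist() computes distances between rows, so transpose to compare individuals.
d_euc <- dist(t(top), method = "euclidean")

# An alternative metric: 1 - Pearson correlation between individuals.
d_cor <- as.dist(1 - cor(top, method = "pearson"))

# Two different linkage methods; expect the resulting trees to differ.
hc_avg  <- hclust(d_euc, method = "average")
hc_comp <- hclust(d_cor, method = "complete")

# Cut one tree into k groups (k = 4 is arbitrary) to get cluster labels
# that can be compared against clades in the phylogeny.
groups <- cutree(hc_avg, k = 4)
```

Calling `plot(hc_avg)` draws the dendrogram, and `heatmap(top, Colv = as.dendrogram(hc_avg), scale = "row")` draws a heat map with the individuals ordered by that clustering. Rerunning with a different `top_fraction`, distance, or linkage shows how sensitive the grouping is to those choices. For the phylogeny comparison, one option, if the tree is loaded as an ape `phylo` object, is to take pairwise tree distances with `cophenetic()` and compare them with `as.matrix(d_euc)`, for example with a Mantel test.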