How To Visualize A Large Data Set (FPKM Of Genes, 80 Samples)
11.3 years ago
l0o0 ▴ 220

I have 80 RNA-seq data sets from different tissues under different treatments. After TopHat and Cufflinks, I retrieved each gene's FPKM from the genes.fpkm_tracking file produced by Cufflinks. There are about 20,000 genes in each sample.

I have no idea how to visualize the data in a clear, comprehensive way. I want to display these data in one graph. Any advice is appreciated! Thanks in advance!
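For readers following along, here is a minimal sketch of how the per-sample FPKM values described above might be collected into one genes x samples table. It assumes each Cufflinks genes.fpkm_tracking file is tab-separated with `tracking_id` and `FPKM` columns; the function names `read_fpkm` and `build_matrix` are made up for illustration, and filling genes missing from a sample with 0.0 is an assumption you may not want.

```python
import csv


def read_fpkm(handle):
    """Return {tracking_id: FPKM} from an open genes.fpkm_tracking file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["tracking_id"]: float(row["FPKM"]) for row in reader}


def build_matrix(per_sample):
    """Combine {sample_name: {gene_id: fpkm}} into a genes x samples table.

    Returns (genes, samples, rows), where rows[i][j] is the FPKM of
    genes[i] in samples[j]. Genes absent from a sample are set to 0.0
    (an assumption made for this sketch).
    """
    samples = sorted(per_sample)
    genes = sorted(set().union(*(d.keys() for d in per_sample.values())))
    rows = [[per_sample[s].get(g, 0.0) for s in samples] for g in genes]
    return genes, samples, rows
```

With 80 files you would call `read_fpkm` once per file (inside a `with open(path) as handle:` block) and pass the resulting dictionary of dictionaries to `build_matrix`.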

visualization • 8.9k views

What is it you want to visualize? Do you want to visualize how samples can be classified based on expression markers? Then try an unsupervised hierarchical clustering heatmap of the 500 genes with the most variance across all your samples. Do you want to visualize what sample groups there are in your data? Make a 3D scatter plot of the first 3 principal components (PCA). Or do you want to visualize the genes that are differentially expressed across the samples? Use these genes to make a clustering heatmap.
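A rough sketch of the first two suggestions, assuming the FPKM values are already in a genes x samples NumPy array. The function names, the log2(FPKM + 1) transform, and the pseudocount of 1 are choices made for this example, not something prescribed above; PCA is done here with a plain SVD on the centered data.

```python
import numpy as np


def top_variable_genes(fpkm, n=500, pseudocount=1.0):
    """Return row indices of the n genes with the highest variance.

    fpkm is a genes x samples array; variance is computed on
    log2(FPKM + pseudocount) so a few huge values do not dominate.
    """
    logged = np.log2(np.asarray(fpkm, dtype=float) + pseudocount)
    variances = logged.var(axis=1)
    order = np.argsort(variances)[::-1]  # most variable first
    return order[:n]


def pca_sample_scores(fpkm, n_components=3, pseudocount=1.0):
    """Return (scores, explained) for a PCA of the samples.

    scores is samples x n_components; explained holds the fraction of
    total variance captured by each returned component.
    """
    logged = np.log2(np.asarray(fpkm, dtype=float) + pseudocount)
    data = logged.T  # samples as rows, genes as columns
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]
```

The rows picked by `top_variable_genes` could then be handed to a clustered heatmap function (seaborn's `clustermap` is one option), and the three columns of `scores` plotted as a 3D scatter colored by tissue or treatment.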


Thank you for your reply! I just want to visualize the distribution of the FPKM values of 20,000 genes across the 80 samples. I will try a 3D scatter plot of sample, FPKM and gene ID.


Sounds like a heatmap would work for you.


If you want to see the distribution of FPKM values in each sample, make a histogram or 1D density plot of the logged (!) FPKM values of each sample. That way you will get a feel for the mean, median, variance, minimum and maximum FPKM values in each sample. To me, it sounds like you do not know exactly what you want from the data. I advise you to set your goals very clearly. Begin with very high-level goals and then define more specific subgoals.
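Something along these lines would give the per-sample numbers and the overlaid histograms. This is only a sketch: it assumes a genes x samples NumPy array, uses log2(FPKM + 1) as the log transform (the pseudocount is an assumption to avoid log of zero), and the function names are invented for the example. matplotlib is imported inside the plotting function so the summary part works without it.

```python
import numpy as np


def log_fpkm_summary(fpkm, pseudocount=1.0):
    """Per-sample mean, median, variance, min and max of log2(FPKM + pseudocount)."""
    logged = np.log2(np.asarray(fpkm, dtype=float) + pseudocount)
    return {
        "mean": logged.mean(axis=0),
        "median": np.median(logged, axis=0),
        "variance": logged.var(axis=0),
        "min": logged.min(axis=0),
        "max": logged.max(axis=0),
    }


def plot_log_fpkm_histograms(fpkm, sample_names, bins=50, pseudocount=1.0):
    """Overlay one outline histogram of logged FPKM per sample."""
    import matplotlib.pyplot as plt  # imported here so the summary works without it

    logged = np.log2(np.asarray(fpkm, dtype=float) + pseudocount)
    fig, ax = plt.subplots()
    for j, name in enumerate(sample_names):
        ax.hist(logged[:, j], bins=bins, histtype="step", density=True, label=name)
    ax.set_xlabel("log2(FPKM + pseudocount)")
    ax.set_ylabel("density")
    return fig
```

With 80 samples the legend gets crowded, so it may be easier to plot a handful of samples at a time and use the summary table to spot outliers first.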

10.0 years ago

Given that you are working with tissues, the heatmap approach may work well. Here is an example of a visualization I did for expression data for human fetal and adult tissues, for a set of genes of interest:

< image not found >

Here, we show the relative FPKM values for different fetal tissue expression data for BCL6. Expression is relatively enriched in thymus, but there is signal elsewhere, also.

As another example, here is a heatmap of expression data for CEBPA:

< image not found >

The expression data shown here suggest more tissue specificity.

The methodology condenses expression data for a set of tissues from various timepoints. You can explore other genes at the Gene Expression Atlas here: https://expressionatlas.org


Whoa, that's cool. Where does the data come from? The (clean and pretty) website doesn't have the details about data generation. I guess it's RNA-Seq but where did you find a human fetus !? stamlab.org also doesn't have specifics...


It is RNA-seq data. Kyle has a publication in review. When there's a citation, I'll add it to this post.

11.3 years ago
seidel 11k

You have 80 data sets with 20,000 genes each, so you want to visualize 1.6 million data points. Consider that this may be more data than there are pixels on most computer screens, and assess the value of seeing all of the data in a single plot, especially if much of the data is unchanged between samples. Often, a first step is examining the data for variance in gene expression across your conditions and performing some kind of data reduction. What percentage of genes are relatively unchanged across the 80 samples? What fraction of genes is changed in any particular sample? If you have genes as rows and conditions as columns, you might find that some conditions contribute large numbers of gene expression changes, while others contribute very little. How you plot the data will depend on the questions you want to bring to it, but if you can apply a filter to weed out rows with little variance in gene expression, you could shrink your 20,000 genes down to something that would fit into a heat map (I would say 80 conditions by 1,000 genes or fewer is reasonable). Other than that, you might first tackle some summary stats to characterize which samples are contributing which properties to your matrix, or try something like PCA.
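One possible way to put numbers on those two questions, sketched under some assumptions: the data are in a genes x samples NumPy array, and a gene counts as "changed" in a sample when its log2(FPKM + 1) value is more than one log2 unit (roughly 2-fold) away from that gene's median across samples. Both the threshold and the function name `change_summary` are illustrative choices, not an established definition.

```python
import numpy as np


def change_summary(fpkm, log2_fold=1.0, pseudocount=1.0):
    """Return (fraction_unchanged_genes, fraction_changed_per_sample).

    A gene is called changed in a sample when its logged value differs
    from the gene's median across samples by more than log2_fold.
    """
    logged = np.log2(np.asarray(fpkm, dtype=float) + pseudocount)
    gene_median = np.median(logged, axis=1, keepdims=True)
    changed = np.abs(logged - gene_median) > log2_fold
    unchanged_genes = ~changed.any(axis=1)  # genes never changed in any sample
    return unchanged_genes.mean(), changed.mean(axis=0)
```

The first value says how much of the matrix could be dropped before plotting; the second is one number per condition and shows which conditions contribute most of the changes.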


Thank you for your reply. 1.6 million sounds terrible. I will do an expression test to reduce the number.

11.3 years ago

You could bin the FPKM values and plot the KDE curve (smoothed histogram) of each sample. 80 samples (curves) may result in a lot of overplotting, so it may be useful to combine the samples into groups (treatment/tissue) and draw one curve per group. This is a simple way to check whether the distributions differ between groups. Another option is to use boxplots.
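A small sketch of the one-curve-per-group idea, assuming a genes x samples NumPy array and one group label per sample. The Gaussian kernel, the bandwidth rule (standard deviation times n^(-1/5)), the log2(FPKM + 1) transform, and the function names are all choices made for this example; a library KDE would work just as well.

```python
import numpy as np


def gaussian_kde_curve(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of values on grid.

    If bandwidth is None it is set to std * n**(-1/5). The values must
    not all be identical, otherwise the bandwidth would be zero.
    """
    values = np.asarray(values, dtype=float)
    if bandwidth is None:
        bandwidth = values.std(ddof=1) * values.size ** (-1.0 / 5.0)
    norm = 1.0 / (bandwidth * np.sqrt(2.0 * np.pi))
    # Loop over grid points to avoid building a huge grid x values array.
    return np.array(
        [norm * np.exp(-0.5 * ((x - values) / bandwidth) ** 2).mean() for x in grid]
    )


def group_density_curves(fpkm, groups, grid, pseudocount=1.0):
    """Return {group_label: density curve} with samples pooled per group."""
    logged = np.log2(np.asarray(fpkm, dtype=float) + pseudocount)
    groups = np.asarray(groups)
    curves = {}
    for label in np.unique(groups):
        pooled = logged[:, groups == label].ravel()
        curves[str(label)] = gaussian_kde_curve(pooled, grid)
    return curves
```

Each returned curve can be drawn against `grid` with an ordinary line plot, one line per tissue or treatment group, which avoids the 80-curve overplotting problem.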

A few different examples of these types of graphs are shown in the GENCODE paper.


Yeah! The 80 samples can be divided into groups. I will give it a try! Thank you for your reply.
