Hi, i recently did RNA-Seq on many plant tissue samples for two genotypes in a growth chamber. Since this is quite a big experiment i wanted to visualize the data first before analyzing it. As part of it, i started clustering analysis and i used hierarchal clustering to make a dendrogram plot. By the first look of it, it seems that that some samples are out of place and i am not sure this is because of of how the clustering algorithm works or my mistakes in sampling and barcoding of samples. Can you please share your opinions.
First you need to rule out data quality issue. Check FastQC metrics, number of reads in sample and % mapped which could tell you if there is any contamination/problems with sequencing itself. Then check RNA-Seq quality statistics such as coverage, etc (you can use Picard for this), which should tell about how the library prep procedure went.
Then, if everything is ok, try to dig deep into data. Build your dendrogram using more robust measures, e.g. use Spearman correlation instead of Pearson. Also check for outliers by building sample-vs-sample scatterplots.
I don't think it is a data quality issue. The reads were dealt with strict quality control measures. I know for sure that there are barcodes that got mixed up and if that is the case then i would see the samples got swapped but i don't see anything like that. I redid clustering this time after log transformation and i what i saw is completely different. I don't know what do now?
Well to me this means that there are some sort of outliers. I would suspect smth like rRNA, but you told that there was a quality control for this (still it would be great to look at original FASTQ report).. What you can do is to build an average profile of expression by all samples, take your two SLIQUE outlier samples and make a scatter plot to identify genes that are those outliers.
Sounds good. I already have normalized counts for all the samples. So it would be pretty simple to do the scatter plot and identify outliers. Will do that. Do you recommend clustering after log transformation of normalized counts or do you don't think log transformation matters. Also do you know what other ways of transformation that i can play for clustering these samples?
Original FPKM distribution is quite skewed, so you'll want to normalize it to make it closer to normal distribution. Log-transform is the most straightforward way to do it. When you perform clustering, in R at least, Pearson correlation is selected as a default metric. Pearson correlation is most informative for normally distributed variables, i.e. gene expression profiles sampled from a bivariate normal distribution. You can evade data transformation if you use non-parametric measure, such as Spearman correlation, which operates on ranks and is more robust to outliers.
You need to drill down and look at your data in more detail (ie. at the level of gene expression), and figure out what exactly about the expression profile is making these things cluster the way that they do. There's something funny, but you can't tell what it is from just the dendrogram. It does seem likely that there was a mix-up or failure at some step of the protocol for some of the samples.
ADD COMMENT
• link
updated 5.0 years ago by
Ram
44k
•
written 10.6 years ago by
Adrian
▴
700
0
Entering edit mode
I like your idea of going all the way to gene expression and see what is going on. I am currently doing that and will keep you updated.
be aware that hierarchical clustering is a class of methods, not a single one. Which linkage method (single, complete, average, ward...) did you use? Single-linkage, for instance is well-known to produce spurious results in microarray data (see here, for instance). Another point that was raised is w.r.t. the distance measure... If you suspect outliers, use more robust ones, like jackknife or spearman, instead of pearson.
Regarding the placement of the samples, well... petiole and internode seem to be mixed... you seem to have two silique samples that can be regarded as outliers and that maybe deserve more analysis... Before being too concerned about the placement of the samples, however, i would check the linkage type of the clustering algorithm, it can change quite a lot the results...
Have you checked k-means? You have the "desired" number of clusters... apply it and see if the results are more consistent...
Thanks Slash....Well I used the default "complete" clustering method for hierarchical clustering and so I should be ok there. One thing I don't really understand is when I used edge normalized non transformed counts I got a pattern that indicated couple of "leaf" samples got swapped with "petiole" samples (http://tinypic.com/r/28i01hk/8). However when I log transformed then I do get a slightly different pattern (http://tinypic.com/r/fwk84y/8). I don't really know which one to believe now.
Yes I did used k-means clustering to resolve this and here are the clusters and here are the k-means clusters with non log transformed counts. Seems like cluster 3 is real messy one with all 3 tissues falling in it. I even tried increasing the number of clusters but cannot resolve any more of that cluster. Don't know what do you do now.
yeah, your cluster 3 is weird (for the class labels you are setting)... any chance of measuring error? Moreover, k-means is not deterministic, how many runs have you tried? I'm not familiar with the problem (I'm familiar with clustering, not your particular application), but, is any other class structure that can be expected?
Clustering is exploratory analysis, so, it does not seem quite right to me that you're fixing what you expect beforehand... If you already know what clusters you want and you have the labels, why bother with clustering? My point is, if you are using clustering you should give it a try with the results you're getting and not only try to fit them to what you expect... Does that make sense to you?
While genotyping, i found that there are some samples that got mixed up and then i used edgeR MDS plot to cluster them and rename the samples. After doing that i now wanted to see if the clustering that worked for edgeR MDS holds for other methods like hierarchal and k-means clustering but unfortunately this is not the case. May i should go back to my original samples (non renamed samples) and then try to cluster them using hierarchal and k-means and see if i can fix the issue. What do you think?
Ok, now i got it. That's an option, i guess. If you are going to run k-means, be sure to run it multiple times and select as final solution the one with minimum error (sse) - only if you run it for the same number of clusters (sse decreases with more clusters), as i understand you are going to do. Good luck!
Ah now I see what your problem was. But I believe that in this case genotyping should have the final word. There are lots of papers that turned to be false-positives as the species and cells used in them were not what the authors claimed and this was shown by genotyping. Btw maybe you can try to use sequence information from RNA-Seq to re-confirm your labeling? Anyways this is quite an unpleasant situation, good luck!
I don't think it is a data quality issue. The reads were dealt with strict quality control measures. I know for sure that there are barcodes that got mixed up and if that is the case then i would see the samples got swapped but i don't see anything like that. I redid clustering this time after log transformation and i what i saw is completely different. I don't know what do now?
http://tinypic.com/r/fwk84y/8
Any idea why i see different pattern now?
Well to me this means that there are some sort of outliers. I would suspect smth like rRNA, but you told that there was a quality control for this (still it would be great to look at original FASTQ report).. What you can do is to build an average profile of expression by all samples, take your two SLIQUE outlier samples and make a scatter plot to identify genes that are those outliers.
Sounds good. I already have normalized counts for all the samples. So it would be pretty simple to do the scatter plot and identify outliers. Will do that. Do you recommend clustering after log transformation of normalized counts or do you don't think log transformation matters. Also do you know what other ways of transformation that i can play for clustering these samples?
Original FPKM distribution is quite skewed, so you'll want to normalize it to make it closer to normal distribution. Log-transform is the most straightforward way to do it. When you perform clustering, in R at least, Pearson correlation is selected as a default metric. Pearson correlation is most informative for normally distributed variables, i.e. gene expression profiles sampled from a bivariate normal distribution. You can evade data transformation if you use non-parametric measure, such as Spearman correlation, which operates on ranks and is more robust to outliers.
That's very informative. Thanks.