Any ideas about clustering histograms (coverage plots)?
5.5 years ago
boaty ▴ 220

Hi guys,

I generated more than 10,000 transcript coverage histograms and I want to separate them into two categories.

Group 1: there is a global increase in read number but no local coverage shift.
original plot: Example3

plot after rescaling the height: Eg2

Group 2: there is a shift in read coverage.

original plot: Eg0

rescaled plot: Eg1

The blue and orange plots are two different samples.

Yes, I know this is very simple for human eyes, but if we want a machine to do the job, I have no idea where to start. Does anyone have an idea about this type of problem?

Thanks a lot

RNA-Seq transcript coverage • 1.6k views

Are you trying to classify each transcript individually, with orange being Treatment A and blue being Treatment B? And how do you plan on handling genes with no change in coverage between your two samples? Are they already filtered out?

For the examples above you could think about a 5':3' ratio of the change in read coverage. In the first graph the ratio would be ~1, since there is a consistent change in coverage, whereas in the second graph it would be >>1 since the difference in blue reads is much higher at the 5' end. Maybe take the reads at the first and last 10% of the gene body, or something like that?
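As a rough illustration of the 5':3' idea, here is a minimal Python sketch. It reads "ratio of the change" as the fold change between samples in the first window divided by the fold change in the last window, which is only one possible interpretation. The function name `end_ratio`, the 10% window (`frac`) and the pseudocount (`pseudo`) are assumptions for illustration, not something specified in this thread.

```python
def end_ratio(cov_a, cov_b, frac=0.10, pseudo=1.0):
    """Fold change (sample A / sample B) at the 5' end divided by the
    fold change at the 3' end.

    cov_a, cov_b: per-position (or per-bin) coverage of ONE transcript in the
    two samples, ordered 5' -> 3'. `frac` and `pseudo` are illustrative
    choices, not recommended values.
    """
    if len(cov_a) != len(cov_b):
        raise ValueError("coverage vectors must have the same length")
    # number of positions in each end window (at least one)
    n = max(1, int(len(cov_a) * frac))

    def fold_change(a, b):
        # the pseudocount avoids division by zero in uncovered windows
        return (sum(a) + pseudo) / (sum(b) + pseudo)

    five_prime = fold_change(cov_a[:n], cov_b[:n])
    three_prime = fold_change(cov_a[-n:], cov_b[-n:])
    return five_prime / three_prime
```

A value close to 1 would suggest a uniform change along the transcript, while a value far from 1 in either direction would suggest a shift towards one end.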


thanks,

Yes, for every transcript, and the unchanged transcripts have already been filtered out with DESeq2 based on counts. I am thinking about scanning the full length of every selected transcript; analysing only the first 10% would lose a lot of data, no?


"Yes, I know this is very simple for human eyes, but if we want a machine to do the job"

Are you envisioning doing this as an image or general profile comparison? Those profiles are not on the same scale, so while it may be easy for a human to discern a pattern, it is not so easy for a machine. You also mention clustering, so do you want to do multiple comparisons as well?


Thank you genomax,

I am open to any algorithm or method. My goal is to separate those transcript coverage profiles, but this is new to me and I need guidance.

5.5 years ago

Convert all histograms from frequency to density (divide each bin by the histogram's total so that it sums to one); you can then measure the distance between the two samples' histograms using either dynamic time warping or the Wasserstein distance. Based on your examples, group 2 is expected to produce much higher distances than group 1, so the histogram of the distances is expected to have two modes, and a good cut-off to separate the groups should be a value in the valley between them.
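The approach above could be sketched in plain Python roughly as follows. This is only a minimal sketch under stated assumptions: both samples are binned on the same positions with unit-width bins, the function names (`to_density`, `wasserstein_1d`, `dtw_distance`) are made up for illustration, and the DTW shown is the textbook unconstrained version with an absolute-difference cost.

```python
from itertools import accumulate


def to_density(counts):
    """Rescale a frequency histogram so that its bins sum to one."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram has no reads")
    return [c / total for c in counts]


def wasserstein_1d(p, q):
    """1-Wasserstein distance between two densities on the same unit-width
    bins: the summed absolute difference of their cumulative sums."""
    if len(p) != len(q):
        raise ValueError("densities must share the same bins")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def dtw_distance(p, q):
    """Textbook dynamic time warping, O(len(p) * len(q)) time,
    keeping only the previous row of the cost matrix."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(q)
    for a in p:
        curr = [inf]
        for j, b in enumerate(q, start=1):
            cost = abs(a - b)
            # extend the cheapest of: step in p only, step in q only, diagonal
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]
```

Computing one distance per transcript and plotting the histogram of those distances would then show whether the two modes are actually there. Note that the Wasserstein value here is in units of bins, so if transcripts are binned to different lengths, dividing by the number of bins (an assumption, not something stated above) would make the values more comparable.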


Thank you very much. Dynamic time warping and the Wasserstein distance are exactly what I was looking for, and converting the frequency histograms to probability densities is just genius. Thanks again.

5.5 years ago

Here is what I would do.

First I would calculate the total difference between the two curves (the sum of the point-by-point differences).

Let's call this X.

X <- sum(curve_1 - curve_2)

Then I would normalise each curve so that the area under it is one, by dividing each curve by the sum of all of its y values. Next I would calculate the absolute difference between the normalised curves at each point and sum these. Let's call this Y:

Y <- sum( abs( curve_1/sum(curve_1) - curve_2/sum(curve_2) ) )

Do this for each transcript, then plot all the X's against all the Y's. If there are two groups (one with changed levels but the same pattern, and one with changed patterns but the same levels), they should show up as separate clusters on this plot.
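For anyone working in Python rather than R, the two quantities above could be computed per transcript with something like the sketch below. The function name `level_and_shape_change` is hypothetical, and the code assumes both curves have the same length and non-zero totals.

```python
def level_and_shape_change(curve_1, curve_2):
    """Return (X, Y) for one transcript.

    X: summed raw difference between the curves (change in overall level).
    Y: summed absolute difference after scaling each curve to sum to one
       (change in shape); it ranges from 0 to 2.
    """
    if len(curve_1) != len(curve_2):
        raise ValueError("curves must have the same length")
    total_1, total_2 = sum(curve_1), sum(curve_2)
    if total_1 == 0 or total_2 == 0:
        raise ValueError("each curve needs at least one read")
    x = sum(a - b for a, b in zip(curve_1, curve_2))
    y = sum(abs(a / total_1 - b / total_2) for a, b in zip(curve_1, curve_2))
    return x, y
```

Calling this once per transcript gives the list of (X, Y) points for the scatter plot.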
