Question

RNA-Seq time-series analysis among disease samples - analysis strategy advice

6.0 years ago

lu.ne ▴ 70

Hi,

I've been trying to perform some time-series analysis (identification of genes with non-constant expression over time, clustering of genes given their trend over time...) of RNA-Seq data (counts obtained from featureCounts) for a group of 200 patients with arthritis, all sampled at 5 different time points (from diagnosis and up to 2 years after that).

I've come across many packages/tools (I focused on R and Python solutions) but most of them seem focused on differential expression analysis between two conditions or more, which is not what I'm looking for. I was wondering if anyone came across the same problem and what worked to address it?

I tried several R packages, on all genes (>50,000) and on subsets, in particular EBSeqHMM and maSigPro, since they seemed able to handle this kind of design, but I failed to obtain results: EBSeqHMM seems to have trouble with this many replicates, and maSigPro returns no significant genes. I also considered fitting a linear model to each gene (something like gene ~ time + patient_id) and clustering the genes based on the fitted models, but I am unsure whether this is a good way to go.

Recommendations would be greatly appreciated.

Thank you

lu.ne

RNA-Seq time-series • 3.2k views

updated 6.0 years ago by Kristoffer Vitting-Seerup ★ 4.1k • written 6.0 years ago by lu.ne ▴ 70

Hi,

I have the same issue with EBSeqHMM. Did you find any solution to fix the problem related to the number of replicates?

5.9 years ago by Akos ▴ 20

Hi Akos, I have not found anything, I'm afraid (probably because the tool was not intended for that kind of situation).

5.9 years ago by lu.ne ▴ 70

Hi, thank you for the quick response. I tried EBSeqHMM with different inputs. It works with 5 time points and triplicates per time point. It does not work with 5 time points where the first time point has 18 replicates and the others have 30. It did work with 4 time points and 32 replicates per time point.

5.9 years ago by Akos ▴ 20

score 2 · Answer 1 · 2018-12-03

As you suggest, I usually build a linear model (~time + patient_id + batch_factor) for each gene, making sure that time point 0 (t0) is the reference level of the time factor, so that it is absorbed into the intercept. I then use an F-test (ANOVA-style) to extract genes with a significant change at any of the time points (vs t0). Such an approach is easily done with the R package limma; remember to use voom when you prepare the data. Limma is extremely efficient, so running this number of samples is easy, and the F-test on many time points is described in section 9.6.2 of the vignette. Afterwards I usually, as you suggest, cluster the log2FC vs t0, typically via PAM clustering or with Mfuzz. Mfuzz can also be used without doing the DE analysis first.
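To make the idea concrete, here is a minimal sketch of the per-gene test in plain Python. It is not limma/voom: it is an ordinary two-way ANOVA F-test for `expr ~ time + patient` on one gene, without limma's empirical Bayes moderation and without a batch term, and it assumes a complete, balanced design (every patient measured at every time point) with expression already on a log2 scale. The function names and the toy numbers are made up for illustration.

```python
from statistics import mean


def time_f_statistic(expr):
    """F statistic for the time effect in the model expr ~ time + patient.

    expr[p][t] is the log2 expression of one gene for patient p at time
    point t. Every patient must have a value at every time point.
    Returns (F, df_time, df_residual).
    """
    n_pat, n_time = len(expr), len(expr[0])
    grand = mean(v for row in expr for v in row)
    time_means = [mean(row[t] for row in expr) for t in range(n_time)]
    pat_means = [mean(row) for row in expr]

    # Split the total variation into time, patient and residual parts.
    ss_total = sum((v - grand) ** 2 for row in expr for v in row)
    ss_time = n_pat * sum((m - grand) ** 2 for m in time_means)
    ss_pat = n_time * sum((m - grand) ** 2 for m in pat_means)
    ss_resid = ss_total - ss_time - ss_pat

    df_time = n_time - 1
    df_resid = (n_time - 1) * (n_pat - 1)
    f_stat = (ss_time / df_time) / (ss_resid / df_resid)
    return f_stat, df_time, df_resid


def log2fc_vs_t0(expr):
    """Mean within-patient change from the first time point (t0)."""
    n_time = len(expr[0])
    return [mean(row[t] - row[0] for row in expr) for t in range(n_time)]


# Toy data: 3 patients x 3 time points, one list per patient.
# Patients differ a lot in baseline; only `trend` changes over time.
trend = [[5.0, 6.1, 6.9], [7.0, 7.9, 9.1], [3.0, 4.0, 5.0]]
flat = [[5.0, 5.1, 4.9], [7.0, 6.9, 7.1], [3.0, 3.1, 2.9]]

f_trend, df1, df2 = time_f_statistic(trend)
f_flat, _, _ = time_f_statistic(flat)
profile = log2fc_vs_t0(trend)  # one value per time point, t0 first
```

Because the patient term soaks up the baseline differences between patients, `trend` gets a much larger F statistic than `flat` even though the patients differ more from each other than the time points do. The vector returned by `log2fc_vs_t0` is the kind of per-gene profile that would then go into the clustering step. For real count data, limma with voom remains the better choice, since it models the mean-variance trend and handles unbalanced designs.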

Hope this helps. Kristoffer

score 1 · Answer 2 · 2018-12-02

6.0 years ago

enxxx23 ▴ 280

It would be great to have more info here, for example:
- number of replicates,
- info about the time points (e.g. are there time points before and after treatment?),
- type of disease: cancer or non-cancer,
- tissue of origin of the samples (are they all from the same tissue?),
- whether both healthy and diseased tissue samples are available,
- etc.

I would say that the biggest challenge here is the biological variation between patients (for example, patient 1 may be very different from patients 2 and 3 even though they have the same disease; patient 1 might be a male with blue eyes and blood type O, and patient 2 a female with brown eyes and blood type AB), which could drown out the signal you are looking for. So I am not surprised that nothing showed up in your results.

6.0 years ago by enxxx23 ▴ 280

Sure, sorry if this was not clear, I'll edit accordingly. The replicates are actually the 200 patients; the samples are whole blood collected from patients with arthritis (when they were 'diagnosed' and then four more times, 6 months apart). Healthy controls are available, but only at a single time point, so I did not use them.

I assumed the lack of results could have been because of the way I was running the analyses but what you say does make perfect sense, I suppose I should have expected that.

Thanks for your input.

6.0 years ago by lu.ne ▴ 70