Question

Forum: For R workshop, need bio equivalent of 'diamonds' or 'nycflights13' data.

6

10.4 years ago

Stephen 2.8k

I teach an R workshop at my university that's targeted toward researchers with little background in stats or computing. I'm working on expanding this into a series that will include (1) an intro to R, (2) an advanced data manipulation workshop (read: dplyr and tidyr), and (3) an advanced data visualization workshop (read: ggplot2).

In the past I've used the diamonds dataset for ggplot2 examples and the nycflights13 dataset for showing off dplyr. What I'd really like to do is find some data that will resonate with biomedical researchers, and that is big (10,000+ rows) and complex enough to motivate exploring with dplyr and ggplot2: namely, some continuous measures that may correlate or behave differently depending on the level of other factor variables in the data. Something like drug trial by cell line data, or some other kind of clinical measurements by cancer type, etc. Anyone have any pointers?

Thanks

dplyr workshop R • 3.7k views

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Stephen 2.8k

Ram · Answer 1 · 2015-01-23

4

10.4 years ago

Jashapiro ▴ 230

I've had some luck with datasets available at the UC Irvine Machine Learning repository, which has some nice organismal measurement data sets, though not as much clinical data with very large numbers of data points.

The life sciences data is at http://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=life&numAtt=&numIns=&type=&sort=nameUp&view=table

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Jashapiro ▴ 230

0

Thanks, this is helpful. After limiting by multivariate, matrix, and mixed data types, and at least 1000 samples, it looks like the covertype dataset might be a good candidate. I'll have to look further. Thanks again.

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Stephen 2.8k

Ram · Answer 2 · 2015-01-23

3

10.4 years ago

Michael Schubert ★ 7.1k

How about some data from the TCGA?

They have text tables on http://gdac.broadinstitute.org/, e.g. breast cancer clinical data.

Note: be sure to check the TCGA guidelines for what you can use the data for.

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Michael Schubert ★ 7.1k

0

The TCGA Level 3 data are RNA-Seq expression values quantified for thousands of transcripts. You can download ten replicates for each of three kinds of cancer and mash them up into a nice 20K by 30 matrix.
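A rough sketch of what that mash-up could look like in base R. Everything here is simulated: the values are random numbers, not TCGA data, and the three cancer codes, the gene names, and the sample naming scheme are just assumptions for illustration. With real downloads you would read each file in place of the simulated matrices.

```r
# Simulated stand-in for expression data (NOT real TCGA values).
set.seed(1)

n_genes <- 20000                      # roughly "20K" transcripts
cancers <- c("BRCA", "LUAD", "COAD")  # three cancer types (assumed choice)
n_reps  <- 10                         # ten samples per cancer type

# One 20000 x 10 matrix per cancer type, as if read from three downloads.
per_cancer <- lapply(cancers, function(cancer) {
  m <- matrix(rlnorm(n_genes * n_reps, meanlog = 5, sdlog = 1),
              nrow = n_genes, ncol = n_reps)
  rownames(m) <- sprintf("gene%05d", seq_len(n_genes))     # hypothetical gene IDs
  colnames(m) <- sprintf("%s_%02d", cancer, seq_len(n_reps))
  m
})

# The "mash up": bind the matrices column-wise into one 20K x 30 matrix.
expr <- do.call(cbind, per_cancer)

# Long format, one row per gene/sample pair.
# as.vector() walks the matrix column by column, so genes cycle fastest.
expr_long <- data.frame(
  gene       = rep(rownames(expr), times = ncol(expr)),
  sample     = rep(colnames(expr), each = nrow(expr)),
  cancer     = rep(sub("_.*$", "", colnames(expr)), each = nrow(expr)),
  expression = as.vector(expr),
  stringsAsFactors = FALSE
)
```

The wide matrix is the 20K by 30 shape described above. The long data frame has a factor-like column (cancer) next to a continuous one (expression), which is the layout that grouping with dplyr or faceting with ggplot2 works on.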

updated 3.2 years ago by Ram 45k • written 10.4 years ago by karl.stamm 4.1k

Ram · Answer 3 · 2015-01-24

3

10.4 years ago

Sean Davis 27k

There is a section in my little "Intro To R" page using a biomaRt query. See the section on "Data Exploration Exercises". The .Rmd file contains the answers to the exercises.

http://watson.nci.nih.gov/~sdavis/tutorials/IntroToR/

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Sean Davis 27k