Forum:For R workshop, need bio equivalent of 'diamonds' or 'nycflights13' data.
9.8 years ago
Stephen 2.8k

I teach an R workshop at my university that's targeted toward researchers with little background in stats or computing. I'm working on expanding this to create a series that will include (1) an intro to R, (2) an advanced data manipulation workshop (read: dplyr and tidyr), and (3) an advanced data visualization workshop (read: ggplot2).

In the past I've used the diamonds dataset for ggplot2 examples and the nycflights13 dataset for showing off dplyr. What I'd really like is a dataset that will resonate with biomedical researchers, one that's big (10,000+ rows) and complex enough to motivate exploring with dplyr and ggplot2: namely, some continuous measures that may correlate or behave differently depending on the levels of other factor variables in the data. Something like drug trial by cell line data, or some other kind of clinical measurements by cancer type, etc. Anyone have any pointers?

Thanks

dplyr workshop R • 3.4k views
9.8 years ago
Jashapiro ▴ 230

I've had some luck with datasets available at the UC Irvine Machine Learning repository, which has some nice organismal measurement data sets, though not as much clinical data with very large numbers of data points.

The life sciences data is at http://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=life&numAtt=&numIns=&type=&sort=nameUp&view=table


Thanks, this is helpful. After filtering to multivariate, matrix, and mixed data types with at least 1000 samples, it looks like the covertype dataset might be a good candidate. I'll have to look further. Thanks again.

9.8 years ago

How about some data from the TCGA?

They have text tables on http://gdac.broadinstitute.org/, e.g. breast cancer clinical data.

Note: be sure to check the TCGA guidelines for what you can use the data for.


The TCGA Level 3 data are RNA-Seq expression values quantified for thousands of transcripts. You can download ten replicates for each of three kinds of cancer and mash them up into a nice 20K by 30 matrix.
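For anyone who wants to prototype the workshop material before downloading anything, here is a rough sketch of that "20K by 30" mash-up in base R. Everything in it is an assumption for illustration: the expression values are simulated with `rlnorm()` rather than taken from TCGA, the three cancer labels (`BRCA`, `LUAD`, `COAD`) are just placeholders for whichever three types you pick, and the object names (`expr`, `expr_long`) are made up. The long-format step at the end is one possible way to get the matrix into a shape that dplyr and ggplot2 can group by cancer type.

```r
# Simulated stand-in for TCGA Level 3 expression data (NOT real TCGA values).
set.seed(42)

n_genes <- 20000                       # roughly "20K" transcripts
cancers <- c("BRCA", "LUAD", "COAD")   # placeholder labels for three cancer types
n_reps  <- 10                          # ten samples per cancer type

# One matrix per cancer type: transcripts in rows, samples in columns.
per_cancer <- lapply(cancers, function(cancer) {
  m <- matrix(rlnorm(n_genes * n_reps, meanlog = 5, sdlog = 2),
              nrow = n_genes, ncol = n_reps)
  colnames(m) <- sprintf("%s_%02d", cancer, seq_len(n_reps))
  m
})

# Column-bind the three pieces into a single 20,000 x 30 matrix.
expr <- do.call(cbind, per_cancer)
rownames(expr) <- sprintf("gene%05d", seq_len(n_genes))

# Reshape to long form: one row per (gene, sample) pair.
# as.vector() unrolls the matrix column by column, so gene names cycle
# within each sample and each sample name repeats once per gene.
expr_long <- data.frame(
  gene       = rep(rownames(expr), times = ncol(expr)),
  sample     = rep(colnames(expr), each = nrow(expr)),
  expression = as.vector(expr),
  stringsAsFactors = FALSE
)

# Recover the cancer type from the sample name (strip the "_NN" suffix).
expr_long$cancer <- sub("_[0-9]+$", "", expr_long$sample)

dim(expr)               # genes x samples
table(expr_long$cancer) # rows per cancer type in the long table
```

With real data you would replace the `rlnorm()` call with whatever you read in from the downloaded tables; the binding and reshaping steps should carry over as long as the rows are in the same transcript order in every file.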

9.8 years ago

There is a section in my little "Intro To R" page using a biomaRt query. See the section on "Data Exploration Exercises". The .Rmd file contains the answers to the exercises.

http://watson.nci.nih.gov/~sdavis/tutorials/IntroToR/
