How to work with big data from ICGC
3.9 years ago
kinalimeric ▴ 40

Hi all,

I have downloaded 56 GB of data from ICGC and I am having trouble analyzing it in R (although I have access to a server). I know that ICGC data is stored in GDC. However, for analyzing big data in R, I see suggestions like using SQL through dplyr/R.

Do you know of any way to do that? I would also like to know how you work with big data from databases such as ICGC and TCGA when you need to use R.

I downloaded the methylation data for 269 donors (meth_array.PACA-AU.tsv.gz) from here using wget:

https://dcc.icgc.org/releases/current/Projects/PACA-AU

I want to compare the mean methylation level of selected cg probes across patients.

I would appreciate any help!

icgc bigdata sql R • 1.3k views
3.9 years ago
zx8754 12k

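You don't need the full dataset in R. Subset before importing, for example using data.table::fread + bash (not tested):

library(data.table)

# use `gzip -dc meth_array.PACA-AU.tsv.gz | head` to find out which columns we need
# here we keep columns 4 and 9 and filter on the probe ID
# 4=icgc_sample_id
# 6=probe_id
# 9=methylation_value

# read only columns 4 and 9 for probe cg00000029
# grep -w matches the probe ID as a whole word anywhere in the line;
# the header line is filtered out, so column names are set explicitly
x <- fread(cmd = "gzip -dc meth_array.PACA-AU.tsv.gz | grep -w 'cg00000029' | cut -f4,9",
           col.names = c("icgc_sample_id", "methylation_value"))

# then aggregate as usual
x[ , .(myMean = mean(methylation_value)), by = icgc_sample_id]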
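
Since the question asks about several selected probes, the same pre-filtering idea can be extended to a list of probe IDs. The following is only a sketch (not tested on the real file): `filter_probes` and `probes.txt` are hypothetical names, and it assumes the header contains columns called icgc_sample_id, probe_id and methylation_value, which it looks up by name rather than by position.

```shell
#!/bin/sh
# Hypothetical helper: filter_probes FILE.tsv.gz PROBES.txt
# PROBES.txt holds one probe ID per line (e.g. cg00000029).
# Prints a small TSV with icgc_sample_id, probe_id, methylation_value
# for the listed probes only. Column positions are found from the header,
# so they do not have to be hard-coded.
filter_probes() {
  gzip -dc "$1" | awk -F'\t' -v OFS='\t' -v probes="$2" '
    BEGIN {
      # load the wanted probe IDs into a lookup table
      while ((getline line < probes) > 0) keep[line] = 1
      close(probes)
    }
    NR == 1 {
      # map header names to column numbers
      for (i = 1; i <= NF; i++) col[$i] = i
      s = col["icgc_sample_id"]; p = col["probe_id"]; m = col["methylation_value"]
      if (!s || !p || !m) { print "expected column missing" > "/dev/stderr"; exit 1 }
      print "icgc_sample_id", "probe_id", "methylation_value"
      next
    }
    ($p in keep) { print $s, $p, $m }
  '
}

# Example call (writes a file small enough to read into R):
# filter_probes meth_array.PACA-AU.tsv.gz probes.txt > selected_probes.tsv
```

The resulting selected_probes.tsv keeps its header, so it can be read with a plain fread("selected_probes.tsv") and aggregated by icgc_sample_id (or by probe_id) as above.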

Thank you so much, this is really helpful and clear.
