Duplicates in ICGC expression data
4.2 years ago
jack.henry ▴ 50

I am trying to read in some expression data from the ICGC, but I am having trouble with duplicates.

First, I read in the data:

PACACASeq <- read.table("./CountMatrices/PACA_CA/exp_seq.tsv", sep = '\t', header = TRUE, stringsAsFactors = FALSE)

This gives a table with counts, sample IDs and gene IDs:

[Image: preview of the resulting table]

I then use reshape2 to try to convert this into a count matrix like so:

PACACASeqCounts <- dcast(PACACASeq, gene_id ~ icgc_sample_id, value.var = "raw_read_count")

But this generates the message

Aggregation function missing: defaulting to length

This results from some sample ID/gene ID combinations appearing more than once. I end up with a matrix of 1's rather than read counts.

I was wondering whether anyone has run into the same problem and how they solved it.
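
A minimal sketch of what is happening, using a made-up toy table rather than the real ICGC file (the sample IDs, gene IDs and values below are invented for illustration). When a gene/sample pair occurs on more than one row, dcast has several values for a single cell and, with no fun.aggregate given, falls back to counting rows. The sketch lists the duplicated pairs and then passes an explicit aggregation function. Using sum here is only an assumption: whether duplicates should be summed, averaged or filtered depends on why they are in the file, which should be checked first.

```r
library(reshape2)

# Toy long-format table standing in for exp_seq.tsv (values are made up)
toy <- data.frame(
  icgc_sample_id = c("S1", "S1", "S1", "S2", "S2"),
  gene_id        = c("A",  "A",  "B",  "A",  "B"),
  raw_read_count = c(10,   5,    3,    7,    2),
  stringsAsFactors = FALSE
)

# Show every row whose (gene, sample) pair occurs more than once
key  <- toy[, c("gene_id", "icgc_sample_id")]
dups <- toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]
print(dups)

# No fun.aggregate: dcast reports "defaulting to length" and each cell
# holds the number of rows for that gene/sample pair, not a read count
row_counts <- dcast(toy, gene_id ~ icgc_sample_id,
                    value.var = "raw_read_count")

# Explicit aggregation (sum is an assumption, see the note above)
counts <- dcast(toy, gene_id ~ icgc_sample_id,
                value.var = "raw_read_count", fun.aggregate = sum)
print(counts)
```

Inspecting dups on the real data before choosing an aggregation function shows whether the repeated rows are exact copies or carry different counts.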

Thanks in advance.

RNA-Seq ICGC • 841 views

Hi Jack,

We are working with the same data and have run into exactly the same problem. Did you solve it? If so, could you tell us how?

Thank you very much in advance.

Best regards,

Sergio.
