Question

How to extract some probes from a file that contains duplicated row names

8.8 years ago

zizigolu ★ 4.4k

Hi,

I have a list of IDs and I want to extract their expression profiles from my normalized file, but I get an error:

mycounts <- read.table("NormData.txt", header = T, sep = "\t")

rownames(mycounts) <- mycounts[ , 1]

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘0610007L01Rik’, ‘0610007P08Rik’, ‘0610008F07Rik’, ‘0610010F05Rik’, ‘0610010K06Rik’, ‘0610010K14Rik’, ‘0610011L14Rik’, ‘0610030E20Rik’, ‘1-Mar’, ‘1-Sep’, ‘10-Mar’, ‘11-Mar’, ‘11-Sep’, ‘1100001G20Rik’, ‘1110002E22Rik’, ‘1110003E01Rik’, ‘1110006E14Rik’, ‘1110007A13Rik’, ‘1110008L16Rik’, ‘1110017D15Rik’, ‘1110021J02Rik’, ‘1110028C15Rik’, ‘1110034B05Rik’, ‘1110034G24Rik’, ‘1110037F02Rik’, ‘1110049F12Rik’, ‘1110051M20Rik’, ‘1110057K04Rik’, ‘1110059G10Rik’, ‘1190002A17Rik’, ‘1190002N15Rik’, ‘1190003J15Rik’, ‘1190007F08Rik’, ‘12-Sep’, ‘1200014J11Rik’, ‘1300001I01Rik’, ‘1300010F03Rik’, ‘14-Sep’, ‘1500002O20Rik’, ‘1500003O03Rik’, ‘1500011B03Rik’, ‘1500011K16Rik’, ‘1500012K07Rik’, ‘1600012F09Rik’, ‘1600014C10Rik’, ‘1600029D21Rik’, ‘1700001L05Rik’, ‘1700003M02Rik’, ‘1700007K09Rik’, ‘1700008A04Rik’, ‘1700008J07Rik’, ‘1700008O03Rik’, ‘1700008P20Rik’, ‘1700009P17Rik’, ‘1700010I14Rik’, ‘1700011F14Rik’, ‘1700011H22Rik’, ‘1700012B07Rik’, ‘1700013N18Rik’, ‘170 [... truncated]

I tried

names <- read.table("names.txt", header = T, sep = "\t")  # my row name file

names <- c(names)

df = data.frame(as.matrix(mycounts))

rownames(df) = make.names(names, unique=TRUE)

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 

  invalid 'row.names' length

What should I do, please?

Thank you

R software error • 3.1k views

ADD COMMENT • link updated 8.8 years ago by Noushin N ▴ 620 • written 8.8 years ago by zizigolu ★ 4.4k

score 1 · Answer 1 · 2016-10-26

8.8 years ago

Noushin N ▴ 620

One solution, if it is acceptable for your analysis, is to collapse the data frame across the multiple occurrences of the same value in the first column by taking the average, median, etc. of the value columns with the ddply function (with summarize) from the plyr package, e.g.

library(plyr)
mycounts.unique = ddply(mycounts, .(V1), summarize, V2 = mean(V2))
rownames(mycounts.unique) = mycounts.unique$V1

[assuming that the first two column names are V1 and V2]

After this, the values in the first column would be unique, and thus it will be possible to assign them to row names.
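
When the file has more than one sample column, the same idea can be applied to every numeric column at once. Here is a minimal base-R sketch, assuming the first column holds the (duplicated) names and all other columns are numeric. The toy data frame, the column name GeneName and the ids vector are made up for illustration and are not taken from the actual file:

```r
# Toy stand-in for mycounts: the first column has duplicated names (hypothetical data)
mycounts <- data.frame(
  GeneName = c("GeneA", "GeneB", "GeneA", "GeneC", "GeneB"),
  S1 = c(1, 2, 3, 4, 6),
  S2 = c(10, 20, 30, 40, 60),
  stringsAsFactors = FALSE
)

# Average every sample column within each name (median would work the same way)
collapsed <- aggregate(mycounts[, -1, drop = FALSE],
                       by = list(GeneName = mycounts[[1]]),
                       FUN = mean)

# The names are now unique, so they can become row names
rownames(collapsed) <- collapsed$GeneName
collapsed$GeneName <- NULL

# Extract the profiles for a list of IDs, keeping only the IDs that are present
ids <- c("GeneB", "GeneA", "GeneX")
selected <- collapsed[intersect(ids, rownames(collapsed)), , drop = FALSE]
print(selected)
```

Using intersect() avoids rows of NA for IDs that are missing from the table. Whether averaging is appropriate depends on why the names are duplicated in the first place.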

ADD COMMENT • link 8.8 years ago by Noushin N ▴ 620

Thanks, dear Noushin :) :) :)

ADD REPLY • link 8.8 years ago by zizigolu ★ 4.4k

@Angel: How did you get duplicates in the first place and did they have identical values?

ADD REPLY • link 8.8 years ago by GenoMax 152k

Thank you, @genomax2.

I normalized the Agilent data set GSE50833 following this tutorial:

http://matticklab.com/index.php?title=Single_channel_analysis_of_Agilent_microarray_data_with_Limma

After removing columns such as Start, Sequence, ProbeUID, ControlType, ProbeName, GeneName, SystematicName and Description, the output file looked like this:

substanceBXH F2_2 F2_3 F2_14 F2_15 F2_19 F2_20 F2_23 F2_24 F2_26 F2_37 F2_42 F2_43

A_30_P01033363 4.920370044 5.128868456 5.088803534 4.327204286 5.420323311 4.832380887 4.172456375 4.599314468 4.804687463 4.758797644 5.421726358 5.159465474

A_55_P1965358 6.673461411 6.559541943 6.691603173 6.84222391 7.057431615 6.728350624 6.625561924 6.503003246 6.342636712 6.480291151 6.830651816 7.000356243

A_66_P122433 3.925835915 3.671287045 4.756575578 3.827644007 4.706803712 3.207884127 3.447130951 3.852825598 2.499938067 3.543076474 4.525543409 4.068809295

The values were not duplicated, only the row names. When I tried to extract 2000 DEGs I got an error about duplication. I could not solve the error in R, so I removed the duplicated rows in Excel, regardless of whether they were differentially expressed or not :(

ADD REPLY • link 8.8 years ago by zizigolu ★ 4.4k

That sounds odd. So you got duplicate rows (identical gene/probe names) with different values after normalization?

ADD REPLY • link 8.8 years ago by GenoMax 152k

Yes. I only removed the unnecessary columns and kept the GeneName column, but when I checked in Excel I noticed many duplicated row names.

https://i.imgsafe.org/12487f4206.png

ADD REPLY • link 8.8 years ago by zizigolu ★ 4.4k

It is hard to see but which name are you referring to (GeneName or SystematicName)? It does not look like you have duplicate rows from that image. Should you not be using one of these names rather than the probeID?

ADD REPLY • link 8.8 years ago by GenoMax 152k

Thank you,

I used GeneName. I also checked SystematicName, which had the most duplicates. I magnified the Excel file. The probe ID looks something like a sequence.

https://i.imgsafe.org/127caa47f3.png

ADD REPLY • link 8.8 years ago by zizigolu ★ 4.4k

That is not correct. The A_* values are the Agilent probe IDs. The GeneNames were in the next column. It looks like you did not parse/import the file correctly (or the header in your file is messed up). Please go back and correct this.

ADD REPLY • link 8.8 years ago by GenoMax 152k

I agree that it is good practice to look into why you have duplicate values in the column in the first place. Are they technical duplicates, such as multiple probes with the same sequence repeated across several positions on the array for QC, or something else? Or were they generated by some processing step, possibly a merge call?
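
A quick way to look at the duplicates before deciding what to do with them is to pull out every row whose name occurs more than once and compare the values side by side. Here is a sketch on made-up data (the column names GeneName and S1 are assumptions, not those of the actual file):

```r
# Hypothetical example table with repeated names in the first column
mycounts <- data.frame(
  GeneName = c("GeneA", "GeneB", "GeneA", "GeneC", "GeneB"),
  S1 = c(1, 2, 1, 4, 6),
  stringsAsFactors = FALSE
)

# Names that occur more than once
dup_names <- unique(mycounts$GeneName[duplicated(mycounts$GeneName)])

# All rows carrying those names, sorted so repeats sit next to each other
dups <- mycounts[mycounts$GeneName %in% dup_names, ]
dups <- dups[order(dups$GeneName), ]
print(dups)

# How many times each duplicated name occurs
print(table(dups$GeneName))

# Rows that are duplicated in full (name and values), i.e. exact copies
exact_copies <- mycounts[duplicated(mycounts), ]
print(exact_copies)
```

If exact_copies accounts for all of the repeats, dropping them is harmless. If the repeated names carry different values, they are more likely distinct probes for the same gene, and collapsing or keeping the probe IDs as identifiers is a decision to make deliberately.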

ADD REPLY • link 8.8 years ago by Noushin N ▴ 620

Thank you, Noushin, my compatriot :)

I only normalized GSE50833 with this tutorial: http://matticklab.com/index.php?title=Single_channel_analysis_of_Agilent_microarray_data_with_Limma

and I found many duplicates in each column :)

ADD REPLY • link 8.8 years ago by zizigolu ★ 4.4k