df

Question

Reshape tabular data frame to wide format

1

Entering edit mode

7.6 years ago

friasoler ▴ 50

Hello everyone! I would appreciate some help with this data frame 'df':

df

Marker Sample allele
FlaHDF11    CP26    h
FlaHDF11    CP26    a
FlaHDF12    CP26    e
FlaHDF12    CP26    f
FlaHDF11    CP27    g
FlaHDF11    CP27    h
FlaHDF12    CP27    t
FlaHDF12    CP27    z
I would like something tlike this:
    FlaHDF11    FlaHDF11    FlaHDF12    FlaHDF12
CP26    h   a   e   f
CP27    g   h   t   z

#This is the code Im using and I suspect the problem is in the line (6). Im always expecting to have just two rows for the same sample, but the code is optimized to add extra columns in case some samples have more than 2 alleles.

a = read.table('df', stringsAsFactors=F,header=T)
a = a[-1,]
a1 = a[!duplicated(a[,1:2]),]
rownames(a1)  = paste(a1$Marker,a1$Sample)
t=1;
while(sum(duplicated(a[,1:2])) >0)   #### I think the problem is here
{   
    a1 = cbind(a1,rep(NA, dim(a1)[[1]]))
    a = a[duplicated(a[,1:2]),]
    a2 = a[!duplicated(a[,1:2]),]
    a1[paste(a2$Marker,a2$Sample),t+3] = a2$Size
    t = t+1
}
 colnames(a1) = c(colnames(a), paste('Size',1:(t-1),sep='.') )
a2 = a1[!is.na(a1$Size.2),]
m = matrix(NA,length(unique(a1$Sample)), length(unique(a1$Marker))*t)
rownames(m) = unique(a1$Sample)
colnames(m) = paste(rep(unique(a1$Marker), each=t), rep(1:4,length(unique(a1$Marker))), sep='.')
for (i in rownames(m))
{
    tt = a1[a1$Sample %in% i,]
    m[ i, paste(rep(tt$Marker,each=4), rep(1:4, dim(tt)[[1]]), sep='.')] = as.vector(t(tt[,3:6])) 
}
 m = m[,colSumsis.na(m)) < dim(m)[[1]]]

#### Error in [<-.data.frame(*tmp*, paste(a2$Marker, a2$Sample), t + 3, : replacement has length zero

Kind regards! Roberto

R • 2.4k views

ADD COMMENT • link updated 7.6 years ago by cmdcolin ★ 4.0k • written 7.6 years ago by friasoler ▴ 50

0

Entering edit mode

Just in case this is a "XY problem" (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) it could be worth contextualizing why you want to reformat your data this way

ADD REPLY • link 7.6 years ago by cmdcolin ★ 4.0k

0

Entering edit mode

It is also unclear why you have multiple columns named the same thing in your "desired output". that complicates any potential code solution as it doesn't really make sense why the same thing would have two different columns

ADD REPLY • link 7.6 years ago by cmdcolin ★ 4.0k

0

Entering edit mode

It has sense...the columns are loci...each one with two alleles by sample ... Thanks

ADD REPLY • link 7.6 years ago by friasoler ▴ 50

0

Entering edit mode

See my answer that tries to explicitly codify this assumption

ADD REPLY • link 7.6 years ago by cmdcolin ★ 4.0k

1

Entering edit mode

7.6 years ago

igor 13k

You don't need to do it manually. Try the tidyr spread() function: http://tidyr.tidyverse.org/reference/spread.html

ADD COMMENT • link 7.6 years ago by igor 13k

0

Entering edit mode

7.6 years ago

Joe 21k

melt or tidyr is usually used to long format dataframes

ADD COMMENT • link 7.6 years ago by Joe 21k

1

Entering edit mode

melt comes from the reshape2 package, but that converts to long format. the question is about converting to wide format. in that case, if using reshape2, you'd most likely use the acast function

ADD REPLY • link 7.6 years ago by cmdcolin ★ 4.0k

0

Entering edit mode

Ah yes sorry misread that bit.

ADD REPLY • link 7.6 years ago by Joe 21k

0

Entering edit mode

Thanks a lot Healey!

ADD REPLY • link 7.6 years ago by friasoler ▴ 50

score 3 · Accepted Answer · 2017-04-24

If we go ahead and "codify the assumption that there are two alleles" by renaming the marker column, then you can get the desired output similar to this

library(reshape2)
x=read.table(text='Marker Sample allele
       FlaHDF11    CP26    h
       FlaHDF11    CP26    a
       FlaHDF12    CP26    e
       FlaHDF12    CP26    f
       FlaHDF11    CP27    g
       FlaHDF11    CP27    h
       FlaHDF12    CP27    t
       FlaHDF12    CP27    z',header=T,stringsAsFactors=F)

x[seq(1,nrow(x),by=2),]$Marker=paste0(x[seq(1,nrow(x),by=2),]$Marker,'a')
x[seq(2,nrow(x),by=2),]$Marker=paste0(x[seq(2,nrow(x),by=2),]$Marker,'b')

acast(x,Sample~Marker)




#Output
#FlaHDF11a FlaHDF11b FlaHDF12a FlaHDF12b
#CP26 "h"       "a"       "e"       "f"      
#CP27 "g"       "h"       "t"       "z"