Question

Collapse rows in a data.frame but add new column for multiple values

2

Entering edit mode

7.2 years ago

john.horne ▴ 20

I have a data.frame where I need to collapse rows by sample names in the indiv.ID column. Here is an excerpt:

     indiv.ID  86912632 86920881 86922082 86927699
1  Alxis_3702      CTGA     <NA>     <NA>     <NA>
2  Alxis_3702      TCTG     <NA>     <NA>     <NA>
3  Alxis_3702      <NA>        G     <NA>     <NA>
4  Alxis_3702      <NA>     <NA>        C     <NA>
5  Alxis_3702      <NA>     <NA>     <NA>     <NA>
6  Alxis_3702      <NA>     <NA>     <NA>     <NA>
7  Alxis_3702      <NA>     <NA>     <NA>     <NA>
8  Alxis_3702      <NA>     <NA>     <NA>     <NA>
9  Alxis_3702      <NA>     <NA>     <NA>     <NA>
10 Alxis_3702      <NA>     <NA>     <NA>     <NA>

In theory it should be easy to merge all rows with the same indiv.ID, but in the case of rows 1 and 2, I also need a new column (because this locus is heterozygous), so that each row only has one individual, like so:

  indiv.ID  86912632 86912632.1 86920881 86922082 86927699
Alxis_3702    CTGA         TCTG        G        C     <NA>

What is the best way to do this?

R • 5.4k views

ADD COMMENT • link updated 7.2 years ago by zx8754 12k • written 7.2 years ago by john.horne ▴ 20

0

Entering edit mode

df1=summaryBy(X86912632+ X86920881+X86922082+X86927699 ~ indiv.ID  , data=test, FUN= function (x) unique(x))
names(df1)=make.unique(gsub("\\.[A-Z]{3}[1-9]","",names(df1)))
df1
> df1
    indiv.ID X86912632 X86912632.1 X86920881 X86920881.1 X86922082 X86922082.1
1 Alxis_3702      CTGA        TCTG      <NA>        <NA>         G        <NA>
  X86927699 X86927699.1
1         C        <NA>

ADD REPLY • link 7.2 years ago by cpad0112 21k

score 5 · Accepted Answer · 2018-04-27

5

Entering edit mode

7.2 years ago

zx8754 12k

We can gather then spread, see example (I changed the input data to demonstrate that this can be used for data with multiple indiv.IDs):

# example data, with 2 IDs
df1 <- read.table(text = "     indiv.ID  86912632 86920881 86922082 86927699
                  1  Alxis_3702      CTGA     NA     NA     NA
                  2  Alxis_3702      TCTG     NA     NA     NA
                  3  Alxis_3702      NA        G     NA     NA
                  4  Alxis_3702      NA     NA        C     NA
                  5  Alxis_3702      NA     NA     NA     NA
                  6  Alxis_3702      NA     NA     NA     NA
                  7  Alxis_3703      AA     NA     NA     NA
                  8  Alxis_3703      NA     TT     CC     NA
                  9  Alxis_3703      NA     NA     NA     NA
                  10 Alxis_3703      NA     NA     NA     GG")



library(dplyr)
library(tidyr)

gather(df1, key = "k", value = "v", -indiv.ID) %>% 
  filter(!is.na(v)) %>% 
  group_by( indiv.ID) %>% 
  mutate(k = make.unique(k)) %>% 
  spread(key = k, value = v) %>% 
  ungroup()

# # A tibble: 2 x 6
#   indiv.ID   X86912632 X86912632.1 X86920881 X86922082 X86927699
#   <fct>      <chr>     <chr>       <chr>     <chr>     <chr>    
# 1 Alxis_3702 CTGA      TCTG        G         C         NA       
# 2 Alxis_3703 AA        NA          TT        CC        GG

ADD COMMENT • link 7.2 years ago by zx8754 12k

0

Entering edit mode

Nice one ! Care Biostars ate your parenthesis group_byindiv.ID) %>%

ADD REPLY • link 7.2 years ago by Bastien Hervé 6.4k

0

Entering edit mode

Thank you. But it is there, when I "edit" the post, can't seem to make it show up, any idea how I can add it?

ADD REPLY • link 7.2 years ago by zx8754 12k

0

Entering edit mode

OK, adding space after "(" worked. Thank you for letting me know.

ADD REPLY • link 7.2 years ago by zx8754 12k

0

Entering edit mode

I like your thinking, but I detect a problem when there are multiple individuals.

When you group by indiv.ID you wind up with more than two identical keys:

    indiv.ID           k                v
      <fctr>        <chr>         <chr>
1 Alxis_3702 X86912632  CTGA
2 Alxis_3702 X86912632  TCTG
3 Alxis_3703 X86912632    AA
4 Alxis_3702 X86920881     G
5 Alxis_3703 X86920881    TT
6 Alxis_3702 X86922082     C
7 Alxis_3703 X86922082    CC
8 Alxis_3703 X86927699    GG
Warning message:
attributes are not identical across measure variables;
they will be dropped

If you make each k unique, you wind up with more columns than you want. I can't think of an elegant way around this.

ADD REPLY • link 7.2 years ago by john.horne ▴ 20