Question

Gene manipulation in R

2.3 years ago

aj123 ▴ 120

Hi,

I have a table like below:

patient    geneid    base   count
"ptp_1"    "BRCA1"   "C"    123
"ptp_1"    "BRCA1"   "G"    2
"ptp_1"    "BRCA1"   "T"    55
"ptp_2"    "BRCA2"   "A"    303
"ptp_2"    "BRCA2"   "C"    11
"ptp_2"    "BRCA2"   "G"    1

How can I generate a wide data.frame that has one row per {patient x gene} and one column for each base's count?

For example:

 participant   gene       A_count    C_count     G_count     T_count
 "ptp_1"       "BRCA1"     <values>
 "ptp_1"       "BRCA2"
 "ptp_2"       "BRCA1"
 "ptp_2"       "BRCA2"

I tried the following in dplyr but am not getting the exact result:

clean_df_mut_counts_wide <- clean_df_mut_counts %>%
  filter(base == "A") %>%
  group_by(participant) %>%
  group_by(gene) %>%
  summarise(A_count = sum(as.factor(base == "A")))

R • 1.6k views

ADD COMMENT • link updated 2.3 years ago by Basti ★ 2.0k • written 2.3 years ago by aj123 ▴ 120

score 2 · Accepted Answer · 2022-07-22

2.3 years ago

Basti ★ 2.0k

Using tidyr:

library(tidyr)

# one new column per distinct value of `base`, filled with its `count`
clean_df_mut_counts %>%
  pivot_wider(
    names_from = base,
    values_from = count
  )
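A self-contained sketch of the pivot above, using the rows from the question. The column names `patient` and `gene`, and the `names_glue`, `values_fill` and `names_sort` settings, are assumptions added here to match the column layout the question asks for; they are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example rows copied from the question; the column names are assumed.
clean_df_mut_counts <- tribble(
  ~patient, ~gene,   ~base, ~count,
  "ptp_1",  "BRCA1", "C",   123,
  "ptp_1",  "BRCA1", "G",   2,
  "ptp_1",  "BRCA1", "T",   55,
  "ptp_2",  "BRCA2", "A",   303,
  "ptp_2",  "BRCA2", "C",   11,
  "ptp_2",  "BRCA2", "G",   1
)

wide_counts <- clean_df_mut_counts %>%
  pivot_wider(
    names_from  = base,
    values_from = count,
    names_glue  = "{base}_count",  # yields A_count, C_count, G_count, T_count
    values_fill = 0,               # bases never observed get 0 instead of NA
    names_sort  = TRUE             # order the new columns alphabetically
  )

print(wide_counts)
```

Only patient/gene combinations that occur in the input get a row. If rows for unobserved combinations are also wanted, `tidyr::complete()` is one way to add them.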

ADD COMMENT • link 2.3 years ago by Basti ★ 2.0k

Thanks! I'm trying to calculate the base frequencies and the second most frequent base like this, but it gives me a table without the patient and gene columns:

clean_df_mut_counts_wide %>%
  group_by(A, T, C, G) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  top_n(n = 2)

ADD REPLY • link 2.3 years ago by aj123 ▴ 120

I don't see which frequency you would like in the output. Could you give an example?

ADD REPLY • link 2.3 years ago by Basti ★ 2.0k

patient   gene      A_count   C_count   G_count   T_count   A_freq   C_freq   G_freq   T_freq
"ptp_1"   "BRCA1"   20        345       777       123
"ptp_1"   "BRCA2"   30        33        320       43
"ptp_2"   "BRCA1"   400       203       76        56
"ptp_2"   "BRCA2"   82        100       0         102

I would like the frequencies of the bases as laid out above, and also the second most frequently occurring base in each patient. I hope this clarifies. Thank you.

ADD REPLY • link 2.3 years ago by aj123 ▴ 120

I was able to achieve the above with the following, after pivoting:

with_freq <- wide %>%
  group_by(A, T, C, G) %>%
  # summarise(n = n()) %>%
  mutate(A_freq = A / sum(A + C + G + T), C_freq = C / sum(A + C + G + T),
         G_freq = G / sum(A + C + G + T), T_freq = T / sum(A + C + G + T))
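A note on the snippet above: grouping by the count columns gives per-row frequencies only as long as no two rows share identical counts, because `sum()` adds over every row in a group. A possible variant without grouping is sketched below. The `wide` table is a hypothetical stand-in built from the example counts posted earlier in the thread, and the `total` helper column is an assumption added for readability.

```r
library(dplyr)

# Hypothetical wide table, using the example counts from the thread.
wide <- tribble(
  ~patient, ~gene,   ~A,  ~C,  ~G,  ~T,
  "ptp_1",  "BRCA1", 20,  345, 777, 123,
  "ptp_1",  "BRCA2", 30,  33,  320, 43,
  "ptp_2",  "BRCA1", 400, 203, 76,  56,
  "ptp_2",  "BRCA2", 82,  100, 0,   102
)

with_freq <- wide %>%
  mutate(
    total  = A + C + G + T,  # vectorised, so this is a per-row total
    A_freq = A / total,
    C_freq = C / total,
    G_freq = G / total,
    T_freq = T / total
  )

print(with_freq)
```

Because no `group_by()` or `summarise()` is involved, the patient and gene columns are carried through unchanged.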

I'm still not sure how to find the second most frequently occurring base per patient. I tried the following, but it does not work:

with_freq_2nd_highest <- with_freq %>% slice(2)

ADD REPLY • link 2.3 years ago by aj123 ▴ 120

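# maxn(n) returns a function that gives the position of the n-th largest value of x
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
# columns 3:6 of `wide` are assumed to hold the four base counts
max2 <- colnames(wide[, 3:6])[apply(wide[, 3:6], 1, maxn(2))]
with_freq$max2 <- max2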
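The snippet above picks the second-largest count within each row, that is, per patient x gene. If "per patient" is meant to pool all genes of a patient, one possible dplyr/tidyr sketch is shown below. This reading of the question is an assumption, and `wide` is again a hypothetical table built from the example counts in the thread.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table, using the example counts from the thread.
wide <- tribble(
  ~patient, ~gene,   ~A,  ~C,  ~G,  ~T,
  "ptp_1",  "BRCA1", 20,  345, 777, 123,
  "ptp_1",  "BRCA2", 30,  33,  320, 43,
  "ptp_2",  "BRCA1", 400, 203, 76,  56,
  "ptp_2",  "BRCA2", 82,  100, 0,   102
)

second_base <- wide %>%
  # back to long format: one row per patient x gene x base
  pivot_longer(c(A, C, G, T), names_to = "base", values_to = "count") %>%
  # pool the genes of each patient
  group_by(patient, base) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  # rank bases within each patient and keep the second one
  group_by(patient) %>%
  arrange(desc(count), .by_group = TRUE) %>%
  slice(2) %>%
  ungroup()

print(second_base)
```

With these example counts the second most frequent base is C for both patients (378 for ptp_1, 303 for ptp_2). Ties are not handled specially here; `slice(2)` simply takes whichever row ends up second after sorting.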

ADD REPLY • link 2.3 years ago by Basti ★ 2.0k