Redundant gene list simplification
6.7 years ago
lessismore ★ 1.4k

I need to convert this format:

TMCS09g1008676  fleshy  0.000234939
TMCS09g1008676  fleshy  1.38379E-05
TMCS09g1008676  fleshy  0.00331883
TMCS09g1008677  fleshy  0.0481578
TMCS09g1008678  fleshy  0.0350491
TMCS09g1008679  fleshy  0.0335639
TMCS09g1008680  fleshy  0.0167087
TMCS09g1008681  fleshy  0.00301089
TMCS09g1008682  fleshy  0.00519838
TMCS09g1008682  fleshy  0.0399833
TMCS09g1008682  fleshy  0.0122184
TMCS09g1008683  fleshy  0.00202427
TMCS09g1008683  fleshy  0.00199513
TMCS09g1008683  fleshy  0.0350491
TMCS09g1008683  fleshy  0.00331883
TMCS09g1008683  fleshy  0.0399833

to this:

TMCS09g1008676  0.000234939 1.38379E-05 0.00331883      
TMCS09g1008677  0.0481578               
TMCS09g1008678  0.0350491               
TMCS09g1008679  0.0335639               
TMCS09g1008680  0.0167087               
TMCS09g1008681  0.00301089              
TMCS09g1008682  0.00519838  0.0399833   0.0122184       
TMCS09g1008683  0.00202427  0.00199513  0.0350491   0.00331883  0.0399833

I would very much appreciate some pointers on how to do this with awk or R.


Hi lessismore,

There is no need to delete questions, especially not when someone has helped you solve them.

Cheers,
Wouter

6.7 years ago

Not really a bioinformatics question; it is better suited to Stack Overflow, where it has also been cross-posted. Anyway, look at R's aggregate() function, something along the lines of:

aggregate(data[,-c(1,2)], by = list(data$V1), function(x) {x})
6.7 years ago

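test.txt contains the OP's data. datamash sorts the input (-s), groups on column 1 (-g 1) and collapses column 3 into a comma-separated list; sed then turns the commas into spaces.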

$ datamash -s -g 1 collapse 3 < test.txt  | sed 's/,/ /g'

TMCS09g1008676  0.000234939 1.38379E-05 0.00331883
TMCS09g1008677  0.0481578
TMCS09g1008678  0.0350491
TMCS09g1008679  0.0335639
TMCS09g1008680  0.0167087
TMCS09g1008681  0.00301089
TMCS09g1008682  0.00519838 0.0399833 0.0122184
TMCS09g1008683  0.00202427 0.00199513 0.0350491 0.00331883 0.0399833
6.7 years ago
zx8754 12k

Using dplyr and tidyr: group by gene, number the rows within each group, then spread to wide format:

# example data
df1 <- read.table(text ="
TMCS09g1008676  fleshy  0.000234939
TMCS09g1008676  fleshy  1.38379E-05
TMCS09g1008676  fleshy  0.00331883
TMCS09g1008677  fleshy  0.0481578
TMCS09g1008678  fleshy  0.0350491
TMCS09g1008679  fleshy  0.0335639
TMCS09g1008680  fleshy  0.0167087
TMCS09g1008681  fleshy  0.00301089
TMCS09g1008682  fleshy  0.00519838
TMCS09g1008682  fleshy  0.0399833
TMCS09g1008682  fleshy  0.0122184
TMCS09g1008683  fleshy  0.00202427
TMCS09g1008683  fleshy  0.00199513
TMCS09g1008683  fleshy  0.0350491
TMCS09g1008683  fleshy  0.00331883
TMCS09g1008683  fleshy  0.0399833", stringsAsFactors = FALSE)


library(dplyr)
library(tidyr)

res <- df1[, -2] %>% 
  group_by(V1) %>% 
  mutate(rn = row_number()) %>% 
  spread(key = "rn", value = "V3") %>% 
  data.frame()


res
#               V1          X1          X2         X3         X4        X5
# 1 TMCS09g1008676 0.000234939 1.38379e-05 0.00331883         NA        NA
# 2 TMCS09g1008677 0.048157800          NA         NA         NA        NA
# 3 TMCS09g1008678 0.035049100          NA         NA         NA        NA
# 4 TMCS09g1008679 0.033563900          NA         NA         NA        NA
# 5 TMCS09g1008680 0.016708700          NA         NA         NA        NA
# 6 TMCS09g1008681 0.003010890          NA         NA         NA        NA
# 7 TMCS09g1008682 0.005198380 3.99833e-02 0.01221840         NA        NA
# 8 TMCS09g1008683 0.002024270 1.99513e-03 0.03504910 0.00331883 0.0399833
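
For anyone who wants the same rectangular, NA-padded layout without R, here is a rough awk sketch. It assumes TAB-delimited input with the value in column 3; the file names wide_demo.txt and wide_demo_out.txt are made up for the example, and only a few rows from the question are used. Unlike the R output above, awk keeps the values exactly as they were written in the input.

```shell
# Hypothetical demo input: four rows taken from the question, TAB-delimited.
printf '%s\tfleshy\t%s\n' \
    TMCS09g1008676 0.000234939 \
    TMCS09g1008676 1.38379E-05 \
    TMCS09g1008676 0.00331883 \
    TMCS09g1008677 0.0481578 > wide_demo.txt

awk 'BEGIN { FS = OFS = "\t" }
{
    if (!($1 in count)) order[++n] = $1   # remember first-seen gene order
    k = ++count[$1]                       # position of this value within its gene
    val[$1, k] = $3
    if (k > max) max = k                  # widest gene decides the column count
}
END {
    for (i = 1; i <= n; i++) {
        g = order[i]
        line = g
        # pad shorter genes with NA so every row has the same number of columns
        for (j = 1; j <= max; j++)
            line = line OFS (j <= count[g] ? val[g, j] : "NA")
        print line
    }
}' wide_demo.txt > wide_demo_out.txt

cat wide_demo_out.txt
```

The whole table is held in memory until the END block, which should be fine for gene lists of this size.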
6.7 years ago
EagleEye 7.6k

AWK solution


First extract only required columns:

cut -f1,3 YOUR_FILE.txt > YOUR_NEW_FILE.txt

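Then group the repeated entries. Assuming your file is TAB-delimited, the following command joins the second-column values of each repeated entry, keeping the entries in order of first appearance. The separator used here is a space (" "); you can change it as you wish. Note that each output line ends with a trailing separator.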

awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 " " } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' YOUR_NEW_FILE.txt > YOUR_FINAL_OUTPUT.txt
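
If you would rather skip the intermediate file, the two steps can probably be folded into one awk pass that reads columns 1 and 3 directly. This is only a sketch under the same TAB-delimited assumption; grouped_demo.txt and grouped_demo_out.txt are made-up file names, and the input is a handful of rows from the question. It also adds the separator only between values, so there is no trailing space.

```shell
# Hypothetical demo input: a few rows from the question, TAB-delimited.
printf '%s\tfleshy\t%s\n' \
    TMCS09g1008676 0.000234939 \
    TMCS09g1008676 1.38379E-05 \
    TMCS09g1008676 0.00331883 \
    TMCS09g1008677 0.0481578 \
    TMCS09g1008682 0.00519838 \
    TMCS09g1008682 0.0399833 \
    TMCS09g1008682 0.0122184 > grouped_demo.txt

awk 'BEGIN { FS = OFS = "\t" }
{
    if (!($1 in vals)) {        # first time this gene is seen
        order[++n] = $1         # remember input order
        vals[$1] = $3
    } else {
        vals[$1] = vals[$1] " " $3   # separator only between values
    }
}
END { for (i = 1; i <= n; i++) print order[i], vals[order[i]] }' \
    grouped_demo.txt > grouped_demo_out.txt

cat grouped_demo_out.txt
```

Because the genes are printed in order of first appearance, the input does not need to be sorted beforehand.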