manipulating a file by R or bash?
4
0
Entering edit mode
6.1 years ago
Za ▴ 140

Hi,

I have a big file as below;

GeneID  GO  geneid

Could you please help me, to have a dataframe with 3 more columns for independently? So these values would come in front of corresponding genes and GO term

Thanks a lot for any help

R bash text parsing • 2.4k views
ADD COMMENT
0
Entering edit mode

dput would help better understanding df.

ADD REPLY
0
Entering edit mode

Sorry, help me please

Instead of placed in below rows of genes, I want them independently in 3 columns

For example ideal results would be

ADD REPLY
0
Entering edit mode

Where is the data gone?

ADD REPLY
4
Entering edit mode
6.1 years ago
Ram 44k

I think you can use awk on the mixed-up file and match $1 to a floating point value [0-9.]+, writing only those rows to a separate file. Then you can grep -Fvx the newly generated file against the larger file to create a third file that excludes all numeric valued rows, containing only the GeneID-GO-geneid rows.

Now you'd have two files with two different, internally consistent data formats that you can use any tool of your choice to process.

I'm not giving you a ready-made solution as I've addressed most important parts and you should be able to play around with awk and grep and get to the final solution.

ADD COMMENT
0
Entering edit mode

Sorry you mean

awk $1 [0-9.]+ file.txt > file1.txt

Gives

awk: syntax error at source line 1
 context is
     >>> [ <<< 
awk: bailing out at source line 1
ADD REPLY
1
Entering edit mode

That is not how awk works. At least make an attempt to read some documentation

ADD REPLY
0
Entering edit mode

Sorry believe me but I don't have time stress does not allow me to focus on googling

However I found this in R, I wanna try that

ADD REPLY
2
Entering edit mode

Whatever works for you, I guess. These utilities are quite easy to learn, here is awk syntax:

awk 'pattern { action }' in_file

pattern is the part where you can use $1, etc to pick where you want to do an operation/action, and action is the operation/action you wish to perform. awk '$1==10 {print}' prints where the first column has value 10.

See this tl;dr page for a few examples.

ADD REPLY
0
Entering edit mode

Hi, @RamRS

Please don't be angry at me

Honestly, yesterday an office mate asked me to solve this problem for him but neither I had time nor enough knowledge to solve that; So, I asked my question here and biostars kindly solved my problem as always. I sent him the solution and he got pleased a lot because he does know too much less than me in such staffs. Suddenly, I thought by copying and pasting these code in google, he would simply find my post here. So, honestly, I was shy if he knew how I solved his problem. Therefore, I removed some part of data because gene IDs is specific to his working data. I am sorry for that and thank you for helping me

Believe, I am not a too bad person :( a bit stupid though

ADD REPLY
0
Entering edit mode

Za, there is nothing to be angry about. We know you're doing your best, I would nevertheless like for you to try a bit harder to learn basic unix utilities as they empower you and when you wield them well, they can give a silver lining to a bad day.

IMHO you should not take on tasks that you cannot handle, especially if there is an option to pass on them. At this point, you need to focus on your learning and avoid taking on time-bound tasks that you cannot learn from in a relaxed manner. If you try to do people favors by stretching yourself thin, you're not giving yourself the attention you deserve. Help others but not at your cost.

You're neither a bad person nor are you stupid. Don't call yourself names, that only affects your growth. You might feel stupid at times, that does not make you a stupid person.

ADD REPLY
3
Entering edit mode
6.1 years ago
n,n ▴ 370

I also came up with a scripting solution which I'll explain step by step (but won't throw the literal code since this is a good exercise for learning). I also assumed your input is tab-delimited, that all gene Ids start with "DDB" and that there is only 3 values per unique GO ID as shown in your post:

1- isolate the lines that contain pvalue, oddsratio and expcount with grep in a new file

2- get a GO ID element from column 2 of the new file with awk and print the corresponding $1 column (values themselves) if the GO ID element matches $2 (GO IDs) in the new file, you should get 3 values.

3- paste the 3 printed matches to a tmp file in a single row

4- define variables for each column in the tmp file (3 variables each one pointing to pvalue, oddsratio and expcount respectively)

5- grep your original input for the GO ID that you are working with from step 2 and sed the resulting lines with the 3 defined variables separated by tabs at the end of each line, append the output to a file that will be your output file.

6- this process describes what you need to do for a single GO ID, however you will notice it is possible to organize the whole thing starting from step 2 in a loop so that you automate the process for a list of uniq GO IDs that you can get in step 2.

Hope this helps, what you wanna do can be kinda tricky so don't feel discouraged.

ADD COMMENT
3
Entering edit mode
6.1 years ago
zx8754 12k

By the time I posted this answer the data got removed, so I am using below as example input:

# example dataframe
df1 <- read.table(text = "a   ID  desc
                  DDB_G0284157    GO:0042981  geneIds1
                  DDB_G0285005    GO:0042981  geneIds2
                  DDB_G0285251    GO:0042981  geneIds3
                  DDB_G0285411    GO:0042981  geneIds4
                  DDB_G0289445    GO:0042981  geneIds5
                  DDB_G0294046    GO:0042981  geneIds6
                  0.498871327 GO:0042981  pvalue
                  1.641005803 GO:0042981  oddsRatio
                  0.652256034 GO:0042981  expCount
                  DDB_G0291275    GO:0006873  geneIds20
                  DDB_G0291858    GO:0006873  geneIds21
                  DDB_G0292412    GO:0006873  geneIds22
                  DDB_G0292564    GO:0006873  geneIds23
                  0.929591794 GO:0006873  pvalue
                  0.371461227 GO:0006873  oddsRatio
                  2.500314795 GO:0006873  expCount
                  DDB_G0270658    GO:0009725  geneIds1
                  DDB_G0280961    GO:0009725  geneIds2
                  DDB_G0289445    GO:0009725  geneIds3
                  0.292013941 GO:0009725  pvalue
                  4.105415861 GO:0009725  oddsRatio
                  0.326128017 GO:0009725  expCount", header = TRUE)

# split on ID, then get columns from desc
res <- do.call(rbind,
               lapply(split(df1, df1$ID), function(i){
                 x1 <- head(i, -3)
                 x2 <- tail(i, 3)
                 x3 <- as.data.frame(t(x2[, "a", drop = FALSE]))
                 colnames(x3) <- x2$desc
                 cbind(x1, x3)
               }))
rownames(res) <- NULL

# output
res
#               a         ID      desc      pvalue   oddsRatio    expCount
# 1  DDB_G0291275 GO:0006873 geneIds20 0.929591794 0.371461227 2.500314795
# 2  DDB_G0291858 GO:0006873 geneIds21 0.929591794 0.371461227 2.500314795
# 3  DDB_G0292412 GO:0006873 geneIds22 0.929591794 0.371461227 2.500314795
# 4  DDB_G0292564 GO:0006873 geneIds23 0.929591794 0.371461227 2.500314795
# 5  DDB_G0270658 GO:0009725  geneIds1 0.292013941 4.105415861 0.326128017
# 6  DDB_G0280961 GO:0009725  geneIds2 0.292013941 4.105415861 0.326128017
# 7  DDB_G0289445 GO:0009725  geneIds3 0.292013941 4.105415861 0.326128017
# 8  DDB_G0284157 GO:0042981  geneIds1 0.498871327 1.641005803 0.652256034
# 9  DDB_G0285005 GO:0042981  geneIds2 0.498871327 1.641005803 0.652256034
# 10 DDB_G0285251 GO:0042981  geneIds3 0.498871327 1.641005803 0.652256034
# 11 DDB_G0285411 GO:0042981  geneIds4 0.498871327 1.641005803 0.652256034
# 12 DDB_G0289445 GO:0042981  geneIds5 0.498871327 1.641005803 0.652256034
# 13 DDB_G0294046 GO:0042981  geneIds6 0.498871327 1.641005803 0.652256034
ADD COMMENT
3
Entering edit mode
6.1 years ago
Chirag Parsania ★ 2.0k

Tidy way

## input
df1 <- read.table(text = "a   ID  desc
                  DDB_G0284157    GO:0042981  geneIds1
                  DDB_G0285005    GO:0042981  geneIds2
                  DDB_G0285251    GO:0042981  geneIds3
                  DDB_G0285411    GO:0042981  geneIds4
                  DDB_G0289445    GO:0042981  geneIds5
                  DDB_G0294046    GO:0042981  geneIds6
                  0.498871327 GO:0042981  pvalue
                  1.641005803 GO:0042981  oddsRatio
                  0.652256034 GO:0042981  expCount
                  DDB_G0291275    GO:0006873  geneIds20
                  DDB_G0291858    GO:0006873  geneIds21
                  DDB_G0292412    GO:0006873  geneIds22
                  DDB_G0292564    GO:0006873  geneIds23
                  0.929591794 GO:0006873  pvalue
                  0.371461227 GO:0006873  oddsRatio
                  2.500314795 GO:0006873  expCount
                  DDB_G0270658    GO:0009725  geneIds1
                  DDB_G0280961    GO:0009725  geneIds2
                  DDB_G0289445    GO:0009725  geneIds3
                  0.292013941 GO:0009725  pvalue
                  4.105415861 GO:0009725  oddsRatio
                  0.326128017 GO:0009725  expCount", header = TRUE) %>% as_tibble()


## output
library(tidyverse)
out <- df1 %>% 
 spread(key = desc , value = a) %>% ## separate gene ids from other variable 
 gather(key = "vars" , value = "val" , c(expCount,pvalue,oddsRatio)) %>% ## take numeric vars (pvalue ,oddsRatio, expCount) under one column
  gather(key = "desc" , value = "gene_id" , geneIds1:geneIds6)  %>% ##  desc and gene id in separate col
  spread(key = vars , value = val) ## spread numeric vars 

> out 
# A tibble: 30 x 6
   ID         desc      gene_id      expCount    oddsRatio   pvalue     
   <fct>      <chr>     <chr>        <chr>       <chr>       <chr>      
 1 GO:0006873 geneIds1  NA           2.500314795 0.371461227 0.929591794
 2 GO:0006873 geneIds2  NA           2.500314795 0.371461227 0.929591794
 3 GO:0006873 geneIds20 DDB_G0291275 2.500314795 0.371461227 0.929591794
 4 GO:0006873 geneIds21 DDB_G0291858 2.500314795 0.371461227 0.929591794
 5 GO:0006873 geneIds22 DDB_G0292412 2.500314795 0.371461227 0.929591794
 6 GO:0006873 geneIds23 DDB_G0292564 2.500314795 0.371461227 0.929591794
 7 GO:0006873 geneIds3  NA           2.500314795 0.371461227 0.929591794
 8 GO:0006873 geneIds4  NA           2.500314795 0.371461227 0.929591794
 9 GO:0006873 geneIds5  NA           2.500314795 0.371461227 0.929591794
10 GO:0006873 geneIds6  NA           2.500314795 0.371461227 0.929591794
# ... with 20 more rows
ADD COMMENT
1
Entering edit mode

Tiny update:

NA from gene_id column should be removed. Final code would be

out <- df1 %>% spread(key = desc , value = a) %>% ## separate gene ids from other variable 
  gather(key = "vars" , value = "val" , c(expCount,pvalue,oddsRatio)) %>% ## take numeric vars (pvalue ,oddsRatio, expCount) under one column
  gather(key = "desc" , value = "gene_id" , geneIds1:geneIds6)  %>% ##  desc and gene id in separate col
  spread(key = vars , value = val) %>% filter(!is.na(gene_id))

>out 
# A tibble: 13 x 6
   ID         desc      gene_id      expCount    oddsRatio   pvalue     
   <fct>      <chr>     <chr>        <chr>       <chr>       <chr>      
 1 GO:0006873 geneIds20 DDB_G0291275 2.500314795 0.371461227 0.929591794
 2 GO:0006873 geneIds21 DDB_G0291858 2.500314795 0.371461227 0.929591794
 3 GO:0006873 geneIds22 DDB_G0292412 2.500314795 0.371461227 0.929591794
 4 GO:0006873 geneIds23 DDB_G0292564 2.500314795 0.371461227 0.929591794
 5 GO:0009725 geneIds1  DDB_G0270658 0.326128017 4.105415861 0.292013941
 6 GO:0009725 geneIds2  DDB_G0280961 0.326128017 4.105415861 0.292013941
 7 GO:0009725 geneIds3  DDB_G0289445 0.326128017 4.105415861 0.292013941
 8 GO:0042981 geneIds1  DDB_G0284157 0.652256034 1.641005803 0.498871327
 9 GO:0042981 geneIds2  DDB_G0285005 0.652256034 1.641005803 0.498871327
10 GO:0042981 geneIds3  DDB_G0285251 0.652256034 1.641005803 0.498871327
11 GO:0042981 geneIds4  DDB_G0285411 0.652256034 1.641005803 0.498871327
12 GO:0042981 geneIds5  DDB_G0289445 0.652256034 1.641005803 0.498871327
13 GO:0042981 geneIds6  DDB_G0294046 0.652256034 1.641005803 0.498871327
ADD REPLY
1
Entering edit mode
df1 %>% 
    spread(desc,-desc) %>%
    gather(Genes,a,-c(pvalue,ID,oddsRatio,expCount),na.rm = T) %>%
    arrange(ID) %>% 
    select(ID,Genes,a,expCount:pvalue)

           ID     Genes            a    expCount   oddsRatio      pvalue
1  GO:0006873 geneIds20 DDB_G0291275 2.500314795 0.371461227 0.929591794
2  GO:0006873 geneIds21 DDB_G0291858 2.500314795 0.371461227 0.929591794
3  GO:0006873 geneIds22 DDB_G0292412 2.500314795 0.371461227 0.929591794
4  GO:0006873 geneIds23 DDB_G0292564 2.500314795 0.371461227 0.929591794
5  GO:0009725  geneIds1 DDB_G0270658 0.326128017 4.105415861 0.292013941
6  GO:0009725  geneIds2 DDB_G0280961 0.326128017 4.105415861 0.292013941
7  GO:0009725  geneIds3 DDB_G0289445 0.326128017 4.105415861 0.292013941
8  GO:0042981  geneIds1 DDB_G0284157 0.652256034 1.641005803 0.498871327
9  GO:0042981  geneIds2 DDB_G0285005 0.652256034 1.641005803 0.498871327
10 GO:0042981  geneIds3 DDB_G0285251 0.652256034 1.641005803 0.498871327
11 GO:0042981  geneIds4 DDB_G0285411 0.652256034 1.641005803 0.498871327
12 GO:0042981  geneIds5 DDB_G0289445 0.652256034 1.641005803 0.498871327
13 GO:0042981  geneIds6 DDB_G0294046 0.652256034 1.641005803 0.498871327
ADD REPLY
0
Entering edit mode

Good work, Chirag!

ADD REPLY
0
Entering edit mode

Thanks, Kevin.......!

ADD REPLY

Login before adding your answer.

Traffic: 1659 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6