Hi,
I have a big file as below;
GeneID GO geneid
Could you please help me, to have a dataframe with 3 more columns for independently? So these values would come in front of corresponding genes and GO term
Thanks a lot for any help
Hi,
I have a big file as below;
GeneID GO geneid
Could you please help me, to have a dataframe with 3 more columns for independently? So these values would come in front of corresponding genes and GO term
Thanks a lot for any help
I think you can use awk
on the mixed-up file and match $1
to a floating point value [0-9.]+
, writing only those rows to a separate file. Then you can grep -Fvx
the newly generated file against the larger file to create a third file that excludes all numeric valued rows, containing only the GeneID-GO-geneid
rows.
Now you'd have two files with two different, internally consistent data formats that you can use any tool of your choice to process.
I'm not giving you a ready-made solution as I've addressed most important parts and you should be able to play around with awk
and grep
and get to the final solution.
Whatever works for you, I guess. These utilities are quite easy to learn, here is awk
syntax:
awk 'pattern { action }' in_file
pattern
is the part where you can use $1
, etc to pick where you want to do an operation/action, and action
is the operation/action you wish to perform. awk '$1==10 {print}'
prints where the first column has value 10
.
See this tl;dr page for a few examples.
Hi, @RamRS
Please don't be angry at me
Honestly, yesterday an office mate asked me to solve this problem for him but neither I had time nor enough knowledge to solve that; So, I asked my question here and biostars kindly solved my problem as always. I sent him the solution and he got pleased a lot because he does know too much less than me in such staffs. Suddenly, I thought by copying and pasting these code in google, he would simply find my post here. So, honestly, I was shy if he knew how I solved his problem. Therefore, I removed some part of data because gene IDs is specific to his working data. I am sorry for that and thank you for helping me
Believe, I am not a too bad person :( a bit stupid though
Za, there is nothing to be angry about. We know you're doing your best, I would nevertheless like for you to try a bit harder to learn basic unix utilities as they empower you and when you wield them well, they can give a silver lining to a bad day.
IMHO you should not take on tasks that you cannot handle, especially if there is an option to pass on them. At this point, you need to focus on your learning and avoid taking on time-bound tasks that you cannot learn from in a relaxed manner. If you try to do people favors by stretching yourself thin, you're not giving yourself the attention you deserve. Help others but not at your cost.
You're neither a bad person nor are you stupid. Don't call yourself names, that only affects your growth. You might feel stupid at times, that does not make you a stupid person.
I also came up with a scripting solution which I'll explain step by step (but won't throw the literal code since this is a good exercise for learning). I also assumed your input is tab-delimited, that all gene Ids start with "DDB" and that there is only 3 values per unique GO ID as shown in your post:
1- isolate the lines that contain pvalue, oddsratio and expcount with grep
in a new file
2- get a GO ID element from column 2 of the new file with awk
and print the corresponding $1
column (values themselves) if
the GO ID element matches $2
(GO IDs) in the new file, you should get 3 values.
3- paste
the 3 printed matches to a tmp file in a single row
4- define variables for each column in the tmp file (3 variables each one pointing to pvalue, oddsratio and expcount respectively)
5- grep
your original input for the GO ID that you are working with from step 2 and sed
the resulting lines with the 3 defined variables separated by tabs at the end of each line, append the output to a file that will be your output file.
6- this process describes what you need to do for a single GO ID, however you will notice it is possible to organize the whole thing starting from step 2 in a loop
so that you automate the process for a list of uniq
GO IDs that you can get in step 2.
Hope this helps, what you wanna do can be kinda tricky so don't feel discouraged.
By the time I posted this answer the data got removed, so I am using below as example input:
# example dataframe
df1 <- read.table(text = "a ID desc
DDB_G0284157 GO:0042981 geneIds1
DDB_G0285005 GO:0042981 geneIds2
DDB_G0285251 GO:0042981 geneIds3
DDB_G0285411 GO:0042981 geneIds4
DDB_G0289445 GO:0042981 geneIds5
DDB_G0294046 GO:0042981 geneIds6
0.498871327 GO:0042981 pvalue
1.641005803 GO:0042981 oddsRatio
0.652256034 GO:0042981 expCount
DDB_G0291275 GO:0006873 geneIds20
DDB_G0291858 GO:0006873 geneIds21
DDB_G0292412 GO:0006873 geneIds22
DDB_G0292564 GO:0006873 geneIds23
0.929591794 GO:0006873 pvalue
0.371461227 GO:0006873 oddsRatio
2.500314795 GO:0006873 expCount
DDB_G0270658 GO:0009725 geneIds1
DDB_G0280961 GO:0009725 geneIds2
DDB_G0289445 GO:0009725 geneIds3
0.292013941 GO:0009725 pvalue
4.105415861 GO:0009725 oddsRatio
0.326128017 GO:0009725 expCount", header = TRUE)
# split on ID, then get columns from desc
res <- do.call(rbind,
lapply(split(df1, df1$ID), function(i){
x1 <- head(i, -3)
x2 <- tail(i, 3)
x3 <- as.data.frame(t(x2[, "a", drop = FALSE]))
colnames(x3) <- x2$desc
cbind(x1, x3)
}))
rownames(res) <- NULL
# output
res
# a ID desc pvalue oddsRatio expCount
# 1 DDB_G0291275 GO:0006873 geneIds20 0.929591794 0.371461227 2.500314795
# 2 DDB_G0291858 GO:0006873 geneIds21 0.929591794 0.371461227 2.500314795
# 3 DDB_G0292412 GO:0006873 geneIds22 0.929591794 0.371461227 2.500314795
# 4 DDB_G0292564 GO:0006873 geneIds23 0.929591794 0.371461227 2.500314795
# 5 DDB_G0270658 GO:0009725 geneIds1 0.292013941 4.105415861 0.326128017
# 6 DDB_G0280961 GO:0009725 geneIds2 0.292013941 4.105415861 0.326128017
# 7 DDB_G0289445 GO:0009725 geneIds3 0.292013941 4.105415861 0.326128017
# 8 DDB_G0284157 GO:0042981 geneIds1 0.498871327 1.641005803 0.652256034
# 9 DDB_G0285005 GO:0042981 geneIds2 0.498871327 1.641005803 0.652256034
# 10 DDB_G0285251 GO:0042981 geneIds3 0.498871327 1.641005803 0.652256034
# 11 DDB_G0285411 GO:0042981 geneIds4 0.498871327 1.641005803 0.652256034
# 12 DDB_G0289445 GO:0042981 geneIds5 0.498871327 1.641005803 0.652256034
# 13 DDB_G0294046 GO:0042981 geneIds6 0.498871327 1.641005803 0.652256034
Tidy way
## input
df1 <- read.table(text = "a ID desc
DDB_G0284157 GO:0042981 geneIds1
DDB_G0285005 GO:0042981 geneIds2
DDB_G0285251 GO:0042981 geneIds3
DDB_G0285411 GO:0042981 geneIds4
DDB_G0289445 GO:0042981 geneIds5
DDB_G0294046 GO:0042981 geneIds6
0.498871327 GO:0042981 pvalue
1.641005803 GO:0042981 oddsRatio
0.652256034 GO:0042981 expCount
DDB_G0291275 GO:0006873 geneIds20
DDB_G0291858 GO:0006873 geneIds21
DDB_G0292412 GO:0006873 geneIds22
DDB_G0292564 GO:0006873 geneIds23
0.929591794 GO:0006873 pvalue
0.371461227 GO:0006873 oddsRatio
2.500314795 GO:0006873 expCount
DDB_G0270658 GO:0009725 geneIds1
DDB_G0280961 GO:0009725 geneIds2
DDB_G0289445 GO:0009725 geneIds3
0.292013941 GO:0009725 pvalue
4.105415861 GO:0009725 oddsRatio
0.326128017 GO:0009725 expCount", header = TRUE) %>% as_tibble()
## output
library(tidyverse)
out <- df1 %>%
spread(key = desc , value = a) %>% ## separate gene ids from other variable
gather(key = "vars" , value = "val" , c(expCount,pvalue,oddsRatio)) %>% ## take numeric vars (pvalue ,oddsRatio, expCount) under one column
gather(key = "desc" , value = "gene_id" , geneIds1:geneIds6) %>% ## desc and gene id in separate col
spread(key = vars , value = val) ## spread numeric vars
> out
# A tibble: 30 x 6
ID desc gene_id expCount oddsRatio pvalue
<fct> <chr> <chr> <chr> <chr> <chr>
1 GO:0006873 geneIds1 NA 2.500314795 0.371461227 0.929591794
2 GO:0006873 geneIds2 NA 2.500314795 0.371461227 0.929591794
3 GO:0006873 geneIds20 DDB_G0291275 2.500314795 0.371461227 0.929591794
4 GO:0006873 geneIds21 DDB_G0291858 2.500314795 0.371461227 0.929591794
5 GO:0006873 geneIds22 DDB_G0292412 2.500314795 0.371461227 0.929591794
6 GO:0006873 geneIds23 DDB_G0292564 2.500314795 0.371461227 0.929591794
7 GO:0006873 geneIds3 NA 2.500314795 0.371461227 0.929591794
8 GO:0006873 geneIds4 NA 2.500314795 0.371461227 0.929591794
9 GO:0006873 geneIds5 NA 2.500314795 0.371461227 0.929591794
10 GO:0006873 geneIds6 NA 2.500314795 0.371461227 0.929591794
# ... with 20 more rows
Tiny update:
NA
from gene_id column should be removed. Final code would be
out <- df1 %>% spread(key = desc , value = a) %>% ## separate gene ids from other variable
gather(key = "vars" , value = "val" , c(expCount,pvalue,oddsRatio)) %>% ## take numeric vars (pvalue ,oddsRatio, expCount) under one column
gather(key = "desc" , value = "gene_id" , geneIds1:geneIds6) %>% ## desc and gene id in separate col
spread(key = vars , value = val) %>% filter(!is.na(gene_id))
>out
# A tibble: 13 x 6
ID desc gene_id expCount oddsRatio pvalue
<fct> <chr> <chr> <chr> <chr> <chr>
1 GO:0006873 geneIds20 DDB_G0291275 2.500314795 0.371461227 0.929591794
2 GO:0006873 geneIds21 DDB_G0291858 2.500314795 0.371461227 0.929591794
3 GO:0006873 geneIds22 DDB_G0292412 2.500314795 0.371461227 0.929591794
4 GO:0006873 geneIds23 DDB_G0292564 2.500314795 0.371461227 0.929591794
5 GO:0009725 geneIds1 DDB_G0270658 0.326128017 4.105415861 0.292013941
6 GO:0009725 geneIds2 DDB_G0280961 0.326128017 4.105415861 0.292013941
7 GO:0009725 geneIds3 DDB_G0289445 0.326128017 4.105415861 0.292013941
8 GO:0042981 geneIds1 DDB_G0284157 0.652256034 1.641005803 0.498871327
9 GO:0042981 geneIds2 DDB_G0285005 0.652256034 1.641005803 0.498871327
10 GO:0042981 geneIds3 DDB_G0285251 0.652256034 1.641005803 0.498871327
11 GO:0042981 geneIds4 DDB_G0285411 0.652256034 1.641005803 0.498871327
12 GO:0042981 geneIds5 DDB_G0289445 0.652256034 1.641005803 0.498871327
13 GO:0042981 geneIds6 DDB_G0294046 0.652256034 1.641005803 0.498871327
df1 %>%
spread(desc,-desc) %>%
gather(Genes,a,-c(pvalue,ID,oddsRatio,expCount),na.rm = T) %>%
arrange(ID) %>%
select(ID,Genes,a,expCount:pvalue)
ID Genes a expCount oddsRatio pvalue
1 GO:0006873 geneIds20 DDB_G0291275 2.500314795 0.371461227 0.929591794
2 GO:0006873 geneIds21 DDB_G0291858 2.500314795 0.371461227 0.929591794
3 GO:0006873 geneIds22 DDB_G0292412 2.500314795 0.371461227 0.929591794
4 GO:0006873 geneIds23 DDB_G0292564 2.500314795 0.371461227 0.929591794
5 GO:0009725 geneIds1 DDB_G0270658 0.326128017 4.105415861 0.292013941
6 GO:0009725 geneIds2 DDB_G0280961 0.326128017 4.105415861 0.292013941
7 GO:0009725 geneIds3 DDB_G0289445 0.326128017 4.105415861 0.292013941
8 GO:0042981 geneIds1 DDB_G0284157 0.652256034 1.641005803 0.498871327
9 GO:0042981 geneIds2 DDB_G0285005 0.652256034 1.641005803 0.498871327
10 GO:0042981 geneIds3 DDB_G0285251 0.652256034 1.641005803 0.498871327
11 GO:0042981 geneIds4 DDB_G0285411 0.652256034 1.641005803 0.498871327
12 GO:0042981 geneIds5 DDB_G0289445 0.652256034 1.641005803 0.498871327
13 GO:0042981 geneIds6 DDB_G0294046 0.652256034 1.641005803 0.498871327
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
dput would help better understanding df.
Sorry, help me please
Instead of placed in below rows of genes, I want them independently in 3 columns
For example ideal results would be
Where is the data gone?