Question

getting intersect between two lists of genes in R

2

Entering edit mode

9.5 years ago

zizigolu ★ 4.3k

Friends,

I have a microarray dataset AGI IDs in row and samples in column and txt file of transcription factor (AGI IDs), my supervisor asked me to find and extract the expression of only genes existed in my transcription factor list in another word I should extract the rows those are transcription factor. as already I need your help

Thank you

gene R • 33k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

Ram · Accepted Answer · 2015-10-27

4

Entering edit mode

9.5 years ago

Alex Reynolds 36k

Take a look at ?intersect.

> v1 <- c("GATA1", "CTCF", "HOXB1")
> v2 <- c("C3", "ATF", "SOX1", "CTCF", "OCT3")
> intersect(v1, v2)
[1] "CTCF"
>

If you are trying to build a subset of a data frame, you can use %in%. See the examples here.

ADD COMMENT • link 9.5 years ago by Alex Reynolds 36k

0

Entering edit mode

sorry Alex,

I did like below but the input file contains the same rows as my microarray dataset while I need only the rows shared between two files

> setwd("E:/Affy data Col-0 priming/detailed_results")
> mycounts <- read.table("raw_rma_normalized_expression_values.txt", sep="\t", header=TRUE)
> S <- read.table("excluding.txt", sep="", header=F)
> v1 <- c(mycounts)
> v2 <- c(S)
> intersect(v1, v2)
list()
> list()
list()
> write.table(mycounts, file = "final_list.txt", sep = "\t", col.names = TRUE, row.names = FALSE)
>

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

What you are trying to pass to intersect is not what that function takes as input. Run ?intersect to see what it takes as input. Then look again at tables mycounts and S and decide what rows or columns you need to extract from those tables to pass to intersect.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Alex Reynolds 36k

0

Entering edit mode

thank you, but mycounts has 22000 rows and S has 100 rows, then I need 100 rows from mycounts which the same with S

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

You need to extract whichever row or column from mycounts contains your genes of interest. You need to do the same for S. Then pass those to intersect.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Alex Reynolds 36k

0

Entering edit mode

sorry, I donno how to extract...then I should extract 22000 rows from one file and 100 rows from another one then pass those to intersect.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Perhaps you are trying to take a subset of one table, based on values in a second table. It's not clear to me what you are trying to do. Maybe you need to use the %in% operator. Perhaps read the edit to my answer and go through the linked examples, as well as search online about this operator.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Alex Reynolds 36k

0

Entering edit mode

Alex,

I am only trying to extract 100 genes from 22000 genes. 100 genes are in S file and 22000 genes are in my microarray dataset

anyway thank you

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Alex gave you the correct hint. You only need to select the corresponding column IDs:

intersect(v1$id, v2$probe_id)

In case the columns are named id and probe_id

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by michael.ante ★ 4.0k

0

Entering edit mode

thank you

I did like below

mycounts <- read.table("RMA.txt", sep="\t", header=TRUE)
> S <- read.table("excluding.txt", sep="", header=F)
> v1 <- c(mycounts)
> v2 <- c(S)
> intersect(v1$identifier, v2$IDs)
NULL
>

what is the NULL??

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Try to leave out the c(mycounts), this will mess up your data...

intersect(mycounts$identifier,S$IDs)

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by michael.ante ★ 4.0k

0

Entering edit mode

thank you,

> mycounts <- read.table("RMA.txt", sep="\t", header=TRUE)
> S <- read.table("excluding.txt", sep="", header=F)
> intersect(mycounts$identifier,S$IDs)
NULL
> write.table(mycounts, file = "RMA111.txt", dec = ".", sep = "\t", quote = FALSE)

I open the output file but it has 21000 row while I need only the 100 genes which I name with S variable

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

2

Entering edit mode

In case of reading S, you state header=F. Which means you have S$V1 representing the first column.

Please have a look at your data (e.g. with head(S)) and what data type is used.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by michael.ante ★ 4.0k

0

Entering edit mode

sorry, the result was the same

> head(S)
        IDs
1 AT1G53540
2 AT4G10250
3 AT5G12020
4 AT4G27670
5 AT5G59720
6 AT1G18970

> setwd("E:/Affy data Col-0 priming")
> raw_rma_normalized_expression_values <- read.delim("E:/Affy data Col-0 priming/raw_rma_normalized_expression_values.txt", header=FALSE)
>   View(raw_rma_normalized_expression_values)
> excluding <- read.csv("E:/Affy data Col-0 priming/excluding.txt", sep="")
>   View(excluding)
> mycounts <- read.table("RMA.txt", sep="\t", header=TRUE)
> S <- read.table("excluding.txt", sep="", header=T)
> intersect(mycounts$identifier,S$IDs)
character(0)
> write.table(mycounts, file = "RMA111.txt", dec = ".", sep = "\t", quote = FALSE)
> mycounts <- read.table("RMA.txt", sep="\t", header=TRUE)
> S <- read.table("excluding.txt", sep="", header=F)
> S <- read.table("excluding.txt", sep="", header=T)
> v1 <- c(mycounts)
> v2 <- c(S)
> subset(mycounts, identifier %in% S$V1)
Error in match(x, table, nomatch = 0L) : object 'identifier' not found
> intersect(mycounts$identifier,S$IDs)
character(0)
> subset(mycounts, identifier %in% S$V1)
Error in match(x, table, nomatch = 0L) : object 'identifier' not found
>

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

subset(mycounts, identifier %in% S$V1)

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

subset(mycounts, identifier %in% S$V1)
Error in match(x, table, nomatch = 0L) : object 'identifier' not found

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

identifier: name of the column containing IDs in your mycounts dataset
S$V1: the first column of your S dataset

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Yes but told Error in match(x, table, nomatch = 0L) : object 'identifier' not found

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Just substitute identifier with the name of the id column in your mycounts data frame. The error you get means that there is no column called identifier in your mycounts dataframe. It would be easier to help you posted at least the first lines of the datasets you want to intersect.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

sorry, look at the below please

> head(mycounts)
  Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL
1  AT1G01010      5.376319      5.781676      5.205693      5.753107      6.853383
2  AT1G01030      5.042002      5.449904      4.823182      5.340101      4.481017
3  AT1G01040      5.715077      7.453879      5.263682      7.668929      7.782487
4  AT1G01050      9.355328      9.922944      9.380094      9.816947     10.250628
5  AT1G01060      9.167021     11.080291      9.009276     10.990459      7.695344
6  AT1G01070      5.580228      6.058605      5.448064      5.858959      7.468354

then what is the solution?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

subset(mycounts, Identifier %in% S$V1)  # upper-case Identifier

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

thank you, no error anymore

but the output file contains 22000 yet while my purpose here is extracting only 100 rows (the same excluding gene list) from those 22000 rows in mycounts. maybe intersection means different and I should apply another word for my question. something like removing all rows except the 100 genes in excluding file.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

How many rows there are in dataset S? Can you post the first lines?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

totally 102 lines

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

The code above should extract only the 102 genes in the mycounts dataset. See this example:

> mycounts = read.table('mycounts.csv')
> mycounts
  Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL
1  AT1G01010      5.376319      5.781676      5.205693      5.753107      6.853383
2  AT1G01030      5.042002      5.449904      4.823182      5.340101      4.481017
3  AT1G01040      5.715077      7.453879      5.263682      7.668929      7.782487
4  AT1G01050      9.355328      9.922944      9.380094      9.816947     10.250628
5  AT1G01060      9.167021     11.080291      9.009276     10.990459      7.695344
6  AT1G01070      5.580228      6.058605      5.448064      5.858959      7.468354
> S = read.table('genes_to_keep.csv', header=F)
> S
         V1
1 AT1G01030
2 AT1G01060
> subset(mycounts, Identifier %in% S$V1)
  Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL
2  AT1G01030      5.042002      5.449904      4.823182      5.340101      4.481017
5  AT1G01060      9.167021     11.080291      9.009276     10.990459      7.695344

I am not sure if this is what you are asking, but you can do the inverse filter by using the ! operator:

> subset(mycounts, ! Identifier %in% S$V1)
  Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL
1  AT1G01010      5.376319      5.781676      5.205693      5.753107      6.853383
3  AT1G01040      5.715077      7.453879      5.263682      7.668929      7.782487
4  AT1G01050      9.355328      9.922944      9.380094      9.816947     10.250628
6  AT1G01070      5.580228      6.058605      5.448064      5.858959      7.468354

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Note that subset will only save the filtered dataset on the screen. To save it to a file, you need to assign it to a variable and save it:

> mycounts.filtered = subset(mycounts, Identifier %in% S$V1)
> write.table(mycounts.filtered, file = "RMA111.txt", dec = ".", sep = "\t", quote = FALSE)

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Giovanni M Dall'Olio 28k