Subset multiple columns in R or unix
5.7 years ago

I need to extract many columns from a dataset. I have a very large csv file with thousands of columns and rows. In R for example, I can read it in using:

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)

Each column is a gene name. I know how to extract specific columns from my R data.frame with basic code like this:

mydata[, c("GeneName1", "GeneName2")]

But my question is: how do I pull hundreds of gene names, too many to type in? They are listed in a txt file.

I've used grep in UNIX before to pull multiple ROWS using a txt file with the list of genes I need, but I haven't been able to figure out how to do it with Columns.

subset pull columns R subset columns • 8.8k views

Can you transpose the data frame and extract the resulting rows?

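t_mydata <- as.data.frame(t(mydata))
geneList <- read.table("your_geneList.txt", stringsAsFactors = FALSE)$V1
subsampled_mydata <- t_mydata[rownames(t_mydata) %in% geneList, ]

After transposing, the gene names become the row names of t_mydata, so you match on rownames() rather than on a Gene column. Note that t() returns a matrix, hence the as.data.frame() call, and read.table() returns a data frame, hence the $V1 to get the names as a vector.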


Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

5.7 years ago

In R, you can simply subset the data.frame that is returned by read.csv:

> test <- data.frame(A = c(1:3), B = c(3:5), C = c(6:8))
> test
  A B C
1 1 3 6
2 2 4 7
3 3 5 8

## spell out the column names you're interested in
> test[, c("A","B")]
  A B
1 1 3
2 2 4
3 3 5

## or use grepl with an anchored pattern (inside [ ], "|" is a literal character, not "or")
> test[, grepl("^(A|B)$", names(test)) ]
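A regular expression matches substrings unless it is anchored, so a grepl-based filter can pull in columns you did not ask for. If you already have the exact names, matching with `%in%` avoids that. A minimal sketch, assuming a made-up data frame and a hypothetical `wanted` vector (neither is from the original post):

```r
# Toy data frame; the column names are made up for illustration
test <- data.frame(A = 1:3, B = 3:5, AB = 6:8)

# Hypothetical vector holding the exact column names to keep
wanted <- c("A", "B")

# Logical mask: TRUE only where the full column name appears in `wanted`
keep <- names(test) %in% wanted

# drop = FALSE keeps the result a data.frame even if a single column matches
subset_exact <- test[, keep, drop = FALSE]
print(names(subset_exact))
```

Here the column `AB` is left out because its full name is not in `wanted`, whereas an unanchored pattern such as `"A"` would have matched it.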
5.7 years ago

Read your gene list file into a character vector, then subset the columns of your data frame with that vector:

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)
genes_list <- scan("gene_list.txt", character(), quote = "")
mydata.new <- mydata[ ,genes_list]
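To try this without the real files, here is a self-contained sketch that writes two small temporary files first. The file contents and gene names are invented stand-ins, and `check.names = FALSE` is an optional addition of mine: by default read.csv rewrites column names into syntactically valid R names (a name like `HLA-A` would become `HLA.A`), which can stop names from the list matching.

```r
# Stand-in for file.csv: sample IDs in the first column, one gene per column
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample,GeneA,GeneB,GeneC",
             "s1,1,2,3",
             "s2,4,5,6"), csv_path)

# Stand-in for gene_list.txt: one gene name per line
list_path <- tempfile(fileext = ".txt")
writeLines(c("GeneC", "GeneA"), list_path)

# check.names = FALSE keeps the column names exactly as written in the file
mydata <- read.csv(csv_path, header = TRUE, row.names = 1, check.names = FALSE)

# scan() returns a plain character vector of gene names
genes_list <- scan(list_path, what = character(), quote = "", quiet = TRUE)

# Columns come back in the order given in the gene list
mydata.new <- mydata[, genes_list, drop = FALSE]
print(mydata.new)
```

The `drop = FALSE` is there so that a one-gene list still returns a data.frame rather than a bare vector.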

Bastien, this worked, and so simple. Thank you!


Bastien, one more question. Your code works well, but only if every gene on the list is found in the csv file. If R comes to a gene that is not there, it will quit. Is there something I can add to that last line to skip any genes that it does not find, and run the script anyway? The error I am getting is: Error in `[.data.frame`(Mydata, , gene_list) : undefined columns selected

mydata.new <- mydata[ ,intersect(genes_list,colnames(mydata))]
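Since intersect() drops the unmatched names silently, it may be worth listing them so typos or alias mismatches in the gene list do not go unnoticed. A small sketch using setdiff(), with an invented data frame and gene list standing in for the real ones:

```r
# Invented stand-ins for the real data and gene list
mydata <- data.frame(GeneA = 1:2, GeneB = 3:4, row.names = c("s1", "s2"))
genes_list <- c("GeneB", "GeneX", "GeneA")

# Genes present in the data, in the order they appear in the list
found <- intersect(genes_list, colnames(mydata))

# Genes requested but absent from the data
missing <- setdiff(genes_list, colnames(mydata))

if (length(missing) > 0) {
  message("Skipping ", length(missing), " gene(s) not in the data: ",
          paste(missing, collapse = ", "))
}

# drop = FALSE keeps a data.frame even when only one gene is found
mydata.new <- mydata[, found, drop = FALSE]
```

You could also write `missing` to a file with writeLines() if the list of skipped genes is long.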

Thank you!! That worked.
