How to combine multiple data files by column names?
5.7 years ago
John ▴ 270

Hi there,

How can I import around 100 RSEM result files into R and combine their TPM values into a single matrix, with the file names as column names, using simple code?

RNA-Seq rna-seq R • 5.9k views

Make the most of the list.files() command, then import the files in a for or foreach loop. foreach (from the foreach package) can be parallelised when used with the %dopar% operator and a registered parallel backend. The actual command to read each file could be read.table() or data.table's fread().

Then, Bob's your uncle.

5.7 years ago

This will do it.

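# Read gene_id and TPM from every RSEM gene-level result file in the working
# directory and bind them side by side. This assumes all files list the same
# genes in the same order.
DF = do.call(cbind,
         lapply(list.files(pattern = "\\.genes\\.results$"),
                FUN = function(x) {
                    aColumn = read.table(x, header = TRUE)[, c("gene_id", "TPM")]
                    colnames(aColumn)[2] = x   # name the TPM column after the file
                    aColumn
                }
         )
     )
# Every file contributed its own gene_id column; keep only the first one
DF = DF[, !duplicated(colnames(DF))]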

Result:

             gene_id GSM2537147.genes.results GSM2537148.genes.results
1 ENSMUSG00000000001                    31.44                    29.18
2 ENSMUSG00000000003                     0.00                     0.00
3 ENSMUSG00000000028                     1.30                     1.93
4 ENSMUSG00000000031                     0.82                     0.32
5 ENSMUSG00000000037                     0.71                     0.43
6 ENSMUSG00000000049                     0.29                     0.71
  GSM2537149.genes.results GSM2537150.genes.results GSM2537151.genes.results
1                    32.22                    30.51                    28.42
2                     0.00                     0.00                     0.00
3                     0.04                     2.17                     1.34
4                     0.00                     0.39                     0.05
5                     0.66                     0.72                     0.53
6                     0.00                     1.33                     0.41
  GSM2537152.genes.results GSM2537153.genes.results GSM2537154.genes.results
1                    34.46                    28.95                    32.44
2                     0.00                     0.00                     0.00
3                     2.95                     1.46                     1.34
4                     0.18                     0.74                     0.00
5                     0.43                     0.50                     0.34
6                     0.14                     0.72                     0.38
  GSM2537155.genes.results GSM2537156.genes.results GSM2537157.genes.results
1                    27.64                    30.24                    26.87
2                     0.00                     0.00                     0.00
3                     1.96                     2.20                     1.40
4                     0.13                     0.19                     0.44
5                     0.76                     1.46                     0.43
6                     0.83                     0.30                     0.95
  GSM2537158.genes.results GSM2537159.genes.results GSM2537160.genes.results
1                    27.96                    29.52                    28.74
2                     0.00                     0.00                     0.00
3                     2.01                     1.18                     1.81
4                     0.19                     0.25                     0.35
5                     0.42                     0.88                     0.67
6                     0.25                     0.27                     0.41
  GSM2537161.genes.results
1                    31.17
2                     0.00
3                     2.24
4                     0.11
5                     0.40
6                     0.83
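
Note that cbind relies on every file listing the same genes in the same order. That is usually the case for RSEM runs against one reference, but the code above does not check it. Below is a minimal sketch of a variant that joins on gene_id instead. It assumes the same gene_id/TPM columns as above; the helper name read_tpm and the two toy files are made up for illustration.

```r
# Toy stand-ins for RSEM output (hypothetical values), written to a temp dir
tmp <- tempfile("rsem_")
dir.create(tmp)
write.table(data.frame(gene_id = c("g1", "g2", "g3"), TPM = c(1, 2, 3)),
            file.path(tmp, "A.genes.results"),
            sep = "\t", quote = FALSE, row.names = FALSE)
# Same genes, deliberately in a different order
write.table(data.frame(gene_id = c("g3", "g1", "g2"), TPM = c(30, 10, 20)),
            file.path(tmp, "B.genes.results"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read one file and name its TPM column after the file
read_tpm <- function(path) {
  x <- read.table(path, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)[, c("gene_id", "TPM")]
  colnames(x)[2] <- basename(path)
  x
}

files <- list.files(tmp, pattern = "\\.genes\\.results$", full.names = TRUE)

# Join successive tables on gene_id, so row order within a file does not matter
merged <- Reduce(function(a, b) merge(a, b, by = "gene_id"),
                 lapply(files, read_tpm))
```

Repeated merge calls are slower than a single cbind, so for around 100 files it may be preferable to keep cbind and simply verify that the gene_id columns are identical first.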

Awesome!! Thanks a lot

5.7 years ago

A general solution would be something along the following lines:

# start with an empty object (cbind will turn it into a matrix)
data<-NULL
# iterate through file names
for (f in c("file1","file2")){
  # read each file (add header=TRUE if the files have a header row)
  file<-read.table(f)
  # append the column of interest (here the first) to the result
  data<-cbind(data,file[,1])
}
# name each column after its file
colnames(data)<-c("file1","file2")

If there are many files, you can write their names into a separate file and read the names from that file.
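
Reading the names from a file could look like the sketch below. The list file name samples.txt and the toy inputs are hypothetical; it assumes one file name per line and that each input has a header row with a TPM column.

```r
# Toy inputs in a temp dir (hypothetical values)
tmp <- tempfile("demo_")
dir.create(tmp)
for (n in c("file1", "file2")) {
  write.table(data.frame(gene_id = c("g1", "g2"), TPM = c(1.5, 0)),
              file.path(tmp, n),
              sep = "\t", quote = FALSE, row.names = FALSE)
}
# The list of inputs, one file name per line
writeLines(c("file1", "file2"), file.path(tmp, "samples.txt"))

# Read the names back, then load each file and keep its TPM column.
# sapply returns a matrix here because every file has the same number of rows,
# and it uses the file names as column names.
fileNames <- readLines(file.path(tmp, "samples.txt"))
tpm <- sapply(fileNames, function(f) {
  read.table(file.path(tmp, f), header = TRUE, sep = "\t")$TPM
})
```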


Just a small comment concerning your code: it's not good practice to put a cbind within a loop, because the growing object is copied on every iteration, which is inefficient. It's faster to create a list (data <- list()) before the loop, loop over an index i, replace data<-cbind(data,file[,1]) with data[[i]] <- file[,1], and run data <- do.call(cbind,data) after the loop.


Yes, avoid growing objects in a loop, and create a list with a predefined length: myList <- vector("list", length = length(list.files(...)))
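
Put together, the two suggestions in this thread might look like the sketch below. The toy files and the names files and cols are illustrative only, and it assumes every file lists the same genes in the same order.

```r
# Toy inputs in a temp dir (hypothetical values)
tmp <- tempfile("prealloc_")
dir.create(tmp)
for (n in c("s1.genes.results", "s2.genes.results")) {
  write.table(data.frame(gene_id = c("g1", "g2"), TPM = c(5, 7)),
              file.path(tmp, n),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

files <- list.files(tmp, pattern = "\\.genes\\.results$", full.names = TRUE)

# Preallocate one list slot per file instead of growing an object in the loop
cols <- vector("list", length = length(files))
names(cols) <- basename(files)

for (i in seq_along(files)) {
  cols[[i]] <- read.table(files[i], header = TRUE, sep = "\t")$TPM
}

# Bind once, after the loop; the list names become the column names
tpm <- do.call(cbind, cols)
```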


Good to know, thank you!

5.7 years ago
zx8754 12k

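Paste all the files side by side with the shell's paste command, then import the result. This needs a Unix-like shell, and each file contributes all of its columns, so select the TPM columns afterwards:

library(data.table)

# cmd= tells fread explicitly that the string is a shell command, not a file name
myData <- fread(cmd = "paste *.genes.results")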