Question

How to combine multiple data files by column names?

2

5.9 years ago

John ▴ 270

Hi there,

How can I import around 100 RSEM result files into R and combine their TPM values into a single matrix (with the file names as column names) using simple code?

RNA-Seq rna-seq R • 6.0k views

ADD COMMENT • link updated 5.9 years ago by zx8754 12k • written 5.9 years ago by John ▴ 270

1

Make the most of the list.files() command, and then import the files via a for or foreach loop. foreach can be parallelised when used with the %dopar% operator. The actual command to read each file in could be read.table() or data.table's fread().

Then, Bob's your uncle.

ADD REPLY • link 5.9 years ago by Kevin Blighe 89k
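
A minimal sketch of the approach suggested in the comment above, using base R only. The function name read_tpm_matrix and its arguments are illustrative, not from the thread, and the sketch assumes every file is a tab-delimited RSEM *.genes.results table with gene_id and TPM columns, listing the same genes in the same order.

```r
# Sketch: read the TPM column of every RSEM gene-level file in a directory
# and return a numeric matrix (genes in rows, one column per file).
# Assumption: all files list the same genes in the same order.
read_tpm_matrix <- function(dir = ".", pattern = "\\.genes\\.results$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  if (length(files) == 0) stop("no files matching ", pattern, " in ", dir)

  # Gene identifiers are taken from the first file.
  gene_ids <- read.delim(files[1], stringsAsFactors = FALSE)$gene_id

  # vapply() acts as the loop: one numeric TPM vector per file.
  tpm <- vapply(files, function(f) {
    x <- read.delim(f, stringsAsFactors = FALSE)
    stopifnot(identical(x$gene_id, gene_ids))  # guard against reordered files
    as.numeric(x$TPM)
  }, FUN.VALUE = numeric(length(gene_ids)))

  # Column names are the file names, as asked in the question.
  matrix(tpm, nrow = length(gene_ids),
         dimnames = list(gene_ids, basename(files)))
}

# Usage (commented out because it needs RSEM files in the working directory):
# tpm <- read_tpm_matrix(".")
```

Swapping read.delim() for fread(), or the vapply() call for a foreach loop with %dopar%, would follow the same pattern; for about 100 small files the serial version is likely fast enough.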

score 5 · Answer 1 · 2019-03-31

This will do it.

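# Read gene_id and TPM from every RSEM output and bind the columns side by side.
# Assumes every file lists the same genes in the same order.
DF = do.call(cbind,
         lapply(list.files(pattern = "\\.genes\\.results$"),
                FUN = function(x) {
                    aColumn = read.table(x, header = TRUE)[, c("gene_id", "TPM")]
                    colnames(aColumn)[2] = x   # name the TPM column after its file
                    aColumn
                }))
# Drop the repeated gene_id columns, keeping only the first one.
DF = DF[, !duplicated(colnames(DF))]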

Result:

             gene_id GSM2537147.genes.results GSM2537148.genes.results
1 ENSMUSG00000000001                    31.44                    29.18
2 ENSMUSG00000000003                     0.00                     0.00
3 ENSMUSG00000000028                     1.30                     1.93
4 ENSMUSG00000000031                     0.82                     0.32
5 ENSMUSG00000000037                     0.71                     0.43
6 ENSMUSG00000000049                     0.29                     0.71
  GSM2537149.genes.results GSM2537150.genes.results GSM2537151.genes.results
1                    32.22                    30.51                    28.42
2                     0.00                     0.00                     0.00
3                     0.04                     2.17                     1.34
4                     0.00                     0.39                     0.05
5                     0.66                     0.72                     0.53
6                     0.00                     1.33                     0.41
  GSM2537152.genes.results GSM2537153.genes.results GSM2537154.genes.results
1                    34.46                    28.95                    32.44
2                     0.00                     0.00                     0.00
3                     2.95                     1.46                     1.34
4                     0.18                     0.74                     0.00
5                     0.43                     0.50                     0.34
6                     0.14                     0.72                     0.38
  GSM2537155.genes.results GSM2537156.genes.results GSM2537157.genes.results
1                    27.64                    30.24                    26.87
2                     0.00                     0.00                     0.00
3                     1.96                     2.20                     1.40
4                     0.13                     0.19                     0.44
5                     0.76                     1.46                     0.43
6                     0.83                     0.30                     0.95
  GSM2537158.genes.results GSM2537159.genes.results GSM2537160.genes.results
1                    27.96                    29.52                    28.74
2                     0.00                     0.00                     0.00
3                     2.01                     1.18                     1.81
4                     0.19                     0.25                     0.35
5                     0.42                     0.88                     0.67
6                     0.25                     0.27                     0.41
  GSM2537161.genes.results
1                    31.17
2                     0.00
3                     2.24
4                     0.11
5                     0.40
6                     0.83
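
The cbind() approach above lines rows up by position, so it relies on every file listing the same genes in the same order. If that cannot be guaranteed, one option is to join on gene_id instead. The following is a sketch under that assumption; the function name merge_tpm_by_gene is hypothetical, and stripping the .genes.results suffix from the column names is an optional choice, not something the thread asks for.

```r
# Sketch: join TPM columns on gene_id rather than by row position, so files
# with differently ordered (or partly missing) genes still line up.
merge_tpm_by_gene <- function(files) {
  tables <- lapply(files, function(f) {
    x <- read.delim(f, stringsAsFactors = FALSE)[, c("gene_id", "TPM")]
    # Label the TPM column with the file name minus the RSEM suffix.
    names(x)[2] <- sub("\\.genes\\.results$", "", basename(f))
    x
  })
  # Successive full outer joins; genes absent from a file get NA.
  Reduce(function(a, b) merge(a, b, by = "gene_id", all = TRUE), tables)
}

# Usage (commented out because it needs RSEM files in the working directory):
# DF <- merge_tpm_by_gene(list.files(pattern = "\\.genes\\.results$"))
```

Repeated merge() calls are slower than a single cbind(), so when the files are known to share one gene order the original answer is the simpler choice.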

Nicolas Rosewick · Answer 2 · 2019-03-31

2

5.9 years ago

grant.hovhannisyan ★ 2.6k

A general solution would be something along the following lines:

# create an empty dataframe
data<-NULL
# iterate through file names
for (f in c("file1","file2")){
  # open each file
  file<-read.table(f)
  # append specific column of the file to the dataframe
  data<-cbind(data,file[,1])
}
#rename column names
colnames(data)<-c("file1","file2")

If there are many files, you can write their names into a separate file and read the names from that file.

ADD COMMENT • link updated 5.9 years ago by Nicolas Rosewick 11k • written 5.9 years ago by grant.hovhannisyan ★ 2.6k

3

Just a small comment concerning your code: it's not good practice to put a cbind within a loop (it's not very efficient). It's faster to create a list (data <- list()) before the loop, then replace data<-cbind(data,file[,1]) with data[[i]] <- file[,1] (where i is a numeric loop index) and call data <- do.call(cbind,data) after the loop.

ADD REPLY • link 5.9 years ago by Nicolas Rosewick 11k

1

Yes, avoid growing objects in a loop, and create a list with predefined length: myList <- vector("list", length = length(list.files(...)))

ADD REPLY • link 5.9 years ago by zx8754 12k

0

good to know, thank you!

ADD REPLY • link 5.9 years ago by grant.hovhannisyan ★ 2.6k
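
Putting the two suggestions from this comment thread together, the loop might look like the sketch below. The function name combine_column and its parameters are illustrative only, and the files are assumed to be tab-delimited with a header row, all with the same number of rows.

```r
# Sketch of the loop with a preallocated list instead of repeated cbind().
# `files` and `column` are hypothetical parameters, not from the thread.
combine_column <- function(files, column = "TPM") {
  data <- vector("list", length = length(files))  # list of known length
  names(data) <- basename(files)                  # these become the column names

  for (i in seq_along(files)) {
    file <- read.delim(files[i], stringsAsFactors = FALSE)
    data[[i]] <- file[[column]]                   # fill slot i of the list
  }

  do.call(cbind, data)                            # bind all columns once, after the loop
}

# Usage (commented out because it needs input files on disk):
# tpm <- combine_column(list.files(pattern = "\\.genes\\.results$"))
```

Because the list names are set up front, no separate colnames() call is needed after the loop.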

score 1 · Answer 3 · 2019-04-01

1

5.9 years ago

zx8754 12k

Paste all the files side by side, then import:

library(data.table)

myData <- fread(cmd = "paste *.genes.results")

ADD COMMENT • link 5.9 years ago by zx8754 12k
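
Pasting whole files side by side brings in every column of every file, not just TPM, so a selection step is still needed afterwards. The sketch below shows one way to do that with base R's pipe() in place of fread, so that no extra package is assumed. The function name paste_and_import_tpm is hypothetical, and the sketch assumes a Unix-like shell with the paste utility and files that list the same genes in the same order.

```r
# Sketch: paste the files side by side with the Unix `paste` utility, import
# the result, then keep only the TPM columns. Needs a Unix-like shell.
paste_and_import_tpm <- function(files) {
  # Build the command from an explicit file vector so the column order is known.
  cmd <- paste("paste", paste(shQuote(files), collapse = " "))
  # check.names = FALSE keeps the repeated header names (gene_id, TPM, ...) as they are.
  wide <- read.delim(pipe(cmd), check.names = FALSE, stringsAsFactors = FALSE)

  is_tpm <- names(wide) == "TPM"
  stopifnot(sum(is_tpm) == length(files))   # expect one TPM column per file
  tpm <- wide[, is_tpm, drop = FALSE]
  names(tpm) <- basename(files)             # label each TPM column with its file

  gene_id <- wide[[which(names(wide) == "gene_id")[1]]]
  data.frame(gene_id = gene_id, tpm,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Usage (commented out because it needs RSEM files in the working directory):
# DF <- paste_and_import_tpm(list.files(pattern = "\\.genes\\.results$"))
```

With data.table installed, the same command string could be passed to fread(cmd = ...) and the TPM columns selected in the same way.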