Dear All,
First of all my apologies if this has been answered somewhere else and I haven't understood it properly, but I am having a somewhat irritating problem with merging single-cell RNA seq datasets.
I know that across donors it's probably best to perform integration, but I wanted to see what happens when one performs merging.
So, the dataset I am trying to merge contains 64 .txt files (dataset is the 'supplementary file' data: GSE120180_RAW.tar, which can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120180).
as an example, running:
dataset1 <- read.delim("~/GSM3395418_CE01_I1_UMI_count.txt", sep="", header = TRUE)
dataset1[c(1:10), 1:3]
yields:
Gene YF1_I_sc.1 YF1_I_sc.2
1 A2M 0 0
2 A2ML1 0 0
3 A3GALT2 0 0
4 A4GALT 0 0
5 AAAS 2 0
6 AACS 15 12
7 AADAC 0 0
8 AADACL2 0 0
9 AADACL3 1 0
10 AADACL4 0 0
Now, how can I merge all of these datasets, based on the "Gene" column, assuming:
- Columns across multiple files may be named similarly, for example there may be a "YF1_I_sc.1" in another file.
- Some files may have different "numbers" of total genes, for example the length of the gene column may vary across samples.
- If the code you suggest generates NAs how to remove the NAs and replace with 0s.
It would be helpful to create a list I feel, so I ran this to create a list containing 64 dataframes:
file_list <- list.files() #where list.files points towards a wkdir containing the 64 .txt files
list_of_files <- lapply(file_list, read.delim)
however, when I run merged_nhp<- merge_all(list_of_files, by = "Gene"
I get only one file containing 97 cells, wheras there are ~5k cells, across the 64 experiments.
Thanks for the help!
Best,
Fahd
Thanks for the comment @rpolicastro ! When I run your code, I get the following error:
Traceback:
My current working directory has the 64 files in it. Any Idea why the function requires the data to be a dataframe?
Try loading just one file in to see if it returns a proper data.frame.
cts <- read_tsv(list_of_files[1])
. If the files are in your working directory also make sure you don't have the 3 dots inlist.files()
. I just had that as a reminder to put that path where the files are stored.Ok, so sorry I didn't read the first two lines of your first comment, just to elaborate, my goal is to perform non-linear multidimensional reduction, on this dataset GSE120180, via either merging or integrating each of the donors. This dataset consists of 16 donors that were single-cell sequenced in what appears to be batches of 4 each time (it's unclear whether these are technical replicates or not, but based on the total cell number in the paper, I am assuming that the sequencing was performed in batches of 4).
My first goal is to perform merging, where my assumption is that donor heterogeneity will be compensated due to strong cellular phenotypes as it is unclear whether the authors used integration or mergeing. In any case, this will be useful, because each of the donor 'batches' will have to be merged forming 16 files/dfs prior to generating seurat objects and integrating.
Now coming back to your solution, I ran
cts <- list_of_files[[1]]
because when you runlist_of_files <- lapply(file_list, read.delim)
you get a list of 64 dfs. But just to check anyway, I rancts <- list_of_files[[1]]
and thencts[c(1:10), 1:3]
to get the following:Also
is.data.frame(cts)
yields aTRUE
I would like to point out, it's my understanding of thelist.files
here is a character vector of file names, whilelist_of_files
is a list of dfs (please correct me if I am wrong). Thanks alot, im sure you have better things to do, but in the past I have been manually creating individual seurat objects and then merging them all usingmerge
in seurat, and quite honestly I think now that thats just dumb. Im hoping your solutions will work and allow me to modify it to merge large datasets, organized as lists of dataframes (dfs) with ease. Any suggestions? Thanks again for the help.Alright, so I think I see what happened. According to what you said you used
list_of_files
, which is a list of data.frames as input toimap
. I was starting with a named vector of file names. If you want to start withlist_of_files
you can run this slightly modified code.Also don't worry about asking questions! As long as you are putting thought into them no one minds.
WOW. That was dumb of me. I just realized what was going on... just to clarify the sequence of events:
Thanks alot @rpolicastro ! Your solution works like a charm. It Also allows you to define which column is the gene list. This helps when working with Ensembl Ids etc. Cheers have a great day!
Your comments in the code look good. Glad I could help!
Ok, so this may be incorrect but I think it may be a solution.
Using the
Reduce
base R function:However I know that
Reduce()
combines from the left, and since the first column is a gene list, its easy to combine I guess. But what if the gene list is within the dataset? More control would be favorable I guess....