So far I have one file per sample, each containing a table with four columns: the gene ID, followed by three columns of read counts. I want to take the second column from each sample and build a count matrix in which every column is labelled with its sample name. How can I make each column carry the name of its respective sample?
# only return file names with a given pattern
dir(pattern="ReadsPerGene.out.tab")
# save the results to a variable
files <- dir(pattern="ReadsPerGene.out.tab")
# read each file and append its second column to the matrix
counts <- c()
for( i in seq_along(files) ){
  x <- read.table(file=files[i], sep="\t", header=FALSE, as.is=TRUE)
  counts <- cbind(counts, x[,2])
}
# set the row names (gene IDs from the last file read; assumes every file lists the same genes in the same order)
rownames(counts) <- x[,1]
# set the column names based on input file names, with pattern removed
colnames(counts) <- sub("_ReadsPerGene.out.tab","",files, fixed=TRUE)
This example assumes each sample's results are in a file whose name contains ReadsPerGene.out.tab, as you might find using the STAR aligner.
It will concatenate the specified tables horizontally and extract the specified columns. It works for tab-delimited files; if the delimiter is different, just specify it after "OFS" and "FS".
Since you have a lot of samples (I guess you keep them in a separate directory), it would be much more convenient to avoid specifying the desired columns one by one. Instead, you can open a terminal in the samples directory and run (assuming it contains only sample files):
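A minimal sketch of this idea, not necessarily the exact command meant above. It assumes four tab-separated columns per file, as described in the question, so the loop steps by 4. The scratch directory, file names and values are made up for the demonstration.

```shell
# Hypothetical demo data: two tiny 4-column, tab-delimited sample files.
demo=$(mktemp -d)
printf 'geneA\t5\t1\t4\ngeneB\t0\t0\t0\n' > "$demo/sample1.tab"
printf 'geneA\t7\t2\t5\ngeneB\t3\t3\t0\n' > "$demo/sample2.tab"
out=$(mktemp)   # written outside the samples directory so "*" never picks it up

cd "$demo"
# paste joins all files side by side (4 fields per sample);
# awk keeps field 1 once, then fields 2, 6, 10, ... (column 2 of each sample).
paste * |
  awk 'BEGIN { FS = OFS = "\t" }
       { j = $1; for (i = 2; i <= NF; i += 4) j = j OFS $i; print j }' > "$out"

cat "$out"
```

Writing the result outside the samples directory matters here because `paste *` would otherwise treat an existing output file as one more sample on the next run.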
This does use all the samples, thank you. How does the loop over "i" work? When I ran this code, the count_matrix included two repeated columns from each sample rather than one.
Maybe you have duplicates inside the directory, or ran the code twice? Also check that the columns inside the sample files are separated by a single delimiter, with no repeated delimiters. You can also use the paste command selectively if, for instance, all sample files have the same extension:
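Both suggestions can be sketched in the shell. The example below assumes every line should have exactly four tab-separated fields and that the sample files share the suffix `_ReadsPerGene.out.tab`. The demo files, including the deliberately doubled tab in the second one, are made up.

```shell
# Hypothetical demo data; the first line of s2 contains a doubled tab on purpose.
demo=$(mktemp -d)
printf 'geneA\t5\t1\t4\ngeneB\t0\t0\t0\n' > "$demo/s1_ReadsPerGene.out.tab"
printf 'geneA\t7\t\t2\t5\ngeneB\t3\t3\t0\n' > "$demo/s2_ReadsPerGene.out.tab"
cd "$demo"

# Report every line whose field count differs from the expected 4.
awk -F '\t' 'NF != 4 { printf "%s: line %d has %d fields\n", FILENAME, FNR, NF }' \
  *_ReadsPerGene.out.tab > field_report.txt
cat field_report.txt

# Paste only the files matching the shared suffix, ignoring anything else in the directory.
paste *_ReadsPerGene.out.tab > pasted.tsv
```

An empty report means every file has the expected layout; any line listed there would shift the columns picked up later.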
The loop: starting from the second column, as long as "i" is less than or equal to the number of fields (i.e. columns), add 2 to "i". On each iteration, append the current column to "j", separated by the specified field separator.
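To see what the loop does, it can help to run it on a single pasted line. The sketch below uses made-up values for two samples with four columns each, and a hypothetical helper `pick` that takes the step as its argument. With a step of 2 the loop keeps fields 2, 4, 6 and 8, that is, two columns per sample, which is one possible explanation for the doubled columns reported above. A step of 4 keeps only fields 2 and 6.

```shell
# One pasted line for two hypothetical samples: gene ID + 3 counts, repeated per sample.
line=$(printf 'geneA\t5\t1\t4\tgeneA\t7\t2\t5')

# Hypothetical helper: keep field 1, then every step-th field starting at field 2.
pick() {
  awk -v step="$1" 'BEGIN { FS = OFS = "\t" }
    { j = $1; for (i = 2; i <= NF; i += step) j = j OFS $i; print j }'
}

step2=$(printf '%s\n' "$line" | pick 2)   # fields 2, 4, 6, 8
step4=$(printf '%s\n' "$line" | pick 4)   # fields 2, 6

printf '%s\n%s\n' "$step2" "$step4"
```

In general the step should equal the number of columns in each input file: 2 for two-column count files, 4 for the files described in the question.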
What language/method would you like to use? This can be done using shell, R, Python, Perl, etc., you name it. What are you familiar with?
Sorry I didn't specify. I'm familiar with R and Linux but prefer to make the count table in Linux since there are a lot of samples (about 3200).
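For the sample names asked about in the question, a header line can be built from the file names in the shell. This is only a sketch: it assumes the files share the suffix `_ReadsPerGene.out.tab`, have four tab-separated columns, and list the genes in the same order. The demo files and the `gene_id` label are made up.

```shell
# Hypothetical demo data: two tiny 4-column, tab-delimited sample files.
demo=$(mktemp -d)
printf 'geneA\t5\t1\t4\ngeneB\t0\t0\t0\n' > "$demo/s1_ReadsPerGene.out.tab"
printf 'geneA\t7\t2\t5\ngeneB\t3\t3\t0\n' > "$demo/s2_ReadsPerGene.out.tab"
cd "$demo"

suffix=_ReadsPerGene.out.tab
{
  # Header: one label per file, in the same glob order that paste uses below.
  printf 'gene_id'
  for f in *"$suffix"; do printf '\t%s' "${f%"$suffix"}"; done
  printf '\n'
  # Body: gene ID once, then column 2 of every sample (fields 2, 6, 10, ...).
  paste *"$suffix" |
    awk 'BEGIN { FS = OFS = "\t" }
         { j = $1; for (i = 2; i <= NF; i += 4) j = j OFS $i; print j }'
} > count_matrix.tsv

cat count_matrix.tsv
```

With about 3200 files, paste may run into the per-process limit on open files, since it reads all of its inputs side by side. `ulimit -n` shows the current limit, and pasting in smaller batches is one way around it.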