Question

Creating Count Matrix

5.8 years ago

lamia_203 ▴ 100

I have to create a count table with:

Gene ID   Sample1  Sample2  Sample3
ENSG...   297      0        0

So far I have one file per sample, each with its own table. Each table has 4 columns: the Gene ID, followed by three columns of read counts. I want to take the second column from each sample and build a count matrix in which each column carries its sample's name. How can I make sure each column is labeled with its respective sample?

Thanks

Linux RNA-Seq • 9.0k views

ADD COMMENT • link updated 5.8 years ago by ahaswer ▴ 150 • written 5.8 years ago by lamia_203 ▴ 100

What language/method would you like to use? This can be done using shell, R, Python, Perl, etc., you name it. What are you familiar with?

ADD REPLY • link 5.8 years ago by seidel 11k

Sorry I didn't specify. I'm familiar with R and Linux but prefer to make the count table in Linux since there are a lot of samples (about 3200).

ADD REPLY • link 5.8 years ago by lamia_203 ▴ 100

score 3 · Answer 1 · 2019-01-26

One suggestion using R would be:

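# only return file names with a given pattern
dir(pattern="ReadsPerGene.out.tab")

# save the results to a variable
files <- dir(pattern="ReadsPerGene.out.tab")

# read each file and append its second column (the counts) to the matrix
counts <- c()
for( i in seq_along(files) ){
  x <- read.table(file=files[i], sep="\t", header=FALSE, as.is=TRUE)
  counts <- cbind(counts, x[,2])
}

# set the row names from the gene ID column of the last file read
# (this assumes every file lists the same genes in the same order)
rownames(counts) <- x[,1]
# set the column names based on input file names, with the suffix removed
# (fixed=TRUE so the dots are matched literally, not as regex wildcards)
colnames(counts) <- sub("_ReadsPerGene.out.tab", "", files, fixed=TRUE)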

This example assumes each sample's results are in its own file, with a file name ending in ReadsPerGene.out.tab, as you might find using the STAR aligner.

score 3 · Answer 2 · 2019-01-26

5.8 years ago

ahaswer ▴ 150

If you are using Linux, you can also use paste and awk in the terminal, like so:

paste sample1 sample2 sample3 | awk 'BEGIN {OFS="\t"; FS="\t"}; {print $1, $2, $4, $6}' > count_matrix

This concatenates the listed tables horizontally and extracts the specified columns; the column numbers here assume each file has two columns (gene ID and count). It works for tab-delimited files. If the delimiter is different, set it in OFS and FS.
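To see which columns end up where after paste, it can help to try the command on toy input first. The sketch below is only an illustration: the file names and counts are made up, and each file is assumed to have exactly two tab-separated columns (gene ID, count) with the genes in the same order.

```shell
# Work in a scratch directory so no real files are touched (hypothetical data).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Three toy sample files: gene ID <TAB> count.
printf 'geneA\t297\ngeneB\t5\n' > sample1
printf 'geneA\t0\ngeneB\t12\n'  > sample2
printf 'geneA\t0\ngeneB\t7\n'   > sample3

# paste puts the files side by side, giving six columns per line:
# 1 = gene ID, 2 = sample1 count, 3 = gene ID, 4 = sample2 count,
# 5 = gene ID, 6 = sample3 count.
paste sample1 sample2 sample3 \
  | awk 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2, $4, $6}' > count_matrix

cat count_matrix
```

Running `paste sample1 sample2 sample3 | head` on its own, before adding awk, is a quick way to check the column layout of real files.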

ADD COMMENT • link 5.8 years ago by ahaswer ▴ 150

Since you have a lot of samples (I guess you keep them in a separate directory), it would be much more convenient to avoid listing the desired columns by hand. You can open a terminal in the samples directory and run (assuming it contains only sample files):

paste * | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

It works like the command above, without having to list tons of columns ;)
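Neither command writes a header row, so the sample names asked for in the question would be missing from the result. One possible way to add them is sketched below, under these assumptions: every file in the directory is a two-column sample file, the file names serve as sample names, and the output goes to a hypothetical path outside the samples directory so that * does not pick it up on a later run. The toy names and counts are made up.

```shell
# Toy directory with two-column sample files (hypothetical names and counts).
workdir=$(mktemp -d)
mkdir "$workdir/samples"
printf 'geneA\t297\ngeneB\t5\n' > "$workdir/samples/s1"
printf 'geneA\t0\ngeneB\t12\n'  > "$workdir/samples/s2"
cd "$workdir/samples" || exit 1

out="$workdir/count_matrix.tsv"   # written outside the samples directory

# Header: "GeneID" followed by the file names, in the same glob order paste uses.
{
  printf 'GeneID'
  printf '\t%s' *
  printf '\n'
} > "$out"

# Body: gene ID from column 1, then every second column (the counts).
paste * \
  | awk 'BEGIN {FS="\t"; OFS="\t"} {j=$1; for (i=2; i<=NF; i+=2) j=j FS $i; print j}' >> "$out"

cat "$out"
```

With thousands of files, paste keeps all of its inputs open at the same time, so the per-process open-file limit (`ulimit -n`) may be worth checking before running this on the full set.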

ADD REPLY • link 5.8 years ago by ahaswer ▴ 150

This does use all the samples, thank you. How does the for loop over i work? When I ran this code, the count_matrix included two repeated columns from each sample rather than one.

Thanks

ADD REPLY • link 5.8 years ago by lamia_203 ▴ 100

Maybe you have duplicates inside the directory, or ran the code twice? Also check that the columns inside the sample files are separated by exactly one delimiter, with no repeated delimiters. You can also use the paste command selectively if, for instance, all sample files have the same extension:

paste *.txt | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

The loop: "i" starts at the second column and, as long as "i" is less than or equal to the number of fields NF (i.e. columns), it is increased by 2 after each iteration. Each iteration appends field "i" to "j", preceded by the field separator.
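A step of 2 matches input files that have two columns each. If each file has four columns, as described in the question, the second column of each file recurs every four fields after paste, and a step of 2 would pick up two columns per sample, which may be what produced the repeated columns reported above. A minimal sketch with the step passed in as a variable (ncol is a name chosen here for illustration, and the toy files are made up):

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

# Toy four-column files: gene ID plus three count columns (made-up values).
printf 'geneA\t10\t11\t12\n' > s1.txt
printf 'geneA\t20\t21\t22\n' > s2.txt

# ncol = number of columns per input file; the second column of each file
# then sits at fields 2, 2+ncol, 2+2*ncol, ... of the pasted line.
paste s1.txt s2.txt \
  | awk -v ncol=4 'BEGIN {FS="\t"; OFS="\t"}
      {j=$1; for (i=2; i<=NF; i+=ncol) j=j FS $i; print j}' > count_matrix

cat count_matrix
```

Checking the number of columns in one input file, for example with `head -n 1 s1.txt | awk -F'\t' '{print NF}'`, gives the value to use for ncol.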

ADD REPLY • link 5.8 years ago by ahaswer ▴ 150