Creating Count Matrix
5.8 years ago
lamia_203 ▴ 100

I need to create a count table that looks like this:

Gene ID   Sample1  Sample2  Sample3
ENSG...   297      0        0

So far I have one file per sample. Each file contains 4 columns: the gene ID, followed by three columns of read counts. I want to take the second column from each file and combine them into a count matrix in which every column is labeled with its sample name. How can I make sure each column ends up under its respective sample?

Thanks

Linux RNA-Seq • 9.0k views

What language or method would you like to use? This can be done in shell, R, Python, Perl, and so on. What are you familiar with?


Sorry, I didn't specify. I'm familiar with R and Linux, but I'd prefer to build the count table in Linux since there are a lot of samples (about 3200).

5.8 years ago
seidel 11k

One suggestion using R would be:

# only return file names with a given pattern
dir(pattern="ReadsPerGene.out.tab")

# save the results to a variable
files <- dir(pattern="ReadsPerGene.out.tab")

# read each file and collect its second column
counts <- c()
for( i in seq_along(files) ){
  x <- read.table(file=files[i], sep="\t", header=FALSE, as.is=TRUE)
  counts <- cbind(counts, x[,2])
}

# set the row names from the first column of the last file read
# (this assumes every file lists the same genes in the same order)
rownames(counts) <- x[,1]
# set the column names based on input file names, with pattern removed
colnames(counts) <- sub("_ReadsPerGene.out.tab","",files, fixed=TRUE)

This example assumes each sample's results are in a file whose name ends in ReadsPerGene.out.tab, as you might get from the STAR aligner.

5.8 years ago
ahaswer ▴ 150

If you are using Linux, you can also use paste and awk in the terminal, like so:

paste sample1 sample2 sample3 | awk 'BEGIN {OFS="\t"; FS="\t"}; {print $1, $2, $4, $6}' > count_matrix

This concatenates the given tables horizontally and then extracts the listed columns. It works for tab-delimited files; if your delimiter is different, set it in "OFS" and "FS". The column numbers above assume two columns per file (gene ID and count). With four columns per file, the second column of each sample lands in fields 2, 6 and 10 instead.
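To see what the command does, here is a minimal sketch on toy data. The file names and contents (sample1, sample2, sample3 with genes geneA and geneB) are made up for illustration, and each file is assumed to have exactly two tab-delimited columns.

```shell
# Toy data: three hypothetical two-column, tab-delimited files (gene ID, count).
workdir=$(mktemp -d)
cd "$workdir"
printf 'geneA\t297\ngeneB\t5\n' > sample1
printf 'geneA\t0\ngeneB\t12\n'  > sample2
printf 'geneA\t0\ngeneB\t7\n'   > sample3

# paste joins the files line by line, giving six fields per row;
# awk keeps the first gene ID and the count column of each file.
paste sample1 sample2 sample3 |
  awk 'BEGIN {FS = OFS = "\t"} {print $1, $2, $4, $6}' > count_matrix

cat count_matrix
```

Note that paste matches rows purely by line number, so this only gives a correct matrix if every file lists the same genes in the same order.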


Since you have a lot of samples (I guess you keep them in a separate directory), it is much more convenient not to list the desired columns by hand. Instead, open a terminal in the samples directory and run the following (assuming the directory contains only sample files):

paste * | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

It works like the command above, but without listing tons of columns by hand ;)
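The original question also asks for each column to carry its sample name. One possible sketch is to print a header line built from the file names before the pasted data. The `.txt` suffix, the toy files S1.txt and S2.txt, and the output name count_matrix.tsv are assumptions for illustration, and the files are again assumed to have two columns each.

```shell
# Toy data: two hypothetical two-column sample files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'geneA\t297\ngeneB\t5\n' > S1.txt
printf 'geneA\t0\ngeneB\t12\n'  > S2.txt

{
  # Header: "Gene ID", then each file name with the .txt suffix removed.
  # The glob expands in the same order here and in the paste call below,
  # so the labels line up with the columns.
  printf 'Gene ID'
  for f in *.txt; do printf '\t%s' "${f%.txt}"; done
  printf '\n'
  # Body: gene ID plus every second field, as in the command above.
  paste *.txt |
    awk 'BEGIN {FS = OFS = "\t"} {j = $1; for (i = 2; i <= NF; i += 2) {j = j FS $i} print j}'
} > count_matrix.tsv

cat count_matrix.tsv
```

Writing the result to a name that the glob does not match (here `.tsv` versus `*.txt`) keeps a second run from pasting the previous output in as if it were a sample. With about 3200 files, paste may also run into the per-process open-file limit (see `ulimit -n`); I have not tested that case.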


This does use all the samples, thank you. How does the loop over i work? When I ran this code, count_matrix contained two columns from each sample rather than one.

Thanks


Maybe you have duplicates inside the directory, or ran the code twice? Also check that the columns inside the sample files are separated by a single delimiter, not by repeated ones. You can also use the paste command selectively if, for instance, all sample files have the same extension:

paste *.txt | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

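The loop: "i" starts at the second field and, as long as it is less than or equal to the number of fields (i.e. columns), increases by 2 after each iteration. Each iteration appends field "i" to "j", preceded by the field separator. A step of 2 assumes two columns per file. If each file has four columns, as in your description, a step of 2 picks up two columns per sample, so use i+=4 instead.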
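The effect of the loop step can be checked on toy data. This is a sketch under assumptions: the files s1.txt and s2.txt are invented four-column files, and `pick_second` is a hypothetical helper that passes the step to awk as the variable `ncol`.

```shell
# Toy data: two hypothetical four-column files (gene ID plus three count columns).
workdir=$(mktemp -d)
cd "$workdir"
printf 'geneA\t10\t11\t12\n' > s1.txt
printf 'geneA\t20\t21\t22\n' > s2.txt

# pick_second STEP: paste the files and keep field 1 plus every STEP-th field
# starting at field 2. STEP should equal the number of columns per file.
pick_second() {
  paste s1.txt s2.txt |
    awk -v ncol="$1" 'BEGIN {FS = OFS = "\t"} {j = $1; for (i = 2; i <= NF; i += ncol) {j = j FS $i} print j}'
}

pick_second 2   # prints: geneA 10 12 20 22 (two columns per sample)
pick_second 4   # prints: geneA 10 20       (one column per sample)
```

With a step of 2 the loop visits fields 2, 4, 6 and 8, which are the second and fourth columns of each four-column file. With a step of 4 it visits fields 2 and 6, the second column of each file.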
