htseq-counts output merge into one matrix ??
8
3
Entering edit mode
10.1 years ago

Dear all,

I just need a little help to merge my all features counts into one matrix. I have counted features using htseq-counts and now want to merge into one file like....

ID c1 c2 c3..............t1 t2 t2.......etc

The problem is that I have 36 sample and corresponding counts so I am lazy to write ...

paste *.txt | cut -d -f1,2,4,6,8...............................72 #(because each file contain two column i.e ID and counts)

and moreover, apart from laziness, one problem also happened here that while pasting its not in sequence like:

paste 1.txt 2.txt 3.txt............

Help me please...

rna-seq • 24k views
ADD COMMENT
2
Entering edit mode

Finally got a solution to this problem

#!/bin/bash
FILES=$(ls -t -v *.txt | tr '\n' ' ');
awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' $FILES > merged.tmp

This woks like you asked.

ADD REPLY
0
Entering edit mode

Thank you man for your kind reply but I have already done, I am sorry for the delay. Was just needed to change the file name like 1.txt -> 01.txt, 2.txt -> 02.txt....................... etc and after I used your suggested awk code and worked fine.

By the way thanks a million.

ADD REPLY
3
Entering edit mode
10.1 years ago
EagleEye 7.6k

This could be the simple way for as many as files:

awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' read_counts/*.gene_counts.htseq > merged.htseq
ADD COMMENT
1
Entering edit mode

This outputs: Merge htseq files into one based on geneNames. Sample output

ENSG00000251243.1    0    0    0    0    0
ENSG00000240637.2    0    0    0    0    0
ENSG00000227366.1    0    0    0    0    0
ENSG00000104903.4    665    216    884    3254    689
ENSG00000228626.1    0    0    0    0    1

Forgot to mention, the order of pasted column will be in FileName order.

Like if you do listing using unix command ls -l read_counts/

You will get the order in which it has merged.

ADD REPLY
0
Entering edit mode

Dear Santhilal,

Your code is working as mine too...but the main problem what I mentioned on question is.......ITS NOT IN SEQUENCE.

I NEED...

ID     C1 C1 C3.............UPTO 36

your code or mine work as terminal shows the directory by hit of ls

10.txt
11.txt
31.txt
12.txt
13.txt
4.txt
3.txt
1.txt
20.txt
...
....
....
36.txt

However, I tried by

for I in $(seq 1 36); do awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' $i.txt > merged.htseq; done

but its not working?

ADD REPLY
0
Entering edit mode

you can try this

ls -l -v | awk '{print $9}'

and then AWK. I don't know whether it will work. But you can try it.

ADD REPLY
0
Entering edit mode

Thanks but while using this only its work but not with advised final awk code.

I used...but not worked for me.

ls -l -v | awk '{print $9}' | awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' *.txt > merged.htseq

however, I have done by my way

Thanks anyway....very kind of you. :)

by the way

Wishing you a very happy and prosperous Diwali!!

ADD REPLY
0
Entering edit mode

Could please provide your working solution?

ADD REPLY
3
Entering edit mode
6.5 years ago
Arindam Ghosh ▴ 530

I managed to do this by R. The script is available here. It works for me. Please try and check if it works fine.

ADD COMMENT
1
Entering edit mode
10.1 years ago

I wrote this a while ago. It can also change the file labels.

Note that it's so easy to just do this in R that there's really little reason to manually merge files like this.

ADD COMMENT
1
Entering edit mode

and if you use DESeq2 you can simply give the file names in input ( see DESeq2 vignette 1.2.4 HTseq input )

ADD REPLY
1
Entering edit mode

Exactly, and even if you're using something like edgeR, it's simple enough to just use the same method employed by DESeq2 (namely, lapply() reading files and then lapply a function to get the counts out).

ADD REPLY
1
Entering edit mode

i prefer first by terminal or bash.

Thanks by the for your valuable answer.

ADD REPLY
0
Entering edit mode
8.3 years ago
bmahesh07 • 0

I found a useful R script (htseq-combine_all.R script) for merging HT-seq counts at the following resource http://wiki.bits.vib.be/index.php/NGS_RNASeq_DE_Exercise.4

ADD COMMENT
0
Entering edit mode

I tried the the above R script (htseq-combine_all.R script) but I ran into an error

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
5.8 years ago

Is it considered best practice to sum up reads across replicates or to run HTSeq on merged BAM File after ccombing fastqs. Thanks Adrian

ADD COMMENT
1
Entering edit mode

No, you should use hts-seq on each replicates otherwise there is no meaning of replicates. By the way, you could use "featureCounts tool" for each bam at a time. Then, further you don't need to merge as its give you a matrix of count for each sample of bam.

Hope this helps Good luck.

ADD REPLY
0
Entering edit mode
5.7 years ago
Ahmed Alhendi ▴ 240

This is easily done using R script! You can use my R script on https://github.com/AAlhendi1707/htseq-merge/blob/master/htseq-merge_all.R

For more details about how to generate and combine HTSeq outputs into one read count matrix in R https://ahmedalhendi0.wordpress.com/2019/03/20/combine-htseq-outputs-into-one-read-count-matrix-in-r/

ADD COMMENT
0
Entering edit mode

Hello, I followed your script and ran it on Linux, but I got an error message. My input files were in the same directory where I run R script, but it said cant read files. How can I fix it?

Thx .

ADD REPLY
0
Entering edit mode

Recheck on path and file format.

ADD REPLY
0
Entering edit mode

Hello Dr.Alhendi, I am totally new to this bioinformatics, so please bear with my stupidity here, if any. Thanks. I did check again. Here is how I used the Rscript. 1. I could not use it on Mac OS terminal. 2. I cant use it on Mac OS R. 3. I tried to use it on Linux cluster as follows.

`Rscript htseq-merge_all.R /kuacc/users/akashgari19/HTSeq  HTSeq_Merged.txt`

R script is located on the same directory. My HTSeq out files are located on this directory:

`/kuacc/users/akashgari19/HTSeq`

My files are named below : CON-1_HTSeq.txt treatment-1_HTSeq.txt ....... total 10 files.

`head CON-1_HTSeq.txt.                                                                     ENSMUSG00000000001   3141 
 ENSMUSG00000000003 0.                                                                  ENSMUSG00000000028  854

ENSMUSG00000000031 822918 ENSMUSG00000000037 1 ENSMUSG00000000049 3 ENSMUSG00000000056 3897 ENSMUSG00000000058 1399 ENSMUSG00000000078 1766 ENSMUSG00000000085 706` When I run the script I got the following errors. [1] "############################" [1] "## Files to be merged are: ##" "CON-1_HTSeq.txt" "CON-2_HTSeq.txt" ........

file read START ######### Aug_12_2020_13_18_31_+03" Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file '/kuacc/users/akashgari19/HTSeq/CON-1_HTSeq.cnt': No such file or directory. Execution halted

Yes, I do not have files with .cnt extension.
Please help me, how can I solve this. Thanks.

ADD REPLY
0
Entering edit mode
4.5 years ago
Alex Nesmelov ▴ 200

I decided to revive the thread adding my short tidiverse-based solution in R. Assuming we are in R session in the directory with htseq-counts files:

library(tidyverse)
counts_files_pattern = "\\.counts\\.txt$"
counts_table <-
  list.files(pattern = counts_files_pattern) %>%
  #apply an import function to the listed files
  map(function(x) { 
        print(x)
        read_table2(x,
                #rename second column in accordance with the file name
                col_names = c("ID", x))
      }
      ) %>%
  #merge all tables into one
  reduce(left_join, by = "ID") %>% 
  # clean up column names
  rename_all(gsub,
             pattern = counts_files_pattern,
             replacement = '') %>%
  #move ID column to row names
  column_to_rownames("ID")
ADD COMMENT
0
Entering edit mode

Hello, I used this code on R , but got an error. I ran the script on the same directory I had my count files. Got the following error. Error: .x is empty, and no .init supplied.

Can you help me? Thanks.

ADD REPLY
0
Entering edit mode
3.2 years ago

This is the shortest code I can possibly write. It removes those feature descriptions of HTSeq counts and imports a neat and clean matrix from combining all the samples.

library(dplyr)
path <- "C:/Users/Rohit Satyam/Desktop/RNASeq workshop/backup/"
files <- list.files(path,".txt", full.names = T)
## making a list of data frames with row and column names while removing last five lines from HTSeq that contains the features and ambiguous read infromation
dfList <- lapply(files, function(x) {
  read.csv(x, nrows = length(count.fields(x))-5, sep = "\t", header = F, row.names = 1, col.names = c("genes",tools::file_path_sans_ext(basename(x))))})
## combining all the dataframes into single dataframe
matrix <- bind_cols(dfList)
ADD COMMENT

Login before adding your answer.

Traffic: 2197 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6