Filteration of 0 counts in all column of the htseq-count metrics and add pseudo counts in remaining columns after filteration ??
2
0
Entering edit mode
10.1 years ago

Dear all,

I would like filter out the row which have correspondence 0 counts in all samples of my count metrics. Then also would like to add some integers (pseudo counts) in remaining columns. This is the counts from htseq and II merged into big count metrics. The main problem is that I have about 36 columns+ 1 id and I cant wrap this code what i have done for 6 samples for my another project.

cat merged.htseq | \
  awk -F "\t" '{
    if ($2>0 || $3>0 || $4>0 || $5>0 || $6>0 || $7>0 )
      print $1"\t"$2+1"\t" $3+1"\t"$4+1"\t"$5+1"\t"$6+1"\t"$7+1"\t" 
  }' > final_matrix_nonzero_1pseudoCounts.txt

for example:

id          c1   c2   c3   t1  t2  t3
gene1     0    0     1     0   0   1
gene2     0   0     0     0    0   0  #(should be removed; gene 2 rows; because all columns have 0 in all sample)
gene3     1   1     23   45   5   0

then add 1 in remaining matrix (final matrix)

     id          c1   c2   c3   t1  t2  t3
gene1          1    1     2     1   1   2
gene3          2   2     24   46   6   1

Thanks for help

htseq-counts RNA-seq • 3.8k views
ADD COMMENT
1
Entering edit mode
10.1 years ago

In R it's pretty simple:

A <- A[rowSums(A)>0]
A <- A+1

And if you want to use it from terminal, save this script in script.R and use Rscript. It should work

args <- commandArgs(trailingOnly = TRUE)
A <- read.table(args[1],sep="\t",as.is=T)
A <- A[rowSums(A)>0]
A <- A+1
write.table(A,args[2],sep="\t")
Rscript script.R in.txt out.txt
ADD COMMENT
0
Entering edit mode

Thank you so much but II have ID in my column and getting error:

Error in rowSums(A) : 'x' must be numeric
ADD REPLY
0
Entering edit mode

maybe try

A <- read.table(args[1],sep="\t",as.is=T,row.names=1,header=T)
ADD REPLY
0
Entering edit mode
10.1 years ago
george.ry ★ 1.2k

I've actually been doing a similar thing recently myself (without the pseudo-counts, anyway). This is what I ended up with as a shell script - now tweaked a bit for your pseudocounts:

cat *.counts \
| grep -v "^_" \
| datamash -s -g 1 sum 2 \
| awk '{if ($2>0) print $1}' \
> tmp

for f in $(ls *.counts); do
    f_="${f%.counts}"
    cat "$f" \
    | LC_ALL=C grep -F -f tmp \
    | awk '{print $1"\t"$2+1}' \
    > "$f_".pseudo.counts
done

rm tmp

This works with the original count files from htseq-count rather than the table you've setup but you can hack it it you want. First part sets up a tmp file of the gene/transcript IDs that have expression in one or more of the samples. You then grep for that list within the files and feed it through awk for your pseudocounts. LC_ALL=C converts the query from UTF-8 to ASCII, which is many times faster for large numbers of queries in -f.

Only thing you might not have there is datamash, but you should have it in your life anyway, so download it :p

Should add that this was from memory, as I'm not on my workstation, so check it works!

ADD COMMENT
0
Entering edit mode

Thank you very much for your kind reply and help...but sorry I don't have datamash, however, I tried to install it but needed sudo password what I don't have right now, may be I will ask to someone but late, therefore could you help me without using datamash? Moreover, yes my table matrix is indeed original counts by htseq-counts what I merged my all 36 samples into one big matrix and already removed last 5 rows. The table shows above is just a fake example in order to understand my query.

Thanks once again....

ADD REPLY
0
Entering edit mode

The (relative) simplicity of that really is dependent on datamash. Having datamash around has removed about 90% of the little Python scripts I used to write for this sort of thing. Perhaps just download and build from source and get the sudo password later. You can just skip the sudo make install and use a full pathname to the datamash program within the directory where you built it.

ADD REPLY
0
Entering edit mode

Dear george.ry,

Your advised first code is working file...i have given my merged.matrix in place of *.counts and extracted list of genes into tmp.

cat all.mer | grep -v "^_" | datamash -g 1 sum 2 | awk '{if ($2>0) print $1}' > tmp

but rest part is not working ??? syntax error

for f in $(ls *.mer)
do
    f_="${f%.mer}"
    cat "$f"| LC_ALL=C grep -F -f tmp | awk '{if ($2>0) print $1"\t"$2 else print $1"\t"$2+1}' > "$f_".pseudo.counts
done

needed your further help, please ??

Indeed I also tired with htseq native format...and again got syntax error

cat *.txt | grep -v "^_" | datamash -g 1 sum 2 | awk '{if ($2>0) print $1}' > tmp

but rest part is not working ??? syntax error

for f in $(ls *.txt)
do
    f_="${f%.mer}"
    cat "$f"| LC_ALL=C grep -F -f tmp | awk '{if ($2>0) print $1"\t"$2 else print $1"\t"$2+1}' > "$f_".pseudo.counts
done

Needed your further help, please

ADD REPLY
0
Entering edit mode

I did say I was doing it from memory and to check! I'm on a Windows laptop here! Anyway, looking over it again, I missed the semi-colons in the awk if-else statement. It should be:

awk '{if ($2>0) print $1"\t"$2; else print $1"\t"$2+1;}'

As I said, that's designed to work on the original files from htseq-count. You would need to hack it substantially to make it work with your tabular format.

ADD REPLY
0
Entering edit mode

Thank you its work fine but I need to add plus 1 in all columns either they have 0 or whatever numbers . Hope you understand what I mean? Please sir look my example gene1 and gene3 extend the number while adding 1.

for exp:

id          c1   c2   c3   t1  t2  t3
gene1     0    0     1     0   0   1
gene2     0   0     0     0    0   0        (should be removed; gene 2 rows; bcz all columns have 0 in all sample)
gene3     1   1     23   45   5   0

then add 1 in remaining matrix (final matrix)

     id          c1   c2   c3   t1  t2  t3
gene1          1    1     2     1   1   2
gene3          2   2     24   46   6   1

like what I have done here....highlighted

cat merged.htseq | awk -F "\t" '{if ($2>0 || $3>0 || $4>0 || $5>0 || $6>0 || $7>0 )print $1"\t"$2+1"\t" $3+1"\t"$4+1"\t"$5+1"\t"$6+1"\t"$7+1"\t" }' > final_matrix_nonzero_1pseudoCounts.txt
                                                                                   ^___________________________________________________________^
ADD REPLY
0
Entering edit mode

Sorry I assumed you'd written the awk command you posted originally, so were familiar with awk. That makes the if-else irrelevant, so just use a simple print statement.

I've updated the code in my original post, as I also added a sort step into the datamash command. Copy the entire thing and replace what you have.

ADD REPLY
0
Entering edit mode

Hi george.ry

I tried again but its not working well. I checked manually and found that code left some 0's counts and pick to add pseudo counts.

ADD REPLY

Login before adding your answer.

Traffic: 2505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6