Redirecting pHMMER output to /dev/null
2
0
Entering edit mode
5.6 years ago
Morgan S. ▴ 90

Hey guys,

Something that should be so simple has been so difficult for me to resolve the past few weeks. I am using pHMMER to search my ~12,000 fungal gene predictions against the MEROPS database. The issue is that for some of my gene queries, over a 1,000 hits will be returned, making it impossible to sort through all queries for their top hit. There are so many hits that not all of them can fit in one excel spreadsheet.

**Edit Here is a subset of what my table looks like. As you scroll down you can see that for just the first gene there are almost 1,000 hits.

Well after contacting the developer, he suggested redirecting the main output to /dev/null so that only the top hit of each query remains. He said the script should look like this

phmmer --tblout 1371E_merops5.tbl /work/Geomicrobiology/msobol/IODP_329_SPG/1371E14H2/maker/1371E_uni_snap.maker.output/1371E_uni_snap.renamed.maker.proteins.fasta /work/Geomicrobiology/msobol/databases/pepunit.lib > /dev/null; head -4 1371E_merops5.tbl

However, this still does not work, and the developer says it has to do with the command line, not the program. Does anyone here have experience with this???

Thanks in advance! Morgan

annotation protein HMMER gene genome • 3.1k views
ADD COMMENT
0
Entering edit mode

What exactly are you left with? What does your *.tbl file look like before and after head -4?

Its not clear if you're outputting the result to the .tbl or to the STDOUT (or both)?

ADD REPLY
0
Entering edit mode

I am just directing the output to .tbl only using --tblout

Before

After

ADD REPLY
0
Entering edit mode

Has anyone solved this issue. I am also getting multiple hits for the same id instead of directing for a single best hit. Now I have results in tbl format, want to keep only the best/top hit and delete other hits from the file. Please share your suggestions/script for the same.

ADD REPLY
2
Entering edit mode
5.6 years ago

You're only redirecting STDOUT, you also need to redirect STDERR. For POSIX-compatible shells, use:

phmmer ... >/dev/null 2>&1
ADD COMMENT
0
Entering edit mode

So would it look this? i.e do I have the 2>&1 in the correct position? Do I keep or remove the ; after null? (/dev/null; ....)

phmmer --tblout 1371E_merops5.tbl /work/Geomicrobiology/msobol/IODP_329_SPG/1371E14H2/maker/1371E_uni_snap.maker.output/1371E_uni_snap.renamed.maker.proteins.fasta /work/Geomicrobiology/msobol/databases/pepunit.lib > /dev/null 2>&1 head -4 1371E_merops5.tbl
ADD REPLY
2
Entering edit mode

Remove your head operation completely. That is independent of the output of phmmer in this case. It creates the file you need, so we're only interested in discarding the STDOUT.

It is typically placed at the end of the command, you can satisfy yourself about how STDERR and STDOUT etc work like so:

$ ( echo "I'm stderr" > /dev/stderr ) > /dev/null
I'm stderr
$ ( echo "I'm stderr" > /dev/stderr ) > /dev/null 2>&1

#  no output 
$ ( echo "I'm stdout" > /dev/stdout ) > /dev/null

# also no output

&1, and &2 are what is known as file descriptors, and are essentially shorthand for /dev/stdout and /dev/stderr. Since the redirection operator > by default only redirects the STDOUT, you need to tell it to also discard STDERR. The syntax 2>&1 basically says, "take the output thats coming from 2 and send it in to 1. Then when you redirect with >, the STDERR is there too.

Another way of illustrating this is to leave the STDOUT alone, and just redirect the STDERR, which you could do like so:

$ ( echo "I'm stderr" > /dev/stderr ) 2> /dev/null

# empty output

Notice the 2 before the > this time.

ADD REPLY
0
Entering edit mode

Thanks for your help, I understand now what you are saying about the differences in stdout and stderr but how does this script only give me the top hit for each of my queries? Wouldn't I need to add something like head -4, to tell it to throw everything out of the stderr that isn't the top hit?

When I used

/dev/null 2>&1

I still have over a 1,000 hits for many of my genes, picture of partial output Here is the exact script I used as well.

#!/bin/bash
#SBATCH -t 96:00:00 -o 1371E_phmmer6.out -J F1hmm
module load hmmer/3.1b2
phmmer -E 0.001 --tblout 1371E_merops6.tbl --cpu 20 /work/Geomicrobiology/msobol/IODP_329_SPG/1371E14H2/maker/1371E_uni_snap.maker.output/1371E_uni_snap.renamed.maker.proteins.fasta /work/Geomicrobiology/msobol/databases/pepunit.lib > /dev/null 2>&1
ADD REPLY
1
Entering edit mode

You're confusing the output of phmmer with your results file.

If I'm understanding correctly, you just want to keep the top hit from the .tbl file? In which case, you can choose to do whatever you like with the STDOUT from phmmer because it has no bearing on that file.

You just need to use head -n 1371E_merops6.tbl where n is how many lines/hits you want. You can do this as a separate command, or follow the phmmer call with the semicolon as you were doing before. The 2 commands are entirely independent of one another, you're just using the output file of one, as the input file to the other.

ADD REPLY
0
Entering edit mode

Yes, I thought that the .tbl and the output were the same file since I was using --tblout .

That is correct, I wanted the results file to only include the top hit from each gene. When I was contacting the developer, what he said/what I understood was to discard the main output to /dev/null, save a tabular output file instead, and take the first hit from the tabular output (first 4 lines, including 3 header lines) by doing

% `(phmmer --tblout out.tbl my.query my.db > /dev/null; head -4 out.tbl)`

so then the results .tbl file would only include the top hit for each gene.

So what you are saying I need to do is

phmmer --tblout out.tbl my.query my.db > /dev/null 2&>1; head -4 out.tbl

?

ADD REPLY
1
Entering edit mode

Yes that should work (guessing entirely off the way UNIX tools typically work though, not recent experience with HMMer).

When you use the ; syntax, those 2 commands are executed one after another, as if you typed them seperately in to the commandlind: command1 -some arg ; command2 -some -other args. In this case command1 and command2 are entirely independent, they're just called on the same line.

Your tbl file should remain the same whether you discard the STDOUT to /dev/null or not (I would expect), so you only need to redirect it if you don't want to see screen output. The file will be created all the same either way.

ADD REPLY
0
Entering edit mode

Unfortunately, it did not work. My tbl output still looks the same. :/ I am kind of at a lost...

ADD REPLY
1
Entering edit mode

Can you edit your question with what your output data looks like and what you want it to look like? I still don't really understand in what way the tbl file is incorrect.

ADD REPLY
0
Entering edit mode

I edited my original question with what the tbl file looks like in Excel. I want one hit for each gene, but instead I am getting several 100 to 1000 hits for each query.

ADD REPLY
1
Entering edit mode

Where did you get head -4 from? Your problem isn't with phmmer's output, you just need to filter the tbl file.

The /dev/null business has nothing to do with the .tbl file as I mentioned before, that purely affects what's printed to the screen.

If you want just the top hit from that file for instance, then you need only use head -n 2 | tail -n 1. If you want the top hit on a per-query basis, that requires some more heavy duty scripting.

ADD REPLY
0
Entering edit mode

The developer of hmmer recommended that I use head -4 in conjunction with /dev/null. Unfortunately, what I need is just the top hit per-query basis and I don't have the skills to create a script to do this. I wonder why hmmer does not have the option to output just one hit per a query like blast allows??

Well I thank you for your help, and I think I will just switch to blast to accomplish this task. Thanks!

ADD REPLY
2
Entering edit mode

A script to pull out the data you need wouldn't be very difficult to write (and solutions you could use for this definitely exist on this and other foums) and would be a good learning exercise, I would encourage you to try it...

ADD REPLY
0
Entering edit mode

I don't have much experience in using programming languages to create scripts, but I will see what I can find. Thanks for the advice!

ADD REPLY
0
Entering edit mode

You just need to break it down in to its 'component' conceptual tasks.

The task boils down to:

  • Loop over each row of the file
  • Check if that hit/gene ID has been seen before
    • If yes: output that line
    • If no: keep going

It's pretty much that simple.

(Assuming its already sorted by ID, and then by hit score).

ADD REPLY
0
Entering edit mode
21 months ago
Md ▴ 10

After running this command

phmmer -E 0.001 --tblout 1371E_merops5.tbl  query.faa pepunit.lib > /dev/null 2>&1

I sorted using awk command to get unique or single best hit

awk '!x[$3]++' 1371E_merops5.tbl > 1371E_merops5.txt

or if u want to filter out using e value

awk '$5>0.001{print $0}'  1371E_merops5.txt >1371E_merops5_sorted.txt

or if u want to filter out using e value >0.001 and score with 25 or higher

awk  '{ if(($5 <0.001) && ($6 >25)) { print } }' merops5.tbl >merops5.txt
ADD COMMENT

Login before adding your answer.

Traffic: 2490 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6