Question

How to keep the top hits only in the output file of hmmscan?

1

4.8 years ago

A_heath ▴ 170

Hi all,

I recently downloaded HMMER to run hmmscan locally on Linux against a Pfam database. It works great, but I find the output files hard to read quickly.

I tried output options such as --tblout, --domtblout and --pfamtblout, but the output files are still voluminous.

I would like to keep only the top hits in my output files.

I've seen that this is possible with hmmsearch, so I was wondering whether there is something similar for hmmscan. Ideally, I would like output similar to what I could find online: cf. here

If you have any suggestions, I'll gladly take them. Thank you in advance for your help!

hmmer hmmscan • 4.3k views

ADD COMMENT • link updated 6 months ago by SergFly ▴ 50 • written 4.8 years ago by A_heath ▴ 170

score 4 · Accepted Answer · 2020-08-06

4

4.8 years ago

A_heath ▴ 170

For anyone interested, I figured it out using this amazing resource: http://slhogle.github.io/2015/remove-duplicate-lines/ and the option --tblout of hmmscan.

I did:

hmmscan --tblout output_file.pfam Pfam-A.hmm seq_file.fasta

and then:

awk '!x[$3]++' output_file.pfam > MYBESTHITS.pfam

MYBESTHITS.pfam is basically what I got online, with the top hit for each protein sequence.
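
A small variant, as a sketch only: the --tblout file also contains comment lines starting with #, and some of them may survive the filter above because their third field is unique too. The function below skips those lines before keeping the first row per query name (column 3). The function name first_hit_per_query is made up for illustration and is not part of HMMER.

```shell
# Hypothetical helper (not part of HMMER): keep the first row per query
# from an hmmscan --tblout file, ignoring comment lines.
first_hit_per_query() {
    # $1: path to a --tblout file
    awk '
        /^#/ { next }      # drop header and footer comment lines
        !seen[$3]++        # print a row only the first time its query name (column 3) appears
    ' "$1"
}
```

It would be used in place of the second command, e.g. first_hit_per_query output_file.pfam > MYBESTHITS.pfam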

ADD COMMENT • link 4.8 years ago by A_heath ▴ 170

3

Hey! You mentioned you know how to find the top hits from hmmscan. Could you share how you do this?

ADD REPLY • link 4.0 years ago by niamhlacyroberts ▴ 30

2

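I'll leave a clarification, since the link doesn't work.

hmmscan --tblout writes the hits for each query sequence as a group, with the best-scoring hit at the top. awk '!x[$3]++' prints a line only the first time its third column (the query name) appears, so it keeps just that top hit for each sequence.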
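
If you would rather not depend on the row order at all, here is a rough sketch that picks the row with the highest full-sequence score for each query explicitly. It assumes the usual --tblout layout (column 3 = query name, column 6 = full-sequence score); the function name best_hit_per_query is made up for illustration.

```shell
# Hypothetical helper (not part of HMMER): for each query, keep the row
# with the highest full-sequence score, regardless of row order.
best_hit_per_query() {
    # $1: path to a --tblout file
    awk '
        /^#/ || NF < 6 { next }    # skip comment lines and incomplete rows
        # first row for this query: remember it and its position in the output order
        !($3 in score) { order[++n] = $3; score[$3] = $6 + 0; line[$3] = $0; next }
        # later row with a higher score: replace the stored row
        ($6 + 0) > score[$3] { score[$3] = $6 + 0; line[$3] = $0 }
        # print one row per query, in order of first appearance
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}
```

Usage would be the same as before: best_hit_per_query output_file.pfam > MYBESTHITS.pfam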

ADD REPLY • link 6 months ago by SergFly ▴ 50