Thank you very much for actually writing a ruby script!
Unfortunately it gives me a nil error on line 9: "undefined method `[]' for nil:NilClass (NoMethodError)". This probably has to do with the file I get from the Perl script rather than with your code. I don't know how to fix it, but many thanks anyway!
You are welcome. The script will work for data that is formatted as in my example (SW:X2:, etc). If you have additional lines at the top of your input or at the very end (for example, a newline at the end of the input), then the script breaks. It would be good to post an example of your input next time, which you can format correctly by putting it in a separate paragraph and indenting each line by four spaces.
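To illustrate the failure mode described above, here is a minimal Python sketch (not the Ruby script from this thread) of a reader that skips blank lines and reports lines it cannot parse instead of crashing on them. The `HEADER` pattern and the `read_ids` function are hypothetical: they assume each record is an identifier followed by a colon, which may not match what the Perl script actually writes.

```python
import re

# Hypothetical record pattern: an identifier followed by a colon, e.g. "SW:X2:".
# The real format depends on the Perl script's output, which is not shown here.
HEADER = re.compile(r"^(?P<id>\S+):$")


def read_ids(lines):
    """Return (ids, skipped) where skipped holds 1-based numbers of unparsable lines."""
    ids, skipped = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            # Blank lines (including a trailing newline) are ignored silently.
            continue
        match = HEADER.match(text)
        if match is None:
            # A failed match returns None; record the line instead of indexing into it.
            skipped.append(number)
            continue
        ids.append(match.group("id"))
    return ids, skipped


if __name__ == "__main__":
    sample = ["SW:X2:", "", "garbage line", "tr|H6LSH4|H6LSH4_STAAU:", "\n"]
    print(read_ids(sample))
```

Reporting the skipped line numbers makes it easier to see which part of the input differs from the expected format.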
You should write a test case. Write an example input file and an example of the output that you want to get, and then post both here.
Do you have the gene name in the fasta headers, or a mapping from the fasta identifiers to gene names?
No, I don't have the gene name in the fasta headers. I don't think so anyway.
I use the blastall -p blastx command to run reads against a protein sequence database I made with formatdb. The BLAST output file I get lists all the hits for each protein, with the protein name in the header along with the gene name as, for example, GN=mecA.
The Perl script, which I could not have written myself (not within this year anyway), sorts the hits like this:
tr|H6LSH4|H6LSH4_STAAU:
but it would be better to have them sorted according to GN=mecA or similar.
I just wondered if anyone knew if there was a script that did that.
:)
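As a rough idea of what grouping by gene name could look like, here is a minimal Python sketch. It assumes each hit header contains a GN=... tag as described above. The function names `gene_name` and `group_by_gene` are hypothetical, and the headers in the usage example are made up for illustration rather than taken from real BLAST output.

```python
import re
from collections import defaultdict

# Captures the value of a "GN=" tag, e.g. "mecA" from "... GN=mecA PE=4".
GENE = re.compile(r"\bGN=(\S+)")


def gene_name(header):
    """Return the gene name from a header, or None if there is no GN= tag."""
    match = GENE.search(header)
    return match.group(1) if match else None


def group_by_gene(headers):
    """Group headers by gene name; headers without a GN= tag go under 'unknown'."""
    groups = defaultdict(list)
    for header in headers:
        groups[gene_name(header) or "unknown"].append(header)
    # Sort by gene name (case-sensitive) so the output order is stable.
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    # Made-up headers for illustration only.
    hits = [
        "tr|AAAAA1|AAAAA1_STAAU Example protein OS=Staphylococcus aureus GN=mecA PE=4 SV=1",
        "sp|BBBBB2|BBBBB2_STAAU Example protein GN=blaZ",
        "tr|CCCCC3|CCCCC3_STAAU Another example GN=mecA",
        "tr|DDDDD4|DDDDD4_STAAU Header without a gene tag",
    ]
    for gene, members in group_by_gene(hits).items():
        print(gene, len(members))
```

Headers without a gene tag are kept under "unknown" rather than dropped, so no hits are lost silently.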
I agree with Giovanni's comment, because then it is easy to figure out what you need.