I am doing a BLAST search on a very repetitive transposon sequence on the maize B73 genome and need to report the top 50 alignments. I get back >100,000 alignments because each chromosome is considered one accession and all of the alignments on that chromosome are returned. I am using the BLAST command max_hsps to limit the number of alignments returned but I need a script or command to parse those results so that I can report only the top 50 alignments, sorted by bit score not by chromosome. Has anyone done this before? I need the actual alignments sorted not the results table.
I need the actual alignments sorted not the results table
The results and the alignments are one and the same so it's really unclear how you plan to only sort the alignments without parsing the alignment results table, especially if you want to sort by bit score.
There are tools/packages available for working with standard blast-formated output, for example BioPerl and BioPython.
You also have options for customizing the output of blast into tab-delimited results (see Blast - Formatting Output for some examples) that may make it easier to parse.
If you don't want to bring in another entire module, I have a helper script called blast_to_df.py that formats all the results available here. General use the two main ways of using it (as a script proper on the command line or as main an imported function) are illustrated thoroughly in a series of Jupyter notebooks that you can launch in the browser with no installation by going here and pressing 'launch binder'. When the session spins up, a list of available notebooks will come up. You probably could probably skip to the second notebook as the first is more a general overview and you seem familiar with the various issues. Sorting is demonstrated in those, too.
Because is a simpler, concise script, it may be easily adapted to meet your needs if it isn't covering it.
It is highly not recommended to use max_hsps to reduce the number of matches since BLAST will stop searching after reaching this limit. It will not return the best results; see this paper.Edit: That was wrong, I've to step back. That was related to max_target_seqs .Sorry.
If you are using Linux, you can sort the table with sort. First properly, it makes sense to use an easier readable format with BLAST, such as:
That's sorts according to the first column (accession), then the percentage identity, and then the highest bitscore (and remove all matches for the same accession with a worse match).
Sorry if I wasn't clear. I need both alignments and resort table sorted. Thanks for your response. I will check these out.
If you don't want to bring in another entire module, I have a helper script called
blast_to_df.py
that formats all the results available here. General use the two main ways of using it (as a script proper on the command line or as main an imported function) are illustrated thoroughly in a series of Jupyter notebooks that you can launch in the browser with no installation by going here and pressing 'launch binder'. When the session spins up, a list of available notebooks will come up. You probably could probably skip to the second notebook as the first is more a general overview and you seem familiar with the various issues. Sorting is demonstrated in those, too.Because is a simpler, concise script, it may be easily adapted to meet your needs if it isn't covering it.