Hi all,
I am a molecular biologist in the first place, but came into touch with bioinformatic challenges more often recently. For simple tasks I've been using command line blast+ already and now I am wondering whether it is possible to do the following with blast+:
Let's say I have a random protein sequence in a file named RP1.faa and I want to check each bacterial strain in the refseq nr database on whether RP1 ist present or absent by an e-value threshold of e.g. 1e-05. And I want a table which shows me only the best hit for each strain in the first column and e.g. e-value, bit-score and so on in the following columns. And in each row I want the next strain to be shown, even if no blast hit was found In the end I am imagining a matrix which shows me whether a certain protein is present or absent.
Do you know if it is possible? Thanks a lot!
You can show the presence of each straing by, doing a tab output with
-outfmt 6
, and provide the e-value threshold to the blast command, as wells as use-max_target_seqs 1
to show the best alignment for each strain. What you will have to do is, conclude how against many/which strains RP1 is absent. Someone may have a better approach.In addition you will have to filter/restrict the results for bacteria (in theory, taxid:2).
I am trying to obtain a similar output but using the rpsblast function in the command line. Do you know if there is a way to not only display just the best hit but also only "specific hits" as is given as a result in the web-based version?
Hey, thanks for the fast answers. But unfortunately it is not yet what I was looking for. When using
-max_target_seqs 1
I'll get only one hit in total, not one hit for every entry in my database.Is the reference to
nr
or to your own sequence data? You can always flip the search around. It is not necessary that there will be a hit found for every sequence being searched in both of those cases.For now it still refers to the "nr" database. I'm am currently trying to subset it into a more precise and useful database