Extracting specific hits from blast output. How to automate this process for multiple sequences
1
0
Entering edit mode
5.7 years ago

I am trying to get specific hits from a blast output. The following are the steps that I performed (this was done on web blast):

  1. Paste aa input sequence to blast and run it with default parameters on nr database
  2. Choose the new blast result page so that I can set some filters
  3. Set filter- identity from 50%-90%
  4. Select sequences with Query coverage of 70%
  5. Download all filtered hits in fasta/csv format.

So now, I have all blast hits for the input sequence in fasta format. However, I have more than 100 different sequences and would like to perform the same steps on each of them.

I could use blast+ but I am not sure if I can set the filters mentioned above in the commandline for blastp. I could not find these filters in >blastp -help. I also checked if Biopython can do this for me but I could not find useful links. Also could not find useful link in biostars or biology.stackexchange.com. The most related one that I found was here.

I am thinking that, after I get the output, I would need to have some script that will do this for me. I would really appreciate if someone can guide me in that right direction. Please forgive me if the question is not comprehensible. I can explain further if required. Thank you.


Following is a sample of the sequences that I have:

>sp|Q03155|AIDA_ECOLX AIDA-I autotransporter OS=Escherichia coli OX=562 GN=aidA-I PE=1 SV=1
MNKAYSIIWSHSRQAWIVASELARGHGFVLAKNTLLVLAVVSTIGNAFAVNISGTVSSGG
TVSSGETQIVYSGRGNSNATVNSGGTQIVNNGGKTTATTVNSSGSQNVGTSGATISTIVN
SGGIQRVSSGGVASATNLSGGAQNIYNLGHASNTVIFSGGNQTIFSGGITDSTNISSGGQ
QRVSSGGVASNTTINSSGAQNILSEEGAISTHISSGGNQYISAGANATETIVNSGGFQRV
NSGAVATGTVLSGGTQNVSSGGSAISTSVYNSGVQTVFAGATVTDTTVNSGGNQNISSGG
IVSETTVNVSGTQNIYSGGSALSANIKGSQIVNSEGTAINTLVSDGGYQHIRNGGIASGT
IVNQSGYVNISSGGYAESTIINSGGTLRVLSDGYARGTILNNSGRENVSNGGVSYNAMIN
TGGNQYIYSDGEATAAIVNTSGFQRINSGGTAPVQNSVVVTRTVSSAAKPFDAEVYSGGK
QTVYLWRGIWYSNFLTAVWSMFPGTASGANVNLSGRLNAFAGNVVGTILNQEGRQYVYSG
ATATSTVGNNEGREYVLSGGITDGTVLNSGGLQAVSSGGKASATVINEGGAQFVYDGGQV
TGTNIKNGGTIRVDSGASALNIALSSGGNLFTSTGATLPELTTMAALSVSQNHASNIVLE
NGGLLRVTSGGTATDTTVNSAGRLRIDDGGTINGTTTINADGIVAGTNIQNDGNFILNLA
ENYDFETELSGSGVLVKDNTGIMTYAGTLTQAQGVNVKNGGIIFDSAVVNADMAVNQNAY
INISDQATINGSVNNNGSIVINNSIINGNITNDADLSFGTAKLLSATVNGSLVNNKNIIL
NPTKESAGNTLTVSNYTGTPGSVISLGGVLEGDNSLTDRLVVKGNTSGQSDIVYVNEDGS
GGQTRDGINIISVEGNSDAEFSLKNRVVAGAYDYTLQKGNESGTDNKGWYLTSHLPTSDT
RQYRPENGSYATNMALANSLFLMDLNERKQFRAMSDNTQPESASVWMKITGGISSGKLND
GQNKTTTNQFINQLGGDIYKFHAEQLGDFTLGIMGGYANAKGKTINYTSNKAARNTLDGY
SVGVYGTWYQNGENATGLFAETWMQYNWFNASVKGDGLEEEKYNLNGLTASAGGGYNLNV
HTWTSPEGITGEFWLQPHLQAVWMGVTPDTHQEDNGTVVQGAGKNNIQTKAGIRASWKVK
STLDKDTGRRFRPYIEANWIHNTHEFGVKMSDDSQLLSGSRNQGEIKTGIEGVITQNLSV
NGGVAYQAGGHGSNAISGALGIKYSF

>sp|P86223|VDAC2_MESAU Voltage-dependent anion-selective channel protein 2 (Fragments) OS=Mesocricetus auratus OX=10036 GN=VDAC2 PE=1 SV=1
DIFNKGFGFGLVKYKWCEYGLTFTEKLTFDTTFSPNTGKKSNFAVGYRTGDFQLHTNVNN
GTEFGGSIYQKVCEDFDTSVNLAWTSGTNCTRVNNSSLIGVGYTQTLRPGVKLTLSALVD
GK

>sp|P64744|SMASE_MYCBO Sphingomyelinase OS=Mycobacterium bovis (strain ATCC BAA-935 / AF2122/97) OX=233413 GN=BQ2027_MB0912 PE=3 SV=1
MDYAKRIGQVGALAVVLGVGAAVTTHAIGSAAPTDPSSSSTDSPVDACSPLGGSASSLAA
IPGASVPQVGVRQVDPGSIPDDLLNALIDFLAAVRNGLVPIIENRTPVANPQQVSVPEGG
TVGPVRFDACDPDGNRMTFAVRERGAPGGPQHGIVTVDQRTASFIYTADPGFVGTDTFSV
NVSDDTSLHVHGLAGYLGPFHGHDDVATVTVFVGNTPTDTISGDFSMLTYNIAGLPFPLS
SAILPRFFYTKEIGKRLNAYYVANVQEDFAYHQFLIKKSKMPSQTPPEPPTLLWPIGVPF
SDGLNTLSEFKVQRLDRQTWYECTSDNCLTLKGFTYSQMRLPGGDTVDVYNLHTNTGGGP
TTNANLAQVANYIQQNSAGRAVIVTGDFNARYSDDQSALLQFAQVNGLTDAWVQVEHGPT
TPPFAPTCMVGNECELLDKIFYRSGQGVTLQAVSYGNEAPKFFNSKGEPLSDHSPAVVGF
HYVADNVAVR

This is a portion of the output that I have for the first sequenceblastp output for UniprotID Q03155

blast blastp blastoutput blasthits querycoverage • 2.7k views
ADD COMMENT
1
Entering edit mode

The most related one that I found was here.

The URL of "here" is linking to your blast output example picture.

ADD REPLY
2
Entering edit mode
5.6 years ago
AK ★ 2.2k

Perhaps you could add an extra field (qcovs) in the tabular output format and then filter the results:

blastp -db <database_name> -query <input_file> \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs" \
  > BLAST.out

The results can be filtered by (assuming you meant "no less than" 70%):

$ awk '$3>=50 && $3<=90 && $13>=70 {print $0}' BLAST.out | head
sp|Q03155|AIDA_ECOLX    gi|1001813006|ref|WP_061342166.1|   88.423  1287    148 1   1   1286    1   1287    0.0 2191    100
sp|Q03155|AIDA_ECOLX    gi|835829037|ref|WP_047671677.1|    88.423  1287    148 1   1   1286    1   1287    0.0 2191    100
sp|Q03155|AIDA_ECOLX    gi|742917718|ref|WP_039023028.1|    87.335  1287    162 1   1   1286    1   1287    0.0 2167    100
sp|Q03155|AIDA_ECOLX    gi|1423315803|ref|WP_112015431.1|   87.257  1287    163 1   1   1286    1   1287    0.0 2164    100
sp|Q03155|AIDA_ECOLX    gi|928847386|ref|WP_053897362.1|    86.946  1287    167 1   1   1286    1   1287    0.0 2159    100
sp|Q03155|AIDA_ECOLX    gi|1257025276|ref|WP_097478690.1|   86.558  1287    172 1   1   1286    1   1287    0.0 2125    100
sp|Q03155|AIDA_ECOLX    gi|1253042654|ref|WP_096924401.1|   86.402  1287    174 1   1   1286    1   1287    0.0 2124    100
sp|Q03155|AIDA_ECOLX    gi|934387324|emb|CTY45944.1|    85.315  1287    188 1   1   1286    1   1287    0.0 2105    100
sp|Q03155|AIDA_ECOLX    gi|1423661614|ref|WP_112076308.1|   76.318  1309    264 8   1   1286    1   1286    0.0 1856    100
sp|Q03155|AIDA_ECOLX    gi|1245792053|ref|WP_096100225.1|   76.318  1309    264 8   1   1286    1   1286    0.0 1856    100
ADD COMMENT
0
Entering edit mode

Perfect, this is what I wanted. Yes, above 70% coverage. Thank you very much SMK

ADD REPLY
0
Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLY

Login before adding your answer.

Traffic: 2306 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6