Parsing BLAST pairwise output in R
7.9 years ago
rubic ▴ 270

Hi,

Does anyone know of an R package that parses the pairwise-format output of NCBI's blastp? It looks like this:

BLASTP 2.4.0+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.



Database: mm10 ensembl AA
           61,440 sequences; 26,382,989 total letters



Query= NP_001254519.1 mu-type opioid receptor [Heterocephalus glaber]

Length=400
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  ENSMUSP00000101232 ENSMUST00000105607;ENSMUSG00000000766;Oprm1      753     0.0
  ENSMUSP00000090410 ENSMUST00000092734;ENSMUSG00000000766;Oprm1      753     0.0
  ENSMUSP00000060590 ENSMUST00000056385;ENSMUSG00000000766;Oprm1      753     0.0
  ENSMUSP00000077704 ENSMUST00000078634;ENSMUSG00000000766;Oprm1      731     0.0

. . .

> ENSMUSP00000101232 ENSMUST00000105607;ENSMUSG00000000766;Oprm1
Length=398

 Score = 753 bits (1943),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 374/400 (94%), Positives = 383/400 (96%), Gaps = 2/400 (1%)

Query  1    MDSSVLPGNAGNCTDPFAQSSCSLAPSPGSWTNLSHLDGNLSDPCGPNRTELGGSDSRCP  60
            MDSS  PGN  +C+DP A +SCS  P+PGSW NLSH+DGN SDPCGPNRT LGGS S CP
Sbjct  1    MDSSAGPGNISDCSDPLAPASCS--PAPGSWLNLSHVDGNQSDPCGPNRTGLGGSHSLCP  58

Query  61   PTGSPSMITAVTIMALYSIVCVVGLFGNFLVMYVIIRYTKMKTATNIYIFNLALADALAT  120
             TGSPSM+TA+TIMALYSIVCVVGLFGNFLVMYVI+RYTKMKTATNIYIFNLALADALAT
Sbjct  59   QTGSPSMVTAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALAT  118

Query  121  STLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  180
            STLPFQSVNYLMGTWPFG ILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF
Sbjct  119  STLPFQSVNYLMGTWPFGNILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  178

Query  181  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRHGSIDCTLTFSHPTWYWENLLKICVFI  240
            RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYR GSIDCTLTFSHPTWYWENLLKICVFI
Sbjct  179  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFI  238

Query  241  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  300
            FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI
Sbjct  239  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  298

Query  301  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  360
            YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI
Sbjct  299  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  358

Query  361  EQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP  400
            EQQNS RIRQNTR+HPSTANTVDRTNHQLENLEAETAPLP
Sbjct  359  EQQNSARIRQNTREHPSTANTVDRTNHQLENLEAETAPLP  398


> ENSMUSP00000090410 ENSMUST00000092734;ENSMUSG00000000766;Oprm1
Length=398

 Score = 753 bits (1943),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 374/400 (94%), Positives = 383/400 (96%), Gaps = 2/400 (1%)

Query  1    MDSSVLPGNAGNCTDPFAQSSCSLAPSPGSWTNLSHLDGNLSDPCGPNRTELGGSDSRCP  60
            MDSS  PGN  +C+DP A +SCS  P+PGSW NLSH+DGN SDPCGPNRT LGGS S CP
Sbjct  1    MDSSAGPGNISDCSDPLAPASCS--PAPGSWLNLSHVDGNQSDPCGPNRTGLGGSHSLCP  58

Query  61   PTGSPSMITAVTIMALYSIVCVVGLFGNFLVMYVIIRYTKMKTATNIYIFNLALADALAT  120
             TGSPSM+TA+TIMALYSIVCVVGLFGNFLVMYVI+RYTKMKTATNIYIFNLALADALAT
Sbjct  59   QTGSPSMVTAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALAT  118

Query  121  STLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  180
            STLPFQSVNYLMGTWPFG ILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF
Sbjct  119  STLPFQSVNYLMGTWPFGNILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  178

Query  181  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRHGSIDCTLTFSHPTWYWENLLKICVFI  240
            RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYR GSIDCTLTFSHPTWYWENLLKICVFI
Sbjct  179  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFI  238

Query  241  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  300
            FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI
Sbjct  239  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  298

Query  301  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  360
            YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI
Sbjct  299  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  358

Query  361  EQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP  400
            EQQNS RIRQNTR+HPSTANTVDRTNHQLENLEAETAPLP
Sbjct  359  EQQNSARIRQNTREHPSTANTVDRTNHQLENLEAETAPLP  398

What I actually need is to find all 100% matches between my query and database sequences (their sequence IDs and the corresponding coordinates).

R blast parse • 3.5k views

If you know R, you're better off rerunning your BLAST search in tab-delimited format (-outfmt 6) and filtering the results on the columns you need from there.
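
To make that concrete, here is a minimal sketch in base R, assuming the search was rerun with the default -outfmt 6 columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `read_perfect_hits` and the two example rows are made up for illustration; they are not taken from the search above. It treats a "100% match" as an alignment with zero mismatches and zero gap openings, which avoids relying on the rounded pident value.

```r
# Column names of the default BLAST+ tabular output (-outfmt 6).
blast_cols <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore")

# Hypothetical helper: read a tabular BLAST file and keep only the
# alignments (HSPs) that have no mismatches and no gaps.
read_perfect_hits <- function(path) {
  hits <- read.table(path, sep = "\t", header = FALSE,
                     col.names = blast_cols, quote = "",
                     comment.char = "", stringsAsFactors = FALSE)
  perfect <- hits[hits$mismatch == 0 & hits$gapopen == 0, ]
  # Sequence IDs plus query and subject coordinates of each perfect HSP.
  perfect[, c("qseqid", "sseqid", "qstart", "qend", "sstart", "send", "length")]
}

# Illustrative input only: two invented rows written to a temporary file.
example_lines <- c(
  "query1\tsubjA\t100.000\t50\t0\t0\t1\t50\t11\t60\t1e-30\t105",
  "query1\tsubjB\t94.000\t50\t3\t0\t1\t50\t1\t50\t1e-20\t90"
)
tmp <- tempfile(fileext = ".tsv")
writeLines(example_lines, tmp)

perfect <- read_perfect_hits(tmp)
print(perfect)
```

Note that this only says the aligned region is identical, not that it covers the whole query. If you need full-length perfect matches, one option is to request the query length as an extra column (for example -outfmt "6 std qlen") and also require that qstart is 1 and qend equals qlen; the column names in the sketch would then need the extra entry.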
