7.9 years ago
rubic
Hi,
Does anyone know of an R package that parses the pairwise-format output of NCBI's blastp, which looks like this:
BLASTP 2.4.0+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.
Database: mm10 ensembl AA
61,440 sequences; 26,382,989 total letters
Query= NP_001254519.1 mu-type opioid receptor [Heterocephalus glaber]
Length=400
                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value
ENSMUSP00000101232 ENSMUST00000105607;ENSMUSG00000000766;Oprm1      753    0.0
ENSMUSP00000090410 ENSMUST00000092734;ENSMUSG00000000766;Oprm1      753    0.0
ENSMUSP00000060590 ENSMUST00000056385;ENSMUSG00000000766;Oprm1      753    0.0
ENSMUSP00000077704 ENSMUST00000078634;ENSMUSG00000000766;Oprm1      731    0.0
. . .
> ENSMUSP00000101232 ENSMUST00000105607;ENSMUSG00000000766;Oprm1 Length=398
Score = 753 bits (1943), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 374/400 (94%), Positives = 383/400 (96%), Gaps = 2/400 (1%)
Query 1 MDSSVLPGNAGNCTDPFAQSSCSLAPSPGSWTNLSHLDGNLSDPCGPNRTELGGSDSRCP 60
MDSS PGN +C+DP A +SCS P+PGSW NLSH+DGN SDPCGPNRT LGGS S CP
Sbjct 1 MDSSAGPGNISDCSDPLAPASCS--PAPGSWLNLSHVDGNQSDPCGPNRTGLGGSHSLCP 58
Query 61 PTGSPSMITAVTIMALYSIVCVVGLFGNFLVMYVIIRYTKMKTATNIYIFNLALADALAT 120
TGSPSM+TA+TIMALYSIVCVVGLFGNFLVMYVI+RYTKMKTATNIYIFNLALADALAT
Sbjct 59 QTGSPSMVTAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALAT 118
Query 121 STLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF 180
STLPFQSVNYLMGTWPFG ILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF
Sbjct 119 STLPFQSVNYLMGTWPFGNILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF 178
Query 181 RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRHGSIDCTLTFSHPTWYWENLLKICVFI 240
RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYR GSIDCTLTFSHPTWYWENLLKICVFI
Sbjct 179 RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFI 238
Query 241 FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI 300
FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI
Sbjct 239 FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI 298
Query 301 YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI 360
YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI
Sbjct 299 YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI 358
Query 361 EQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP 400
EQQNS RIRQNTR+HPSTANTVDRTNHQLENLEAETAPLP
Sbjct 359 EQQNSARIRQNTREHPSTANTVDRTNHQLENLEAETAPLP 398
> ENSMUSP00000090410 ENSMUST00000092734;ENSMUSG00000000766;Oprm1 Length=398
Score = 753 bits (1943), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 374/400 (94%), Positives = 383/400 (96%), Gaps = 2/400 (1%)
Query 1 MDSSVLPGNAGNCTDPFAQSSCSLAPSPGSWTNLSHLDGNLSDPCGPNRTELGGSDSRCP 60
MDSS PGN +C+DP A +SCS P+PGSW NLSH+DGN SDPCGPNRT LGGS S CP
Sbjct 1 MDSSAGPGNISDCSDPLAPASCS--PAPGSWLNLSHVDGNQSDPCGPNRTGLGGSHSLCP 58
Query 61 PTGSPSMITAVTIMALYSIVCVVGLFGNFLVMYVIIRYTKMKTATNIYIFNLALADALAT 120
TGSPSM+TA+TIMALYSIVCVVGLFGNFLVMYVI+RYTKMKTATNIYIFNLALADALAT
Sbjct 59 QTGSPSMVTAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALAT 118
Query 121 STLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF 180
STLPFQSVNYLMGTWPFG ILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF
Sbjct 119 STLPFQSVNYLMGTWPFGNILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF 178
Query 181 RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRHGSIDCTLTFSHPTWYWENLLKICVFI 240
RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYR GSIDCTLTFSHPTWYWENLLKICVFI
Sbjct 179 RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFI 238
Query 241 FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI 300
FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI
Sbjct 239 FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI 298
Query 301 YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI 360
YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI
Sbjct 299 YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI 358
Query 361 EQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP 400
EQQNS RIRQNTR+HPSTANTVDRTNHQLENLEAETAPLP
Sbjct 359 EQQNSARIRQNTREHPSTANTVDRTNHQLENLEAETAPLP 398
What I actually need is to find all 100% matches between my query and database sequences (their sequence IDs and the corresponding coordinates).
If you know R, you're better off re-running your BLAST search with tab-delimited output (-outfmt 6) and parsing the results from there, filtering on whichever columns you need.
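As a rough sketch of that suggestion (not a tested pipeline): the default -outfmt 6 table has twelve columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), so base R's read.table is enough to load it. The helper name read_blast6 and the two demo rows below are made up for illustration; in practice you would pass the path of your own results file.

```r
# Column names of BLAST+ default tabular output (-outfmt 6).
blast_cols <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore")

# Hypothetical helper: read a tab-delimited BLAST table from a file path
# or any connection.
read_blast6 <- function(file) {
  read.table(file, sep = "\t", header = FALSE, col.names = blast_cols,
             quote = "", comment.char = "", stringsAsFactors = FALSE)
}

# Made-up demo rows standing in for a real results file.
demo <- c("query1\tsubjA\t100.000\t400\t0\t0\t1\t400\t1\t400\t0.0\t753",
          "query1\tsubjB\t93.500\t400\t24\t1\t1\t400\t1\t398\t0.0\t731")
hits <- read_blast6(textConnection(demo))

# Keep only hits with 100% identity, with their IDs and coordinates.
perfect <- hits[hits$pident == 100,
                c("qseqid", "sseqid", "qstart", "qend", "sstart", "send")]
print(perfect)
```

Note that pident == 100 only says the aligned region is identical; a local alignment can cover just part of the query. If you need full-length matches, one option is to also compare the length column with the query length (400 in the example above), or to request it as an extra column with something like -outfmt "6 std qlen".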