Converting a PSSM (Position Specific Scoring Matrix) from NCBI to matrix format
5.7 years ago
gaiboyan23 ▴ 30

When I run PSI-BLAST (3 iterations) and download the PSSM (Position Specific Scoring Matrix), the files come as ASN.1 files. How can I convert these ASN.1 files to matrices? The PSSMs should look like matrices with 20 columns (for the 20 amino acids) and L rows (L = length of the protein).

However, the files I download are ASN.1 files, which look like this:

numRows 28,         
numColumns 555,         
swissprot {accession "q2m32 9"}             
inst {          
repr raw,           
mol aa,         
length 555                  
{ 0, 10, 0 },           
{ 634267618256118,  10, -16 },
{ 0, 10, 0 },           
{ 129176519709001,  10, -16 },
{ 182041473572035,  10, -16 },
{ 292541684289281,  10, -16 }
.....

How can I download/convert this ASN.1 format to what a PSSM matrix is supposed to look like?

alignment sequence • 5.0k views
4.4 years ago
bjwiley23 ▴ 40

I have been looking for a resource that says what the 28 rows correspond to so I can do this programmatically, but I have not found one yet. In the meantime, you can upload your Scoremat.asn file to the pssm_viewer: choose the file, click 'matrix view', then 'download matrix to file'.

EDIT: I got a response from Christiam C. at NCBI. The 28 rows are based off the order of the array NCBISTDAA_TO_AMINOACID (see definition below): https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_encoding.c#L115

I confirmed the order with a PSI-BLAST run.
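As a small illustration of that ordering, here is a sketch that writes the 28 letters out as a Python string and looks up which row belongs to which residue. The string is copied from the NCBISTDAA_TO_AMINOACID array linked above; the names `NCBISTDAA` and `row_of` are just illustrative, not part of any NCBI API.

```python
# 28-letter NCBIstdaa alphabet, in the order of NCBISTDAA_TO_AMINOACID
# (copied from blast_encoding.c; check it against the linked source).
NCBISTDAA = "-ABCDEFGHIKLMNPQRSTVWXYZU*OJ"

# Hypothetical helper mapping: residue letter -> index of its row in the PSSM.
row_of = {letter: index for index, letter in enumerate(NCBISTDAA)}

print(len(NCBISTDAA))            # 28
print(row_of["A"], row_of["W"])  # 1 20
```

So index 0 is the gap character, 'A' is index 1, and the non-standard codes (B, X, Z, U, *, O, J) are the entries you would normally drop.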


Hello, do you have a script for batch conversion of Scoremat.asn files to PSSM matrix format? Thank you for taking the time to answer me.


Hi, can you help me? I'm at my wit's end. Could you show me an example of Python parsing? Thank you very much !!!!!!


Hi @leeiqi,

I don't think I have the Python script anymore. To do this, you really just need the order of the amino acids in https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_encoding.c#0115 (see line 115); there are 28 entries, I believe. Then parse out all the numbers with a regex, taking everything between

finalData {
      scores {

and

},
      lambda { 0, 10, 0 },

The result should have a length equal to 28 times the length of your protein, n. So it is a 1-D vector of 28 × n values, which you then reshape with NumPy into a 2-D array of shape (n, 28), i.e. n rows and 28 columns. Then keep only the amino acid columns you need, for example with a list comprehension. You don't need '-', 'Z', 'U', '*', 'O', 'J' and so on, so remove those columns by their index and you have the PSSM!
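A minimal sketch of those steps, not the original script. It assumes the `scores { ... }` block contains only comma-separated integers, that each position's 28 values are stored one after another (as described above), and that you want the 20 columns in the order `ARNDCQEGHILKMFPSTWYV`, which is the order commonly seen in ASCII PSSM files. The function name `scoremat_to_pssm` and the toy input are made up for illustration.

```python
import re

import numpy as np

# 28-letter NCBIstdaa order, copied from NCBISTDAA_TO_AMINOACID in blast_encoding.c.
NCBISTDAA = "-ABCDEFGHIKLMNPQRSTVWXYZU*OJ"
# Assumed output column order (the 20 standard amino acids).
STANDARD_AA = "ARNDCQEGHILKMFPSTWYV"


def scoremat_to_pssm(asn_text):
    """Return an (n, 20) integer array from the text of a Scoremat.asn file."""
    # Grab everything between "scores {" and its closing brace.
    match = re.search(r"scores\s*\{([^{}]*)\}", asn_text)
    if match is None:
        raise ValueError("no 'scores { ... }' block found")
    values = [int(v) for v in re.findall(r"-?\d+", match.group(1))]
    if len(values) % len(NCBISTDAA) != 0:
        raise ValueError("number of scores is not a multiple of 28")
    # 1-D vector of 28 * n values -> n rows, 28 columns.
    matrix = np.array(values).reshape(-1, len(NCBISTDAA))
    # Keep only the columns of the 20 standard amino acids.
    keep = [NCBISTDAA.index(aa) for aa in STANDARD_AA]
    return matrix[:, keep]


# Toy input with two positions; real files hold actual scores here.
toy_scores = ", ".join(str(v) for v in list(range(28)) + list(range(100, 128)))
toy_text = "finalData {\n scores {\n " + toy_scores + "\n },\n lambda { 0, 10, 0 },\n}"
pssm = scoremat_to_pssm(toy_text)
print(pssm.shape)  # (2, 20)
```

For batch conversion you could loop over your files, read each one as text, and save the result with `np.savetxt`.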
