Question

DCA/Finding canonical interacting AA residues on homo-oligomeric protein subunits

4.6 years ago

paula ▴ 30

I am building a DCA pipeline for my lab and am currently evaluating several DCA tools: EVcouplings, GREMLIN, and pydca. I am trying to determine the false positive rate of the predicted chain1-chain2 contact map produced by each tool, but I am struggling to find the "canonical" true positives for my control protein. I'm using the contacts listed in PDBsum, but they are very different from what I get from any of the DCA tools. Is there a better place to find contact maps, am I misinterpreting the contact maps produced by the tools, or am I overlooking something?

Tags: DCA • Direct Coupling Analysis • protein

updated 4.6 years ago by jgreener ▴ 390 • written 4.6 years ago by paula ▴ 30

score 1 · Answer 1 · 2020-04-22

4.6 years ago

jgreener ▴ 390

There are two possibilities that come to mind:

1. There is some discrepancy between the residue indices used for the DCA methods and those in the true contact maps. This could be due to using a different sequence as input, or to missing residues in the crystal structure.
2. The DCA methods are simply inaccurate. Their ability to predict contacts depends heavily on the depth and quality of the sequence alignments used as input.

Either way, you should compare the sequence given to the DCA method with the sequence underlying the true contact map and make sure they match up.
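One quick way to check the first possibility is to test whether a single numbering offset maps the residues observed in the structure onto the sequence given to the DCA tool. The following is a minimal Python sketch, not part of any of the tools mentioned: `find_offset` and the example data are hypothetical, and it assumes the residue numbers and one-letter codes have already been extracted from the PDB file.

```python
def find_offset(dca_seq, pdb_residues):
    """Return k such that PDB residue number n maps to dca_seq[n - k].

    pdb_residues maps PDB residue numbers to one-letter codes; residues
    missing from the structure are simply absent from the dict.
    Returns None if no single offset makes every observed residue match.
    """
    numbers = sorted(pdb_residues)
    if not numbers:
        return None
    # Only try offsets that keep every index inside dca_seq.
    lowest = numbers[-1] - len(dca_seq) + 1
    highest = numbers[0]
    for k in range(lowest, highest + 1):
        if all(dca_seq[n - k] == aa for n, aa in pdb_residues.items()):
            return k
    return None


# Made-up example: residue 7 is missing from the structure.
dca_seq = "MKTAYIAKQR"
observed = {5: "T", 6: "A", 8: "I", 9: "A", 10: "K"}
print(find_offset(dca_seq, observed))  # 3
```

If no offset fits, the two sequences genuinely differ (for example through mutations, tags, or insertions), and a proper pairwise alignment is needed before comparing contact maps.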

If you need software to calculate contact maps from PDB entries you can use Biopython or BioStructures.jl, e.g.

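using BioStructures

# filepath is the path to a local PDB file
struc = read(filepath, PDB)  # named PDBFormat in newer BioStructures releases
# C-beta atoms, one per residue (C-alpha is used for glycine)
cbetas = collectatoms(struc, cbetaselector)
# Residue pairs whose selected atoms lie within 8.0 Å of each other
cmap = ContactMap(cbetas, 8.0)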

Answer by jgreener ▴ 390 • 4.6 years ago

Thank you for the tip about BioStructures - I have never used Julia but I'm working on getting it working now. Will the contact maps it predicts be for chain-chain contacts or contacts within a chain? (I'm benchmarking with actin.)

At first I thought the discrepancy between the PDBsum contacts and the contact maps produced by the DCA tools was due to different sequences, because PDBsum uses a sequence four residues shorter than the one I was originally using, but the discrepancy remains even after rerunning everything with the same sequence.

I haven't been using any MSA files; the inputs all seemed to ask for just a single sequence. Reading through the documentation, it seems that an MSA file of the actin family from Pfam would be best to use. Do I have that correct? (Please forgive all of the questions. I'm a student intern in a wet lab with no other bioinformaticians, and I've only worked with genomic data, not proteomic data, before.)

Reply by paula ▴ 30 • 4.6 years ago

> Will the contact maps it predicts be for chain-chain contacts or contacts within a chain?

It can be for either.

# C-beta atoms of chain A only
cbetas_A = collectatoms(struc["A"], cbetaselector)
# One set of atoms: contacts within chain A
cmap = ContactMap(cbetas_A, 8.0)

gets contacts within a chain, whereas

# C-beta atoms of each chain
cbetas_A = collectatoms(struc["A"], cbetaselector)
cbetas_B = collectatoms(struc["B"], cbetaselector)
# Two sets of atoms: contacts between chain A and chain B
cmap = ContactMap(cbetas_A, cbetas_B, 8.0)

gets contacts between chains.
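For anyone not using Julia, the underlying calculation is just a distance threshold over pairs of points, so it is easy to sketch in plain Python. This is an illustration under assumptions rather than a port of BioStructures: the function names are hypothetical, the coordinates are made up, and the input is assumed to be a list of C-beta positions (in Å) already extracted from each chain, for example with Biopython.

```python
import math  # math.dist requires Python 3.8 or later


def contact_pairs(coords_a, coords_b, cutoff=8.0):
    """Return the set of (i, j) index pairs whose points lie within cutoff.

    coords_a and coords_b are lists of (x, y, z) tuples, one per residue.
    """
    return {
        (i, j)
        for i, a in enumerate(coords_a)
        for j, b in enumerate(coords_b)
        if math.dist(a, b) <= cutoff
    }


def intra_chain_pairs(coords, cutoff=8.0, min_separation=1):
    """Contacts within one chain, each reported once as (i, j) with i < j."""
    return {
        (i, j)
        for i, j in contact_pairs(coords, coords, cutoff)
        if j - i >= min_separation
    }


# Made-up coordinates for illustration only (not from a real structure).
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
chain_b = [(0.0, 5.0, 0.0), (30.0, 0.0, 0.0)]

print(sorted(intra_chain_pairs(chain_a)))       # [(0, 1)]
print(sorted(contact_pairs(chain_a, chain_b)))  # [(0, 0), (1, 0)]
```

Raising `min_separation` drops pairs that are close only because they are neighbours in sequence, which is a common choice when scoring predicted contacts.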

> Reading through the documentation, it seems that an MSA file of the actin family from Pfam would be best to use

This explains the discrepancy, then. DCA methods are statistical and require MSAs to work, preferably deep ones (a few hundred sequences or more). It is possible that some of the methods you are using generate an MSA as part of the pipeline, but if you are comparing methods I would obtain your own MSA and use it as input to all of them.
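Once every tool has been run on the same MSA, one simple way to compare them is the fraction of the top-ranked predicted pairs that are real contacts in the structure. The sketch below is only an illustration: `top_n_precision`, the scores, and the contact set are hypothetical, and it assumes each tool's output has already been converted to `(i, j, score)` tuples that use the same residue indexing as the true contact set.

```python
def top_n_precision(scored_pairs, true_contacts, n):
    """Fraction of the n highest-scoring predicted pairs that are true contacts.

    scored_pairs: iterable of (i, j, score) tuples from a DCA tool.
    true_contacts: set of (i, j) pairs derived from the structure.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[2], reverse=True)[:n]
    if not ranked:
        return 0.0
    hits = sum((i, j) in true_contacts for i, j, _ in ranked)
    return hits / len(ranked)


# Made-up predictions and contacts for illustration only.
predicted = [(0, 0, 0.9), (1, 0, 0.7), (2, 1, 0.6), (0, 1, 0.2)]
true_contacts = {(0, 0), (1, 0)}

print(top_n_precision(predicted, true_contacts, 2))  # 1.0
print(top_n_precision(predicted, true_contacts, 3))
```

One minus this value is the share of false positives among the top `n` predictions, which is usually more informative than scoring every pair a tool reports.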

> I'm a student intern in a wet lab with no other bioinformaticians

This can be difficult and is a common problem in bioinformatics. If possible I would try and find someone at your institution with relevant expertise to help.

Reply by jgreener ▴ 390 • 4.6 years ago

> If possible I would try and find someone at your institution with relevant expertise to help.

Yes, the original plan was for me to get guidance on projects from the bioinformatics core at the institution, but since my internship started remotely as all in-person research activities shut down due to COVID-19, that has gone out the window for the time being, and I'm muddling through as best I can. Thank you so much for your help, this has given me more guidance than a week and a half of deciphering documentation.

Reply by paula ▴ 30 • 4.6 years ago

Makes sense. Not the easiest time to start a position for sure, but I'm sure you'll muddle through.

Reply by jgreener ▴ 390 • 4.6 years ago