Intractable graph size scaling for large proteins #217
Comments
We have developmental code to represent each residue as an individual RDKit molecule, like a "chorizo" in which each residue is one of the links. Residues carry a few extra atoms to model the chemistry of adjacent residues, as well as lists of atom indices to keep track of what's "real" and what's padding. It would be a bit of work to generalize it, but as it stands we get at least espaloma charges for entire proteins and nucleic acids.
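A rough illustration of the bookkeeping this implies (the class and field names below are hypothetical, not the actual developmental code):

```python
# Hypothetical sketch of per-residue ("chorizo") bookkeeping: each residue is
# its own RDKit molecule that carries a few capping atoms from its neighbours,
# plus an index list recording which atoms are "real" versus padding.
from dataclasses import dataclass
from typing import List

from rdkit import Chem


@dataclass
class ResidueLink:
    mol: Chem.Mol                 # residue plus capping atoms from adjacent residues
    real_atom_indices: List[int]  # atoms that actually belong to this residue


def real_atoms(link: ResidueLink):
    """Yield only the atoms that are part of the residue itself, skipping padding."""
    for idx in link.real_atom_indices:
        yield link.mol.GetAtomWithIdx(idx)
```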
That's one sensible solution, which may in fact be preferable in an interactive environment like the one where I want to apply it, since it'd cut down the cost of reparameterising after modifications to a large protein: just re-do the affected region rather than the whole thing. But I'm more curious about whether there are ways to improve this scaling in the first place. Without digging deeply into the code, a naive interpretation would suggest that each individual atom is getting its own all-atom subgraph. That doesn't feel right to me, but it's totally possible I'm missing something fundamental.
@tristanic I can reproduce your results, and indeed it's creating pretty big graphs. I checked the code that's consuming most of the memory and it boils down to [this line](https://github.com/choderalab/espaloma/blob/cb8e5b23e3ec1ada356128debc6a2a5511ef0b98/espaloma/graphs/utils/read_heterogeneous_graph.py#L272). Unfortunately, I don't see how we could make that line consume significantly less memory, especially with the restrictions that DGL is already imposing. In private communications with @yuanqing-wang, I think he has proposed ways to modify the architecture to be more memory efficient, but I don't think that's a quick fix/thing to do right now.
@ijpulidos thanks for the feedback. We'll have a think about what to do next.
Hi,
As I understand it, one of the end goals of ESPALOMA is to be able to parameterise an entire system without the need for individual residue templates (this is what got me excited about it in the first place: the promise of straightforward handling of new covalent modifications to protein or DNA residues is particularly alluring). Unfortunately, it looks like that won't be possible with the current implementation for any but the smallest proteins, due to how the heterograph size scales with the number of atoms. Reading in a protein from PDB with:
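(A minimal sketch of the kind of load I mean, assuming the OpenFF Toolkit's `Molecule.from_polymer_pdb` loader and espaloma's `esp.Graph` wrapper; the exact snippet may differ:)

```python
# Minimal sketch (not the exact replication script): read a protein from PDB
# with the OpenFF Toolkit and build the espaloma heterograph from it.
import espaloma as esp
from openff.toolkit.topology import Molecule

# from_polymer_pdb handles PDB files containing standard biopolymer residues
molecule = Molecule.from_polymer_pdb("protein.pdb")

# wrapping the molecule constructs the DGL heterograph used by the model
molecule_graph = esp.Graph(molecule)
print(molecule_graph.heterograph)
```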
My first attempt (with a ~700-residue protein) was killed by the Linux OOM killer after chewing through more than 22 GB of system RAM. Trying with a series of smaller poly-A models (replication files attached):
... shows the node and edge count both scaling as O(n**2): a 1,003-atom model gives a heterograph with just over a million nodes and 6 million edges. This seems excessive to me, but I don't yet understand enough about the architecture to know the reasons for it. Extrapolating out, a (still pretty reasonably sized) 10k-atom protein would yield a graph with roughly 100 million nodes and 600 million edges.
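For reference, a minimal sketch of how I'd tally those counts (the poly-A file names below are illustrative placeholders; the actual files are in the attached archive):

```python
# Sketch of the node/edge tally for a series of poly-A models
# (file names are illustrative placeholders).
import espaloma as esp
from openff.toolkit.topology import Molecule

for n_residues in (5, 10, 20, 40):
    molecule = Molecule.from_polymer_pdb(f"polyA_{n_residues}.pdb")
    g = esp.Graph(molecule).heterograph
    # num_nodes()/num_edges() with no type argument sum over all node/edge types
    print(n_residues, molecule.n_atoms, g.num_nodes(), g.num_edges())
```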
Can you shed some light on what's going on, and do you have any ideas on how to improve on this?
espaloma_polya.tar.gz