Intractable graph size scaling for large proteins #217

Open
tristanic opened this issue Jun 5, 2024 · 4 comments

@tristanic

Hi,

As I understand it, one of the end goals of ESPALOMA is to be able to parameterise an entire system without the need for individual residue templates. That is what got me excited about it in the first place: the promise of straightforward handling of new covalent modifications to protein or DNA residues is particularly alluring. Unfortunately, it looks like that won't be possible with the current implementation for any but the smallest proteins, because of how the size of the heterograph scales with the number of atoms. I read in a protein from a PDB file with:

import espaloma as esp
from openff.toolkit import Topology
top = Topology.from_pdb('protein.pdb')
mol = top.molecule(0)
mgraph = esp.Graph(mol)

My first attempt (with a ~700-residue protein) was killed by the Linux OOM killer after chewing > 22 GB of system RAM. Trying with a series of smaller poly-A models (replication files attached):

import espaloma as esp
from openff.toolkit import Topology

# Record heterograph size for poly-alanine models of increasing length
with open('graph_size.csv', 'wt') as out:
    print('Residues,Atoms,Nodes,Edges', file=out)
    for res_count in (5,10,20,30,40,50,75,100):
        # Each ala{N}.pdb holds a single poly-A chain of N residues
        top = Topology.from_pdb(f'ala{res_count}.pdb')
        mol = top.molecule(0)
        mgraph = esp.Graph(mol)
        het = mgraph.heterograph
        # Total node and edge counts, summed over all node and edge types
        print(f'{res_count},{mol.n_atoms},{het.number_of_nodes()},{het.number_of_edges()}', file=out)

... shows the node and edge count both scaling as O(n**2) - a 1,003-atom model gives a heterograph with just over a million nodes and 6 million edges. This seems excessive to me, but I don't yet understand enough about the architecture to know the reasons for it. Extrapolating out, a (still pretty reasonably-sized) 10k-atom protein would yield a graph with about 100 million nodes and 600 million edges.
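The extrapolation above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the counts follow count = c * n**2 exactly and uses the rounded figures quoted in this post (1,003 atoms, about 1 million nodes, about 6 million edges). The helper names `loglog_slope` and `extrapolate_quadratic` are made up for illustration, and the slope check runs on synthetic data rather than on the attached CSV.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); close to 2 for quadratic growth."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


def extrapolate_quadratic(ref_atoms, ref_count, n_atoms):
    """Scale a count measured at ref_atoms to n_atoms, assuming count = c * n**2."""
    c = ref_count / ref_atoms ** 2
    return c * n_atoms ** 2


# Synthetic, exactly quadratic data, used only to demonstrate the slope check.
atoms = [100, 200, 400, 800]
counts = [n * n for n in atoms]
slope = loglog_slope(atoms, counts)

# Rounded figures from the post: 1,003 atoms -> ~1e6 nodes, ~6e6 edges.
nodes_10k = extrapolate_quadratic(1_003, 1.0e6, 10_000)
edges_10k = extrapolate_quadratic(1_003, 6.0e6, 10_000)

print(f"log-log slope on synthetic data: {slope:.3f}")
print(f"10k atoms: ~{nodes_10k:.2e} nodes, ~{edges_10k:.2e} edges")
```

Passing the Atoms and Nodes columns of graph_size.csv to `loglog_slope` would give the measured exponent; a value near 2 is what the O(n**2) reading predicts.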

[Image: espaloma_graph_size]

Can you shed some light on what's going on, and do you have any ideas on how to improve on this?

[Attachment: espaloma_polya.tar.gz]

@diogomart

We have developmental code to represent each residue as an individual RDKit molecule, like a "chorizo" in which each residue is one of the links. Residues carry a few extra atoms to model the chemistry of adjacent residues, as well as lists of atom indices to keep track of what's "real" and what's padding. It would be a bit of work to generalize it but as it stands we get at least espaloma charges for entire proteins and nucleic acids.
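A rough illustration of the bookkeeping described above, not the developmental code itself. It is a sketch under stated assumptions: the class `PaddedResidue`, its fields, and the function `collect_real_values` are hypothetical names, plain lists stand in for RDKit molecules, and the per-atom values are placeholders rather than real espaloma charges.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PaddedResidue:
    """One link of the chain: a residue plus padding atoms from its neighbours (hypothetical)."""
    name: str
    atom_names: List[str]    # every atom in the fragment, real and padding
    real_indices: List[int]  # positions in atom_names that belong to this residue


def collect_real_values(
    residues: List[PaddedResidue],
    per_fragment_values: List[List[float]],
) -> List[Tuple[str, str, float]]:
    """Keep per-atom values for real atoms only, in chain order; padding is discarded."""
    merged = []
    for res, values in zip(residues, per_fragment_values):
        if len(values) != len(res.atom_names):
            raise ValueError(f"{res.name}: expected {len(res.atom_names)} values")
        for i in res.real_indices:
            merged.append((res.name, res.atom_names[i], values[i]))
    return merged


# Two toy fragments; each carries one padding atom from its neighbour.
chain = [
    PaddedResidue("ALA1", ["N", "CA", "C", "O", "N(next)"], [0, 1, 2, 3]),
    PaddedResidue("ALA2", ["C(prev)", "N", "CA", "C", "O"], [1, 2, 3, 4]),
]
# Placeholder numbers standing in for per-fragment model output.
values = [
    [0.1, 0.2, 0.3, 0.4, 9.9],
    [9.9, 0.5, 0.6, 0.7, 0.8],
]
merged = collect_real_values(chain, values)
print(len(merged), merged[0], merged[-1])
```

Because each fragment has a bounded number of atoms, the per-fragment graph size stays bounded too, so the total should grow roughly linearly with residue count if the quadratic scaling measured in this thread applies per molecule.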

@tristanic
Author

That's one sensible solution, which may in fact be preferable in an interactive environment like where I want to apply it, since it'd cut down the cost of reparameterising after modifications to a large protein... just re-do the affected region, rather than the whole thing. But I'm more curious about whether there are ways to improve this scaling in the first place... without digging deeply into the code, a naive interpretation of this would suggest that each individual atom is getting its own all-atom subgraph. That doesn't feel right to me, but it's totally possible I'm missing something fundamental.
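A quick arithmetic check of that naive reading, offered as a guess and not as a statement about how espaloma actually builds its heterograph: for the 1,003-atom model, both the number of ordered atom pairs and the number of entries in "one all-atom copy per atom" land just over a million, which is the same range as the reported node count. The function names below are made up for illustration.

```python
def ordered_pairs(n_atoms):
    """Number of ordered pairs (i, j) with i != j."""
    return n_atoms * (n_atoms - 1)


def all_atom_copies(n_atoms):
    """Number of entries if every atom carried its own copy of all atoms."""
    return n_atoms * n_atoms


n = 1_003  # atom count of the largest poly-A model in this thread
pairs = ordered_pairs(n)
copies = all_atom_copies(n)
print(pairs, copies)
```

Either count is consistent with "just over a million" nodes, so the totals alone cannot distinguish the two; a per-node-type breakdown of the heterograph would be needed to tell them apart.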

@ijpulidos
Contributor

@tristanic I can reproduce your results, and indeed it's creating pretty big graphs. I checked the code that's consuming most of the memory, and it boils down to this line. Unfortunately, I don't see how we could make that line consume significantly less memory, especially with the restrictions that DGL is already imposing.

In private communication, I think @yuanqing-wang has proposed ways to modify the architecture to make it more memory efficient, but I don't think that's a quick fix or something we can do right now.

@tristanic
Author

tristanic commented Jun 25, 2024 via email
