Intractable graph size scaling for large proteins #217
Comments
We have developmental code to represent each residue as an individual RDKit molecule, like a "chorizo" in which each residue is one of the links. Residues carry a few extra atoms to model the chemistry of adjacent residues, as well as lists of atom indices to keep track of what's "real" and what's padding. It would be a bit of work to generalize it, but as it stands we get at least espaloma charges for entire proteins and nucleic acids.
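A rough illustration of the bookkeeping this implies (the class and field names below are hypothetical, not the actual developmental code):

```python
# Hypothetical sketch of per-residue ("chorizo") bookkeeping: each residue is
# its own RDKit molecule that carries a few capping atoms from its neighbours,
# plus an index list recording which atoms are "real" versus padding.
from dataclasses import dataclass
from typing import List

from rdkit import Chem


@dataclass
class ResidueLink:
    mol: Chem.Mol                 # residue plus capping atoms from adjacent residues
    real_atom_indices: List[int]  # atoms that actually belong to this residue


def real_atoms(link: ResidueLink):
    """Yield only the atoms that are part of the residue itself, skipping padding."""
    for idx in link.real_atom_indices:
        yield link.mol.GetAtomWithIdx(idx)
```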
That's one sensible solution, which may in fact be preferable in an interactive environment like the one where I want to apply it, since it'd cut down the cost of reparameterising after modifications to a large protein: just re-do the affected region rather than the whole thing. But I'm more curious about whether there are ways to improve this scaling in the first place. Without digging deeply into the code, a naive interpretation would suggest that each individual atom is getting its own all-atom subgraph. That doesn't feel right to me, but it's totally possible I'm missing something fundamental.
@tristanic I can reproduce your results, and indeed it's creating pretty big graphs. I checked the code that's consuming most of the memory and it boils down to [this line](https://github.com/choderalab/espaloma/blob/cb8e5b23e3ec1ada356128debc6a2a5511ef0b98/espaloma/graphs/utils/read_heterogeneous_graph.py#L272). Unfortunately, I don't see how we could make that line consume significantly less memory, especially with the restrictions that DGL is already imposing. In private communications with @yuanqing-wang, I think he has proposed ways to modify the architecture to be more memory efficient, but I don't think that's a quick fix/thing to do right now.
@ijpulidos thanks for the feedback. We'll have a think about what to do next.
Hi,
As I understand it, one of the end goals of ESPALOMA is to be able to parameterise an entire system without the need for individual residue templates (this is what got me excited about it in the first place: the promise of straightforward handling of new covalent modifications to protein or DNA residues is particularly alluring). Unfortunately, it looks like that won't be possible with the current implementation for any but the smallest proteins, due to how the heterograph size scales with the number of atoms. Reading in a protein from PDB with:
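(A minimal sketch of the kind of load I mean, assuming the OpenFF Toolkit's `Molecule.from_polymer_pdb` loader and espaloma's `esp.Graph` wrapper; the exact snippet may differ:)

```python
# Minimal sketch (not the exact replication script): read a protein from PDB
# with the OpenFF Toolkit and build the espaloma heterograph from it.
import espaloma as esp
from openff.toolkit.topology import Molecule

# from_polymer_pdb handles PDB files containing standard biopolymer residues
molecule = Molecule.from_polymer_pdb("protein.pdb")

# wrapping the molecule constructs the DGL heterograph used by the model
molecule_graph = esp.Graph(molecule)
print(molecule_graph.heterograph)
```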
My first attempt (with a ~700-residue protein) was killed by the Linux OOM killer after chewing through more than 22 GB of system RAM. Trying with a series of smaller poly-A models (replication files attached):
... shows the node and edge count both scaling as O(n**2): a 1,003-atom model gives a heterograph with just over a million nodes and 6 million edges. This seems excessive to me, but I don't yet understand enough about the architecture to know the reasons for it. Extrapolating out, a (still pretty reasonably sized) 10k-atom protein would yield a graph with roughly 100 million nodes and 600 million edges.
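For reference, a minimal sketch of how I'd tally those counts (the poly-A file names below are illustrative placeholders; the actual files are in the attached archive):

```python
# Sketch of the node/edge tally for a series of poly-A models
# (file names are illustrative placeholders).
import espaloma as esp
from openff.toolkit.topology import Molecule

for n_residues in (5, 10, 20, 40):
    molecule = Molecule.from_polymer_pdb(f"polyA_{n_residues}.pdb")
    g = esp.Graph(molecule).heterograph
    # num_nodes()/num_edges() with no type argument sum over all node/edge types
    print(n_residues, molecule.n_atoms, g.num_nodes(), g.num_edges())
```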
Can you shed some light on what's going on, and do you have any ideas on how to improve on this?
espaloma_polya.tar.gz