This repository is dedicated to the development of (large) language models, and language/structure models, for the bio-chemical space.
Further details can be found here
We are training an ELECTRA-style model on the PubChem dataset using SELFIES representations. SELFIES is a chemical string language based on SMILES, but more robust: every syntactically valid SELFIES string corresponds to a valid molecule. More info about SELFIES can be found here.
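To make the representation concrete, here is a minimal stdlib-only sketch of tokenizing a SELFIES string. A SELFIES string is a sequence of bracketed symbols, so splitting on brackets recovers the token sequence a model would consume. (The example string and the helper name are illustrative; the official `selfies` Python package provides a full encoder/decoder.)

```python
import re

def split_selfies(s: str) -> list[str]:
    # SELFIES strings are sequences of bracketed symbols, e.g. "[C][=C][F]";
    # extracting the bracketed chunks yields the token sequence.
    return re.findall(r"\[[^\]]*\]", s)

# A SELFIES-style string (illustrative) and its token sequence:
tokens = split_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]")
print(tokens)
# → ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```

Because every bracketed symbol is a self-contained token, this vocabulary is well suited to standard language-model tokenization.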
We have released the dataset on HuggingFace Datasets; it contains ~110M compounds in total.
We will perform a hyperparameter search using Maximal Update Parametrization (µP) to find a good set of hyperparameters to transfer to a larger model. To launch a sweep of N configurations on the cluster, run

```shell
sbatch --array=1-N mup_train.sh
```
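With `--array=1-N`, Slurm launches N copies of the job, each with a distinct `SLURM_ARRAY_TASK_ID` from 1 to N. A sweep script typically maps that index onto a hyperparameter grid. The sketch below shows one way to do this; the grid values and parameter names are hypothetical, not the repository's actual configuration.

```python
import itertools
import os

# Hypothetical µP sweep grid (illustrative names and values).
GRID = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "init_std": [0.02, 0.04],
    "output_mult": [0.5, 1.0],
}

def config_for_task(task_id: int) -> dict:
    # Enumerate the Cartesian product of all grid axes in a fixed order,
    # then pick the combination for this task. `--array=1-N` starts at 1,
    # so shift to a 0-based index.
    combos = list(itertools.product(*GRID.values()))
    return dict(zip(GRID.keys(), combos[task_id - 1]))

if __name__ == "__main__":
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(config_for_task(task_id))
```

Here N would be 12 (3 × 2 × 2 grid points), so the sweep would be launched with `sbatch --array=1-12 mup_train.sh`.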