This repository is dedicated to the development of (large) language models, and language/structure models, for the bio-chemical space.
Further details can be found here
We are training an ELECTRA-style model on the PubChem dataset using SELFIES representations. SELFIES is a chemical string language based on SMILES, but more robust: every syntactically valid SELFIES string corresponds to a valid molecule. More info about SELFIES can be found here.
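To make the representation concrete, here is a minimal stdlib-only sketch of tokenizing a SELFIES string. A SELFIES string is a sequence of bracketed symbols, so splitting on brackets recovers the token sequence a model would consume. (The example string and the helper name are illustrative; the official `selfies` Python package provides a full encoder/decoder.)

```python
import re

def split_selfies(s: str) -> list[str]:
    # SELFIES strings are sequences of bracketed symbols, e.g. "[C][=C][F]";
    # extracting the bracketed chunks yields the token sequence.
    return re.findall(r"\[[^\]]*\]", s)

# A SELFIES-style string (illustrative) and its token sequence:
tokens = split_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]")
print(tokens)
# → ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```

Because every bracketed symbol is a self-contained token, this vocabulary is well suited to standard language-model tokenization.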
We have released the dataset on HuggingFace Datasets; it contains ~110M compounds in total.
We will perform a hyperparameter search using Maximal Update Parametrization (µP) to find a good set of hyperparameters to transfer to a larger model. To launch a sweep of N configurations on the cluster, run

```shell
sbatch --array=1-N mup_train.sh
```
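With `--array=1-N`, Slurm launches N copies of the job, each with a distinct `SLURM_ARRAY_TASK_ID` from 1 to N. A sweep script typically maps that index onto a hyperparameter grid. The sketch below shows one way to do this; the grid values and parameter names are hypothetical, not the repository's actual configuration.

```python
import itertools
import os

# Hypothetical µP sweep grid (illustrative names and values).
GRID = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "init_std": [0.02, 0.04],
    "output_mult": [0.5, 1.0],
}

def config_for_task(task_id: int) -> dict:
    # Enumerate the Cartesian product of all grid axes in a fixed order,
    # then pick the combination for this task. `--array=1-N` starts at 1,
    # so shift to a 0-based index.
    combos = list(itertools.product(*GRID.values()))
    return dict(zip(GRID.keys(), combos[task_id - 1]))

if __name__ == "__main__":
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(config_for_task(task_id))
```

Here N would be 12 (3 × 2 × 2 grid points), so the sweep would be launched with `sbatch --array=1-12 mup_train.sh`.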