Unlearning

This folder contains implementations of machine unlearning methods for LLM360 models. Machine unlearning is a pre-deployment safety measure designed to remove hazardous knowledge from language models. The aim is a model that is safer by construction, because it lacks the knowledge that could be misused.

Overview

Here's a list of unlearning methods we have implemented so far.

| Method | Model |
| --- | --- |
| max_entropy | CrystalChat |
| min_posterior | CrystalChat |
| random_matching | CrystalChat |
| RMU | CrystalChat |
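
As an illustration of what one of these method names refers to, the sketch below shows the general idea suggested by the name max_entropy: on forget-set text, the next-token distribution is pushed toward uniform by maximizing its entropy. This is a standard-library toy under that assumption, not the repository's implementation. The function names `softmax`, `entropy`, and `max_entropy_loss` are hypothetical, and the actual loss in methods/utils.py may differ.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_loss(logits):
    """Hypothetical loss: negative entropy, so minimizing it maximizes entropy."""
    return -entropy(softmax(logits))


# A flat distribution has the lowest loss; a confident one has a higher loss.
uniform_loss = max_entropy_loss([0.0, 0.0, 0.0, 0.0])
peaked_loss = max_entropy_loss([10.0, 0.0, 0.0, 0.0])
```

In a real training loop this quantity would be computed on model logits for forget-set tokens and minimized by gradient descent, typically alongside a term that preserves behavior on retained data.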

Directory Structure

unlearn.py is the main entry point for running unlearning methods. It uses the Python modules in the methods/ and utils/ folders.

The methods/ folder contains the implementations for unlearning methods:

training.py: All training loop implementations
utils.py: Loss functions and other method-related utils

The utils/ folder contains helper functions for model/dataset IO:

data_utils.py: Dataloader for text datasets
model_utils.py: Model IO utils

By default, unlearned models are saved to the models/ folder. Please store all training datasets in the data/ folder.
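
The folder convention above can be set up ahead of a run with a few lines of Python. This is a minimal sketch, assuming the folders live directly under the directory from which unlearn.py is run; `prepare_layout` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path


def prepare_layout(root="."):
    """Create data/ and models/ under root if missing and return both paths."""
    root = Path(root)
    data_dir = root / "data"      # training datasets go here
    models_dir = root / "models"  # unlearned models are saved here by default
    data_dir.mkdir(parents=True, exist_ok=True)
    models_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, models_dir
```

Calling `prepare_layout()` from the unlearning folder leaves existing contents untouched, since `exist_ok=True` makes the call a no-op for folders that are already present.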

Note

This project uses the bio-forget-corpus from the WMDP Benchmark for unlearning training. Access to this dataset requires a separate request. Please follow the instructions provided here to obtain the necessary permissions. By default, the dataloader is configured to load the dataset from data/bio_forget.jsonl.
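
Before starting a run, it can help to confirm that the dataset is in place and readable. The sketch below reads a JSON Lines file from the default location using only the standard library. It assumes each line is a JSON object with a `text` field, which is an assumption about the schema rather than something stated here; `read_jsonl` is a hypothetical helper and not the repository's dataloader in utils/data_utils.py.

```python
import json
from pathlib import Path


def read_jsonl(path="data/bio_forget.jsonl", text_key="text"):
    """Return the text field of every non-empty line in a JSON Lines file.

    text_key is an assumed field name; adjust it to match the real records.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; obtain access to the corpus and place it in data/"
        )
    texts = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                texts.append(json.loads(line)[text_key])
    return texts
```

A quick `len(read_jsonl())` then gives the number of documents found, and a `FileNotFoundError` signals that the access request or file placement step is still pending.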

Installation

Clone and enter the repo:

git clone https://github.com/LLM360/Analysis360.git
cd Analysis360/analysis/unlearning

Install dependencies:
```
pip install -r requirements.txt
```
To install lm-eval, please check the installation instructions in the metrics/harness folder.

Quick Start

Training and Evaluation

A usage example is provided in demo.ipynb, which can be run on a single A100 80GB GPU.
