gpt-j-8bit-lightning-finetune

Research on finetuning GPT-J-8bit with PyTorch Lightning.

The purpose of this repo is to explore GPT-like models and approaches to finetuning quantized GPT-J. A classification task was chosen as the test task. I compared the accuracy of three approaches to finetuning GPT-J-8bit, and also compared the final metrics with those of the OpenAI GPT-3 models Ada and DaVinci. This code can be reused to finetune GPT-like models.

System requirements

  1. At least 11 GB of VRAM
  2. Linux (required for bitsandbytes package)

This code was tested on WSL Ubuntu 22.04 with a GeForce GTX 1080 Ti and CUDA Toolkit 11.7.

Usage

To reproduce results locally*:

  1. Prepare the environment

     conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

  2. Clone the repo

     git clone https://github.com/vetka925/gpt-j-8bit-lightning-finetune.git

  3. Install the requirements**

     cd gpt-j-8bit-lightning-finetune
     pip install -r requirements.txt

  4. Run the Jupyter notebook finetune.ipynb

     jupyter notebook

**For possible issues with bitsandbytes on WSL, use this.

*Alternatively, you can run this Kaggle notebook with a P100 GPU.

Description

Full research description on Medium, Habr

  • Finetuning and approach comparison: finetune.ipynb
  • Finetuning an OpenAI model: compare_openai.ipynb
  • Few-shot example: fewshot.ipynb

The test task is Hate Speech and Offensive Language Detection.
Data: 1000 train and 200 validation samples with balanced classes from the Hate Speech and Offensive Language Dataset.
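
As an illustration, here is a minimal sketch of how such a balanced subset could be drawn with pandas. The file name, the `class` column, and the per-class counts are assumptions for the sketch, not values taken from the notebooks:

```python
import pandas as pd

# Hypothetical path to the dataset CSV; the public release of the
# Hate Speech and Offensive Language Dataset ships a labeled CSV with a
# "class" column, which this sketch assumes.
df = pd.read_csv("labeled_data.csv")

def balanced_sample(frame: pd.DataFrame, n_per_class: int, seed: int = 42) -> pd.DataFrame:
    # Take the same number of rows from each class label, then shuffle.
    parts = [g.sample(n=n_per_class, random_state=seed) for _, g in frame.groupby("class")]
    return pd.concat(parts).sample(frac=1.0, random_state=seed)

train_df = balanced_sample(df, n_per_class=334)                    # ~1000 balanced train rows
val_df = balanced_sample(df.drop(train_df.index), n_per_class=67)  # ~200 balanced validation rows
```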

This repo leverages three approaches to finetune GPT-J-8bit (see the sketch below):

  • Train the LayerNorm layers
  • Train low-rank adapters for the Linear layers in attention blocks
  • Train low-rank adapters for all Linear layers

A few-shot approach was also validated.
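
A minimal sketch of how the three variants differ only in which parameters are left trainable. The `adapter` attribute and the `"attn"` substring in module names are assumptions about the 8-bit GPT-J wrapper used here, not a confirmed API:

```python
import torch.nn as nn

def mark_trainable(model: nn.Module, mode: str = "layernorm") -> None:
    # Freeze every parameter of the quantized model first.
    for param in model.parameters():
        param.requires_grad = False

    for name, module in model.named_modules():
        if mode == "layernorm" and isinstance(module, nn.LayerNorm):
            # Approach 1: only LayerNorm weights and biases are trained.
            for param in module.parameters():
                param.requires_grad = True
        elif hasattr(module, "adapter") and module.adapter is not None:
            # Approaches 2 and 3: only the LoRA adapters are trained,
            # either in attention blocks only or in every Linear layer.
            if mode == "lora_all" or (mode == "lora_attention" and "attn" in name):
                for param in module.adapter.parameters():
                    param.requires_grad = True

# Example: mark_trainable(model, mode="lora_attention")
```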

Why do we need low-rank adapters?
In GPT-J-8bit the parameters are quantized. Training the quantized integer parameters directly with conventional gradient-based algorithms is not reasonable, if only because the updates derived from a cross-entropy loss (whose values typically lie in [0, 1]) are far smaller than one quantization step. And even without quantization, full finetuning still means training a huge number of parameters at a high computational cost. Instead, it is possible to train only small low-rank adapters. Low-rank adapters (LoRA) are described in this paper.
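
For reference, a minimal LoRA sketch in the spirit of that paper: the frozen weight W stays untouched and only the low-rank matrices A and B are trained, so the effective weight becomes W + BA. Class and attribute names and the rank value are illustrative; the frozen layer stands in for an 8-bit quantized Linear:

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    def __init__(self, frozen_linear: nn.Module, in_features: int,
                 out_features: int, rank: int = 8):
        super().__init__()
        self.frozen_linear = frozen_linear           # quantized layer, requires_grad=False
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)           # start with a zero update (identity behavior)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen full-rank path plus the trainable low-rank correction.
        return self.frozen_linear(x) + self.lora_B(self.lora_A(x))
```

Because B is initialized to zero, the adapted layer starts out equal to the frozen one, and only the small A and B matrices receive gradients.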

How was the dataset prepared?
The way the data is passed to the model matters: both instruction-based and raw text prompts were used.
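
An illustrative sketch of the two prompt styles, plus a few-shot variant; the exact wording and label verbalizations used in the notebooks may differ:

```python
def raw_prompt(text: str, label: str = "") -> str:
    # Raw-text style: the sample followed by the label the model should continue with.
    return f"Tweet: {text}\nLabel: {label}"

def instruction_prompt(text: str, label: str = "") -> str:
    # Instruction style: task description, then the sample, then the expected answer.
    return (
        "Classify the tweet as hate speech, offensive language, or neither.\n"
        f"Tweet: {text}\n"
        f"Answer: {label}"
    )

def few_shot_prompt(examples, query_text: str) -> str:
    # Few-shot style: several labeled examples followed by the unlabeled query.
    shots = "\n\n".join(raw_prompt(t, l) for t, l in examples)
    return f"{shots}\n\n{raw_prompt(query_text)}"
```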

What's next?
Research fast inference.
