gpt-j-8bit-lightning-finetune

Research on finetuning GPT-J-8bit with PyTorch Lightning.

The purpose of this repo is to explore GPT-like models and approaches to finetuning quantized GPT-J. A classification task was chosen as the test task. I compared the accuracy of three approaches to finetuning GPT-J-8bit, and also compared the final metrics with those of the OpenAI GPT-3 models Ada and DaVinci. This code can be reused to finetune GPT-like models.

System requirements

  1. At least 11 GB of VRAM
  2. Linux (required for bitsandbytes package)

This code was tested on WSL Ubuntu 22.04 with a GeForce GTX 1080 Ti and CUDA Toolkit 11.7.

Usage

To reproduce results locally*:

  1. Prepare the environment

     conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

  2. Clone the repo

     git clone https://github.com/vetka925/gpt-j-8bit-lightning-finetune.git

  3. Install the requirements**

     cd gpt-j-8bit-lightning-finetune
     pip install -r requirements.txt

  4. Run the Jupyter notebook finetune.ipynb

     jupyter notebook

**For possible issues with bitsandbytes on WSL, use this.

*Alternatively, you can run this Kaggle notebook with a P100 GPU.

Description

Full research description on Medium, Habr

  • Finetuning and approach comparison: finetune.ipynb
  • Finetuning an OpenAI model: compare_openai.ipynb
  • Few-shot example: fewshot.ipynb

The test task is Hate Speech and Offensive Language Detection.
Data: 1000 train and 200 validation samples with balanced classes from the Hate Speech and Offensive Language Dataset.
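
As an illustration, here is a minimal sketch of how such a balanced subset could be drawn with pandas. The file name, the `class` column, and the per-class counts are assumptions for the sketch, not values taken from the notebooks:

```python
import pandas as pd

# Hypothetical path to the dataset CSV; the public release of the
# Hate Speech and Offensive Language Dataset ships a labeled CSV with a
# "class" column, which this sketch assumes.
df = pd.read_csv("labeled_data.csv")

def balanced_sample(frame: pd.DataFrame, n_per_class: int, seed: int = 42) -> pd.DataFrame:
    # Take the same number of rows from each class label, then shuffle.
    parts = [g.sample(n=n_per_class, random_state=seed) for _, g in frame.groupby("class")]
    return pd.concat(parts).sample(frac=1.0, random_state=seed)

train_df = balanced_sample(df, n_per_class=334)                    # ~1000 balanced train rows
val_df = balanced_sample(df.drop(train_df.index), n_per_class=67)  # ~200 balanced validation rows
```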

This repo leverages three approaches to finetune GPT-J-8bit (see the sketch below):

  • Train the LayerNorm layers
  • Train low-rank adapters for the Linear layers in attention blocks
  • Train low-rank adapters for all Linear layers

A few-shot approach was also validated.
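
A minimal sketch of how the three variants differ only in which parameters are left trainable. The `adapter` attribute and the `"attn"` substring in module names are assumptions about the 8-bit GPT-J wrapper used here, not a confirmed API:

```python
import torch.nn as nn

def mark_trainable(model: nn.Module, mode: str = "layernorm") -> None:
    # Freeze every parameter of the quantized model first.
    for param in model.parameters():
        param.requires_grad = False

    for name, module in model.named_modules():
        if mode == "layernorm" and isinstance(module, nn.LayerNorm):
            # Approach 1: only LayerNorm weights and biases are trained.
            for param in module.parameters():
                param.requires_grad = True
        elif hasattr(module, "adapter") and module.adapter is not None:
            # Approaches 2 and 3: only the LoRA adapters are trained,
            # either in attention blocks only or in every Linear layer.
            if mode == "lora_all" or (mode == "lora_attention" and "attn" in name):
                for param in module.adapter.parameters():
                    param.requires_grad = True

# Example: mark_trainable(model, mode="lora_attention")
```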

Why do we need low-rank adapters?
In GPT-J-8bit the parameters are quantized. Training the quantized integer parameters directly with conventional gradient-based algorithms is not reasonable, if only because the updates derived from a cross-entropy loss (whose values typically lie in [0, 1]) are far smaller than one quantization step. And even without quantization, full finetuning still means training a huge number of parameters at a high computational cost. Instead, it is possible to train only small low-rank adapters. Low-rank adapters (LoRA) are described in this paper.
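
For reference, a minimal LoRA sketch in the spirit of that paper: the frozen weight W stays untouched and only the low-rank matrices A and B are trained, so the effective weight becomes W + BA. Class and attribute names and the rank value are illustrative; the frozen layer stands in for an 8-bit quantized Linear:

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    def __init__(self, frozen_linear: nn.Module, in_features: int,
                 out_features: int, rank: int = 8):
        super().__init__()
        self.frozen_linear = frozen_linear           # quantized layer, requires_grad=False
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)           # start with a zero update (identity behavior)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen full-rank path plus the trainable low-rank correction.
        return self.frozen_linear(x) + self.lora_B(self.lora_A(x))
```

Because B is initialized to zero, the adapted layer starts out equal to the frozen one, and only the small A and B matrices receive gradients.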

How was the dataset prepared?
The way the data is passed to the model matters: both instruction-based and raw text prompts were used.
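
An illustrative sketch of the two prompt styles, plus a few-shot variant; the exact wording and label verbalizations used in the notebooks may differ:

```python
def raw_prompt(text: str, label: str = "") -> str:
    # Raw-text style: the sample followed by the label the model should continue with.
    return f"Tweet: {text}\nLabel: {label}"

def instruction_prompt(text: str, label: str = "") -> str:
    # Instruction style: task description, then the sample, then the expected answer.
    return (
        "Classify the tweet as hate speech, offensive language, or neither.\n"
        f"Tweet: {text}\n"
        f"Answer: {label}"
    )

def few_shot_prompt(examples, query_text: str) -> str:
    # Few-shot style: several labeled examples followed by the unlabeled query.
    shots = "\n\n".join(raw_prompt(t, l) for t, l in examples)
    return f"{shots}\n\n{raw_prompt(query_text)}"
```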

What's next?
Research fast inference.
