
Support for LLaMA-2 #23

Open
junzhang-zj opened this issue Sep 16, 2023 · 17 comments

Comments

@junzhang-zj

I couldn't reach 'allenai/c4' on the Hub.

@junzhang-zj
Author

junzhang-zj commented Sep 17, 2023

I have solved the data problem, but I ran into a new one: after using Wanda to prune LLaMA-2-13B, I got a zero ROUGE-2 score on CNN/DM, and my C4 perplexity under unstructured pruning is as high as 56050.3008.

@junzhang-zj junzhang-zj changed the title Data problem Data & Rouge-2 on CNN/DM problem Sep 17, 2023
@Eric-mingjie
Collaborator

Eric-mingjie commented Sep 23, 2023

Hi, we just updated the repo to support pruning LLaMA-2 models; see here for the corresponding command. We also provide the results from our own run.

@Eric-mingjie Eric-mingjie changed the title Data & Rouge-2 on CNN/DM problem Support for LLaMA2 Sep 23, 2023
@Eric-mingjie Eric-mingjie changed the title Support for LLaMA2 Support for LLaMA-2 Sep 23, 2023
@junzhang-zj
Author

@Eric-mingjie Thanks!

@junzhang-zj
Author

junzhang-zj commented Sep 23, 2023

@Eric-mingjie Could the perplexity results depend on the environment? I still get poor results on LLaMA-2.

@Eric-mingjie
Collaborator

Eric-mingjie commented Sep 23, 2023

For LLaMA-2, I used the transformers library at version 4.34.0.dev0 to load the models, specifically commit 0a55d9f7376f72ad3ff296d4249840021b03bcc4 on the main branch. What ppl number do you get?

@junzhang-zj
Author

My environment is transformers 4.34.0.dev0 and accelerate 0.24.0.dev0; I get ppl 146760.7188, and now a lot of CUDA errors.

@Eric-mingjie
Collaborator

Hmm, can you load the dense llama-2-7b model and test its perplexity? In that case, you can simply pass --sparsity_ratio 0 to skip pruning.
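For reference, a dense-baseline run might look like the following; the flag names are taken from the repo's README, so treat the exact arguments as assumptions:

```shell
# Evaluate the unpruned LLaMA-2-7B model; --sparsity_ratio 0 skips pruning,
# so the reported perplexity is the dense baseline.
python main.py \
    --model meta-llama/Llama-2-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0 \
    --sparsity_type unstructured
```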

@junzhang-zj
Author

OK, I will try it and check.

@Eric-mingjie
Collaborator

This is the output of conda env export from the conda environment I am running; hope it helps: https://gist.github.com/Eric-mingjie/4ca851c64144d53800d60e4c74ebfbaf

@junzhang-zj
Author

junzhang-zj commented Sep 23, 2023

@Eric-mingjie I get ppl wikitext_train 5.171178340911865 and wikitext_test 4.883730888366699 on LLaMA-2-13B with no pruning.
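For anyone comparing numbers: perplexity here is just the exponential of the average per-token negative log-likelihood over the evaluation text. A minimal sketch with toy probabilities (not the repo's eval code):

```python
import math

# Perplexity = exp(mean negative log-likelihood) over the evaluated tokens,
# equivalently the geometric mean of 1/p over each token probability p.
probs = [0.25, 0.5, 0.125, 0.5]      # toy per-token probabilities
nll = [-math.log(p) for p in probs]
ppl = math.exp(sum(nll) / len(nll))
print(ppl)  # 2 ** 1.75 ≈ 3.3636
```

On this scale, a dense LLaMA-2-13B at roughly 4.9 on wikitext is a sensible result, while values in the tens of thousands mean the model is effectively emitting noise.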

@junzhang-zj
Author

junzhang-zj commented Sep 23, 2023

I think it might help to look at why wrapped_layers[name].scaler_row is an all-zero tensor, which makes the metric fail. Have you run into this? It looks like something is wrong with the hook.
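For context, Wanda scores weight (i, j) as |W[i][j]| * sqrt(scaler_row[j]), where scaler_row accumulates input-activation norms collected by the hooks. A minimal sketch (a hypothetical simplification, not the repo's prune.py) of why an all-zero scaler_row breaks the metric:

```python
# metric[i][j] = |W[i][j]| * sqrt(scaler_row[j]); with scaler_row all zeros,
# every weight gets score 0 and the pruning mask becomes arbitrary.
def wanda_metric(W, scaler_row):
    return [[abs(w) * s ** 0.5 for w, s in zip(row, scaler_row)] for row in W]

W = [[0.5, -2.0], [1.5, 0.1]]
print(wanda_metric(W, [4.0, 1.0]))  # [[1.0, 2.0], [3.0, 0.1]] — informative scores
print(wanda_metric(W, [0.0, 0.0]))  # [[0.0, 0.0], [0.0, 0.0]] — every weight ties
```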

@junzhang-zj
Author

junzhang-zj commented Sep 23, 2023

😭 I finally found the bug: we need to set pretraining_tp to 1; otherwise the module forward is never executed and the hook callback never fires. ppl of LLaMA-2-13B (4:8): wikitext_train 7.27443265914917, wikitext_test 7.004149913787842.
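A small PyTorch sketch of this failure mode (toy code, not the transformers internals): forward hooks registered on an nn.Linear fire only when the module itself is called. If its weight is used in a raw F.linear call, which is what the LLaMA MLP does when pretraining_tp > 1, the hook is silently skipped and the calibration statistics stay at zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lin = nn.Linear(4, 4, bias=False)
calls = []
# Wanda-style hook: would normally accumulate input-activation statistics.
lin.register_forward_hook(lambda mod, inp, out: calls.append(inp[0].shape))

x = torch.randn(2, 4)
lin(x)                   # Module.__call__ runs the hook
F.linear(x, lin.weight)  # same math, but bypasses the module: hook skipped
print(len(calls))        # 1, not 2
```

The fix described above then amounts to setting model.config.pretraining_tp = 1 after loading (attribute name per the Hugging Face LLaMA config) before collecting calibration statistics.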

@Eric-mingjie
Collaborator

That's good to know. I was starting to rerun the code on my end.

@simlaharma

simlaharma commented Jan 26, 2024

> I couldn't reach 'allenai/c4' on the Hub.

Hello @junzhang-zj, how did you solve the data problem? I get the following message:

```
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']
```

I changed the code for the c4 data to the following:

```python
traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
```

Then, I started getting the following error:

```
File "/simla/wanda/lib/data.py", line 48, in get_c4
traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1118, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/home/.local/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 92, in verify_splits
raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}
```

I tried downloading with:

```shell
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```

After downloading the whole dataset, I needed to change the load_dataset call to point at the local files, so I did the following:

```python
traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
valdata = load_dataset('/simla/wanda/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation', trust_remote_code=True)
```

Now I am getting the following error:

```
Failed to read file '/simla/wanda/c4/en/c4-train.00000-of-01024.json.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Generating train split: 0%| | 0/364868892 [00:00<?, ? examples/s]
Traceback (most recent call last):
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
dataset = json.load(f)
File "/usr/lib/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
for _, table in generator:
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 147, in _generate_tables
raise e
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 121, in _generate_tables
pa_table = paj.read_json(
File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/simla/wanda/main.py", line 110, in
main()
File "/simla/wanda/main.py", line 69, in main
prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
File "/simla/wanda/lib/prune.py", line 132, in prune_wanda
dataloader, _ = get_loaders("c4",nsamples=args.nsamples,seed=args.seed,seqlen=model.seqlen,tokenizer=tokenizer)
File "/simla/wanda/lib/data.py", line 80, in get_loaders
return get_c4(nsamples, seed, seqlen, tokenizer)
File "/simla/wanda/lib/data.py", line 50, in get_c4
traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
```

@junzhang-zj
Author

@simlaharma Have you tried downloading directly from the Hugging Face website and then loading it locally?

@rsong0606

@simlaharma

I had a similar issue to yours. Check this post; it worked for me:

huggingface/datasets#6746

@rakeshsai22

Can we use Wanda to prune the last linear layer in LLaMA-2?
