Instruction-tuning Support #196

Open · lllAlexanderlll wants to merge 81 commits into base: main

Conversation

lllAlexanderlll (Contributor)

What does this PR do?

This PR adds support for instruction tuning by:

  1. Introducing a new entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml, which takes an instruction dataset and converts the structured conversations into a single prompt by applying the chat template given as a Jinja2 template string within the config. Here, we also include indicator tokens to mark which parts are the system utterances (see the sketch after this list).
  2. In the modalities training entry point, you can now wrap the collate function in a "LossMaskingCollateFn", which first executes the wrapped collate function and then applies loss masking on each target as specified in the config. This allows including only the tokens that are part of the assistant answer in the loss, so that the model learns to act as a helpful assistant.
  3. Modifying the PackedMemMapDatasetContinuous to allow not re-using the last target token, as this is not wanted in instruction tuning, where we apply truncation and packing.
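
The following is a minimal Python sketch of what step 1 does conceptually: a Jinja2 chat template is rendered over a structured conversation, and indicator tokens delimit the assistant answer so that the loss masking from step 2 can find it later. The template string, the role names, and the indicator tokens "^"/"$" are illustrative assumptions, not the PR's actual defaults.

```python
# Sketch only: render a chat template over one structured conversation.
# Assumptions: role names, template string, and the indicator tokens "^"/"$".
from jinja2 import Template

chat_template = Template(
    "{% for turn in conversation %}"
    "{% if turn.role == 'assistant' %}"
    "^{{ turn.content }}$\n"                  # indicator tokens delimit the assistant answer
    "{% else %}"
    "{{ turn.role }}: {{ turn.content }}\n"
    "{% endif %}"
    "{% endfor %}"
)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is instruction tuning?"},
    {"role": "assistant", "content": "Fine-tuning a model on instruction data."},
]

# The rendered string is what would be stored as the new "chat" attribute in the output JSONL.
print(chat_template.render(conversation=conversation))
```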

General Changes

  • New entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml to convert structured JSONL into JSONL with the new attribute "chat", i.e., the prompt where the chat template was applied
  • A wrapper for collate functions to include only the tokens which appear between indicator tokens in the loss (see the sketch after this list)
  • A new parameter for the PackedMemMapDatasetContinuous to allow not re-using the last target token
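
Below is a sketch of the idea behind the collate-function wrapper. It is not the PR's LossMaskingCollateFn: the key names, the indicator token ids, and the -100 ignore index are assumptions. The wrapped collate function runs first; afterwards, every target token that is not inside a span delimited by the begin/end indicator tokens is excluded from the loss.

```python
# Sketch only: wrap an existing collate_fn and mask non-assistant targets.
# Assumptions: batch layout, key name "target_ids", and ignore_index = -100.
import torch

def loss_masking_collate(batch, wrapped_collate_fn, begin_id: int, end_id: int,
                         target_key: str = "target_ids", ignore_index: int = -100):
    collated = wrapped_collate_fn(batch)       # e.g. {"input_ids": ..., "target_ids": ...}
    targets = collated[target_key]

    masked = torch.full_like(targets, ignore_index)
    for row in range(targets.size(0)):
        inside = False
        for col in range(targets.size(1)):
            token = int(targets[row, col])
            if token == begin_id:
                inside = True                  # begin indicator: assistant answer starts
            elif token == end_id:
                inside = False                 # end indicator: assistant answer ends
            elif inside:
                masked[row, col] = token       # keep only assistant tokens in the loss
    collated[target_key] = masked
    return collated
```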

Breaking Changes

  • None, as the default value for PackedMemMapDatasetContinuous.reuse_last_target is True
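
For concreteness, here is a small sketch of what re-using (or not re-using) the last target token means for continuous packing, under the assumption of a shift-by-one input/target split; the stride logic in the actual PackedMemMapDatasetContinuous may differ in details.

```python
# Sketch only: continuous packing with and without re-using the last target token.
def make_blocks(token_ids, block_size, reuse_last_target=True):
    # Each sample holds block_size tokens; inputs and targets are shifted by one.
    # With re-use, the last target token of a sample is also the first input token
    # of the next sample (stride block_size - 1); without re-use, samples are disjoint.
    stride = block_size - 1 if reuse_last_target else block_size
    samples, start = [], 0
    while start + block_size <= len(token_ids):
        block = token_ids[start:start + block_size]
        samples.append((block[:-1], block[1:]))   # (inputs, targets)
        start += stride
    return samples

tokens = list(range(12))
print(make_blocks(tokens, block_size=4, reuse_last_target=True))   # blocks overlap by one token
print(make_blocks(tokens, block_size=4, reuse_last_target=False))  # blocks are disjoint
```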

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)

rrutmann and others added 30 commits July 15, 2024 13:18
…data. Change symbol for special tokens, which are actually a single token within the vocab.
Co-authored-by: Alexander Weber <[email protected]>

When we add a special token that already exists within the tokenizer vocabulary, HF only marks it as a special token (adds it to the trie).
This means that if the sequence we add as a special token already exists in the vocab, there is no need to resize the embedding matrix!
@le1nux (Member) commented on Aug 13, 2024:

So basically we represent the special token by at least one of the original tokens in the vocabulary?

For example, assuming "spec_tok_1" is supposed to be a special token, it gets tokenized to, e.g., ["spec", "_", "tok", "_", "1"]? Since it has been added to the trie, we still know that it is a special token, but during training and inference the special token would still be represented by a 5-token sequence, and therefore the embedding matrix can remain unchanged, right?

What happens if the special token cannot be represented?

@lllAlexanderlll (Contributor, Author) replied:

Not quite.
Given spec_tok_1 as a special token, which is part of the trie, and given the text to tokenize being Hi there spec_tok_1!:
First, the split method of the trie is used to extract all special tokens. In our case, the split would look like this: ["Hi there", "spec_tok_1", "!"]. Then all strings that are not part of the trie, i.e., the normal tokens, are tokenized, leading to, e.g., ["Hi", "there", "spec_tok_1", "!"].
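
A small sketch of the behavior described above, using Hugging Face transformers. The model name gpt2 and the token string "^" are assumptions; "^" is used because it already exists as a single token in the GPT-2 byte-level vocabulary, and exact return values may vary across transformers versions.

```python
# Sketch only: marking an existing vocabulary entry as a special token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab_size_before = len(tokenizer)

# "^" is already a single token in the vocab, so no new id should be created;
# the tokenizer only registers it as a special token (adds it to its trie /
# added-token machinery).
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["^"]})

print(num_added)                              # expected: 0 (no new token id)
print(len(tokenizer) == vocab_size_before)    # expected: True -> no embedding resize needed

# The special token is split out before normal tokenization, so it stays one piece.
print(tokenizer.tokenize("Hi there ^!"))
```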

@le1nux requested review from flxst and removed the request for mali-git · August 22, 2024 12:03
Labels: enhancement (New feature or request)
4 participants