Instruction-tuning Support #196

Open · lllAlexanderlll wants to merge 81 commits into base: main

Conversation

lllAlexanderlll (Contributor)

What does this PR do?

This PR adds support for instruction tuning by:

  1. Introducing a new entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml, which takes an instruction dataset and converts the structured conversations into a single prompt by applying the chat template given as a Jinja2 template string within the config. Here, we also include indicator tokens to mark which parts are the system utterances (see the sketch after this list).
  2. In the modalities training entry point, you can now wrap the collate function in a "LossMaskingCollateFn", which first executes the wrapped collate function and then applies loss masking on each target as specified in the config. This allows including only the tokens that are part of the assistant answer in the loss, so that the model learns to act as a helpful assistant.
  3. Modifying the PackedMemMapDatasetContinuous to allow not re-using the last target token, as this is not wanted in instruction tuning, where we apply truncation and packing.
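
The following is a minimal Python sketch of what step 1 does conceptually: a Jinja2 chat template is rendered over a structured conversation, and indicator tokens delimit the assistant answer so that the loss masking from step 2 can find it later. The template string, the role names, and the indicator tokens "^"/"$" are illustrative assumptions, not the PR's actual defaults.

```python
# Sketch only: render a chat template over one structured conversation.
# Assumptions: role names, template string, and the indicator tokens "^"/"$".
from jinja2 import Template

chat_template = Template(
    "{% for turn in conversation %}"
    "{% if turn.role == 'assistant' %}"
    "^{{ turn.content }}$\n"                  # indicator tokens delimit the assistant answer
    "{% else %}"
    "{{ turn.role }}: {{ turn.content }}\n"
    "{% endif %}"
    "{% endfor %}"
)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is instruction tuning?"},
    {"role": "assistant", "content": "Fine-tuning a model on instruction data."},
]

# The rendered string is what would be stored as the new "chat" attribute in the output JSONL.
print(chat_template.render(conversation=conversation))
```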

General Changes

  • New entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml to convert structured JSONL into JSONL with the new attribute "chat", i.e., the prompt where the chat template was applied
  • A wrapper for collate functions to include only the tokens which appear between indicator tokens in the loss (see the sketch after this list)
  • A new parameter for the PackedMemMapDatasetContinuous to allow not re-using the last target token
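
Below is a sketch of the idea behind the collate-function wrapper. It is not the PR's LossMaskingCollateFn: the key names, the indicator token ids, and the -100 ignore index are assumptions. The wrapped collate function runs first; afterwards, every target token that is not inside a span delimited by the begin/end indicator tokens is excluded from the loss.

```python
# Sketch only: wrap an existing collate_fn and mask non-assistant targets.
# Assumptions: batch layout, key name "target_ids", and ignore_index = -100.
import torch

def loss_masking_collate(batch, wrapped_collate_fn, begin_id: int, end_id: int,
                         target_key: str = "target_ids", ignore_index: int = -100):
    collated = wrapped_collate_fn(batch)       # e.g. {"input_ids": ..., "target_ids": ...}
    targets = collated[target_key]

    masked = torch.full_like(targets, ignore_index)
    for row in range(targets.size(0)):
        inside = False
        for col in range(targets.size(1)):
            token = int(targets[row, col])
            if token == begin_id:
                inside = True                  # begin indicator: assistant answer starts
            elif token == end_id:
                inside = False                 # end indicator: assistant answer ends
            elif inside:
                masked[row, col] = token       # keep only assistant tokens in the loss
    collated[target_key] = masked
    return collated
```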

Breaking Changes

  • None, as the default value for PackedMemMapDatasetContinuous.reuse_last_target is True
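
For concreteness, here is a small sketch of what re-using (or not re-using) the last target token means for continuous packing, under the assumption of a shift-by-one input/target split; the stride logic in the actual PackedMemMapDatasetContinuous may differ in details.

```python
# Sketch only: continuous packing with and without re-using the last target token.
def make_blocks(token_ids, block_size, reuse_last_target=True):
    # Each sample holds block_size tokens; inputs and targets are shifted by one.
    # With re-use, the last target token of a sample is also the first input token
    # of the next sample (stride block_size - 1); without re-use, samples are disjoint.
    stride = block_size - 1 if reuse_last_target else block_size
    samples, start = [], 0
    while start + block_size <= len(token_ids):
        block = token_ids[start:start + block_size]
        samples.append((block[:-1], block[1:]))   # (inputs, targets)
        start += stride
    return samples

tokens = list(range(12))
print(make_blocks(tokens, block_size=4, reuse_last_target=True))   # blocks overlap by one token
print(make_blocks(tokens, block_size=4, reuse_last_target=False))  # blocks are disjoint
```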

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)

rrutmann and others added 30 commits July 15, 2024 13:18
…data. Change symbol for special tokens, which are actually a single token within the vocab.
Co-authored-by: Alexander Weber <[email protected]>

When we add a special token that already exists within the tokenizer vocabulary, HF only marks it as a special token (adds it to the trie).
This means that if the sequence we add as a special token already exists in the vocab, there is no need to resize the embedding matrix!
@le1nux (Member) commented on Aug 13, 2024:

So basically we represent the special token by at least one of the original tokens in the vocabulary?

For example, assuming "spec_tok_1" is supposed to be a special token, it gets tokenized to, e.g., ["spec", "_", "tok", "_", "1"]? Since it has been added to the trie, we still know that it is a special token, but during training and inference the special token would still be represented by a 5-token sequence, and therefore the embedding matrix can remain unchanged, right?

What happens if the special token cannot be represented?

@lllAlexanderlll (Contributor, Author) replied:

Not quite.
Given spec_tok_1 as a special token, which is part of the trie, and given the text to tokenize being Hi there spec_tok_1!:
First, the split method of the trie is used to extract all special tokens. In our case, the split would look like this: ["Hi there", "spec_tok_1", "!"]. Then all strings that are not part of the trie, i.e., the normal tokens, are tokenized, leading to, e.g., ["Hi", "there", "spec_tok_1", "!"].
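
A small sketch of the behavior described above, using Hugging Face transformers. The model name gpt2 and the token string "^" are assumptions; "^" is used because it already exists as a single token in the GPT-2 byte-level vocabulary, and exact return values may vary across transformers versions.

```python
# Sketch only: marking an existing vocabulary entry as a special token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab_size_before = len(tokenizer)

# "^" is already a single token in the vocab, so no new id should be created;
# the tokenizer only registers it as a special token (adds it to its trie /
# added-token machinery).
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["^"]})

print(num_added)                              # expected: 0 (no new token id)
print(len(tokenizer) == vocab_size_before)    # expected: True -> no embedding resize needed

# The special token is split out before normal tokenization, so it stays one piece.
print(tokenizer.tokenize("Hi there ^!"))
```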

@le1nux requested review from flxst and removed the request for mali-git · August 22, 2024 12:03
Labels: enhancement (New feature or request)
4 participants