Instruction-tuning Support #196
base: main
Conversation
…y stopping of generation
…y. Unit test still needed.
…data. Change symbol for special tokens, which are actually a single token within the vocab.
Co-authored-by: Alexander Weber <[email protected]>
SFT sample generator
Co-authored-by: Max Lübbering <[email protected]>
When we add a special token that already exists within the tokenizer vocabulary, HF only marks it as a special token (adds it to the trie).
This means that if the sequence we add as a special token already exists in the vocab, there is no need to resize the embedding matrix!
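To illustrate, a minimal sketch using the Hugging Face transformers API with the GPT-2 tokenizer; the token "^" is only a stand-in for the actual indicator token:

```python
# Minimal sketch: marking an already-existing single token as "special"
# does not grow the vocabulary (assumes `transformers`; "^" is illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
size_before = len(tokenizer)

# "^" is already a single token in the GPT-2 vocabulary, so this call only
# registers it as a special token (i.e. adds it to the tokenizer's trie).
tokenizer.add_special_tokens({"additional_special_tokens": ["^"]})

# Vocabulary size is unchanged, hence no embedding-matrix resize is needed.
print(size_before, len(tokenizer))
```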
So basically we represent the special token by at least one of the original tokens in the vocabulary? For example, assuming "spec_tok_1" is supposed to be a special token, it gets tokenized to e.g. ["spec", "_", "tok", "_", "1"]? Since it has been added to the trie, we still know that it is a special token, but during training and inference the special token would still be represented by a 5-token sequence, and therefore the embedding matrix can remain unchanged, right?
What happens if the special token cannot be represented?
Not quite. Given spec_tok_1 as a special token, which is part of the trie, and given the text to tokenize being Hi there spec_tok_1!:
First, the split method of the trie is used to extract all special tokens. In our case, the split would look like this: ["Hi there", "spec_tok_1", "!"]. Then all strings that are not part of the trie, i.e. the normal tokens, are tokenized, leading to e.g. ["Hi", "there", "spec_tok_1", "!"].
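A minimal sketch of this split-then-tokenize behaviour, again assuming transformers and the GPT-2 tokenizer (note that spec_tok_1 is not a single GPT-2 token, so unlike above it does get its own id here, but the splitting works the same way):

```python
# Minimal sketch of the split-then-tokenize behaviour described above
# (assumes `transformers`; "spec_tok_1" is the example from this thread).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["spec_tok_1"]})

# The trie splits out the special token first, so "spec_tok_1" is never broken
# into sub-tokens; only the surrounding text goes through normal tokenization.
print(tokenizer.tokenize("Hi there spec_tok_1!"))
# "spec_tok_1" shows up intact in the output, next to the normally tokenized
# pieces of "Hi there " and "!".
```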
…nt to convert, split, and create idx and pdbin files per data partition
What does this PR do?
This PR adds support for instruction tuning by adding the entry point
data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml
which takes an instruction dataset and converts the structured conversations into a single prompt by applying the chat template given as a jinja2 template string within the config. Here, we also include indicator tokens to mark what the system utterances are.
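For illustration, a minimal sketch of such a templating step using jinja2 directly; the template string, role names, and indicator characters below are made up and not the ones shipped with the config:

```python
# Minimal, self-contained sketch of applying a jinja2 chat template to one
# structured conversation. The template string, role names, and indicator
# characters ("^", "$") are illustrative, not the ones from the actual config.
from jinja2 import Template

chat_template = Template(
    "{% for turn in conversation %}"
    "{% if turn.role == 'assistant' %}^{{ turn.content }}$\n"
    "{% else %}{{ turn.role }}: {{ turn.content }}\n{% endif %}"
    "{% endfor %}"
)

conversation = [
    {"role": "user", "content": "What is instruction tuning?"},
    {"role": "assistant", "content": "Fine-tuning a model on instruction-response pairs."},
]

# Renders the structured conversation into the single prompt string that would
# be stored under the "chat" attribute; the indicator characters mark where
# the assistant utterances start and end.
print(chat_template.render(conversation=conversation))
```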
General Changes

data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml
to convert structured JSONL into JSONL with the new attribute "chat", i.e. the prompt where the chat template was applied

Breaking Changes
PackedMemMapDatasetContinuous.reuse_last_target is True
Checklist before submitting final PR
python tests/tests.py