Adding GPT2 Tokenizer for WordTokenizers' Pretrained tokenizers #61
base: master
Conversation
```julia
tokens = tokenize("I love julia language", gpt2_tokenizer)
@test ids_from_tokens(tokens, gpt2_tokenizer) == [40, 1842, 474, 43640, 3303]
@test sentence_from_tokens_gpt2(tokens) == "I love julia language"
end
```
Maybe test a few more edge cases, rather than just the base case?
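A sketch of the kind of extra cases that could be covered. The strings and round-trip checks below are illustrative, not verified against the actual GPT2 vocabulary; byte-level BPE should reconstruct the input exactly, but behaviour for things like the empty string depends on the implementation, so adapt as needed:

```julia
@testset "GPT2 tokenizer edge cases" begin
    # Round-trip properties avoid hard-coding token ids from the vocabulary.
    cases = ["Hello, world!",            # punctuation
             "  leading and trailing  ", # whitespace handling
             "naïve café ☕",             # non-ASCII / multi-byte characters
             "x"^600]                    # one very long "word"
    for text in cases
        tokens = tokenize(text, gpt2_tokenizer)
        @test sentence_from_tokens_gpt2(tokens) == text
        @test length(ids_from_tokens(tokens, gpt2_tokenizer)) == length(tokens)
    end
end
```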
""" | ||
function load(path; unk_token="<unk>") | ||
""" | ||
function load_sp(path; unk_token="<unk>") |
The load function in the GPT2 tokenizer was overriding this function, so I changed it to separate functions that can be called by the main load method. I'm not sure whether this approach is optimal for performance.
I think it is fine to use load_sp (better to call it load_spu, for SentencePiece unigram) and load_gpt2, so that we can call them from the main load.
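A minimal sketch of that layout. Here lookup_path is a hypothetical path resolver and the exported constants are assumed to work as lookup keys; the real package may resolve the pretrained files differently:

```julia
# Hypothetical sketch: one public entry point that forwards to a
# format-specific loader, so neither loader shadows the other.
function load(name; unk_token="<unk>")
    path = lookup_path(name)                        # hypothetical path resolver
    if name in (ALBERT_V1, ALBERT_V2)
        return load_spu(path; unk_token=unk_token)  # SentencePiece unigram model
    elseif name == GPT2
        return load_gpt2(path)                      # byte-level BPE vocab + merges
    else
        error("Unknown pretrained tokenizer: $name")
    end
end
```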
src/WordTokenizers.jl
Outdated
```diff
@@ -17,7 +17,7 @@ export poormans_tokenize, punctuation_space_tokenize,
        set_tokenizer, set_sentence_splitter,
        rev_tokenize, rev_detokenize,
        toktok_tokenize
-export ALBERT_V1, ALBERT_V2, load, tokenizer, sentence_from_tokens, ids_from_tokens
+export ALBERT_V1, ALBERT_V2, load, tokenizer, sentence_from_tokens, ids_from_tokens, GPT2, GPT2Tokenizer, tokenize, sentence_from_tokens_gpt2
```
Are all these exports needed? In particular, I'm afraid that the name GPT2 here will clash with the actual model whenever it is implemented.
Yeah, you're right. There might be a better alternative for this. It is consistent with ALBERT_V1 and goes with load(ALBERT_V1), so I did this. I just realised there's no need to export GPT2Tokenizer as well; I'll correct it.
It would be better to have common APIs for all the statistical tokenizers. For instance, sentence_from_tokens can be shared between ALBERT and GPT2.
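For example, multiple dispatch would let one generic function cover both tokenizers instead of a separate sentence_from_tokens_gpt2. The type names and the gpt2_detokenize helper below are illustrative, not the package's actual API:

```julia
# Illustrative only: dispatch on the tokenizer type so callers always use
# the same generic function, whatever the underlying model.
sentence_from_tokens(tokens, tkr::SentencePieceTokenizer) =
    String(strip(replace(join(tokens), "▁" => " ")))  # undo SentencePiece's word-boundary marker

sentence_from_tokens(tokens, tkr::GPT2Tokenizer) =
    gpt2_detokenize(tokens)                           # hypothetical byte-level BPE decoder
```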
@tejasvaidhyadev Can I make all the APIs follow the format f(tokens/text, tokenizer(spm/gpt2)) instead of f(tokenizer(spm/gpt2), tokens/text)? I feel it might be more intuitive for users this way.
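Assuming that proposed signature, the data-first order also composes naturally with broadcasting over many sentences (tokenizer held fixed via Ref):

```julia
sentences   = ["I love julia language", "GPT2 is a byte-level BPE tokenizer"]
token_lists = tokenize.(sentences, Ref(gpt2_tokenizer))           # broadcast over texts
id_lists    = ids_from_tokens.(token_lists, Ref(gpt2_tokenizer))  # tokenizer treated as a scalar
```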
I don't know why it is getting this build error on julia_version=1.1. @aviks @Ayushk4 @oxinabox, help needed.
Hello everyone,
This is a PR adding a GPT2 tokenizer to the pretrained tokenizers in WordTokenizers.jl. This might be helpful in the future when building an end-to-end pipeline on top of a GPT2 model in Julia.
Though I have added tests, suggestions/corrections would be helpful :)