Adding GPT2 Tokenizer for WordTokenizers' Pretrained tokenizers #61
base: master
Conversation
```julia
tokens = tokenize("I love julia language", gpt2_tokenizer)
@test ids_from_tokens(tokens, gpt2_tokenizer) == [40, 1842, 474, 43640, 3303]
@test sentence_from_tokens_gpt2(tokens) == "I love julia language"
end
```
Maybe test a few more edge cases, rather than just the base case?
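A sketch of the kind of extra cases that could be covered. The strings and round-trip checks below are illustrative, not verified against the actual GPT2 vocabulary; byte-level BPE should reconstruct the input exactly, but behaviour for things like the empty string depends on the implementation, so adapt as needed:

```julia
@testset "GPT2 tokenizer edge cases" begin
    # Round-trip properties avoid hard-coding token ids from the vocabulary.
    cases = ["Hello, world!",            # punctuation
             "  leading and trailing  ", # whitespace handling
             "naïve café ☕",             # non-ASCII / multi-byte characters
             "x"^600]                    # one very long "word"
    for text in cases
        tokens = tokenize(text, gpt2_tokenizer)
        @test sentence_from_tokens_gpt2(tokens) == text
        @test length(ids_from_tokens(tokens, gpt2_tokenizer)) == length(tokens)
    end
end
```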
""" | ||
function load(path; unk_token="<unk>") | ||
""" | ||
function load_sp(path; unk_token="<unk>") |
The load function in the GPT2 tokenizer was overriding this function, so I changed it to separate functions that can be called by the main load method. I'm not sure whether this approach is optimal for performance.
I think it is fine to use load_sp (better to call it load_spu, for SentencePiece unigram) and load_gpt2, so that we can call them from the main load.
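A minimal sketch of that layout. Here lookup_path is a hypothetical path resolver and the exported constants are assumed to work as lookup keys; the real package may resolve the pretrained files differently:

```julia
# Hypothetical sketch: one public entry point that forwards to a
# format-specific loader, so neither loader shadows the other.
function load(name; unk_token="<unk>")
    path = lookup_path(name)                        # hypothetical path resolver
    if name in (ALBERT_V1, ALBERT_V2)
        return load_spu(path; unk_token=unk_token)  # SentencePiece unigram model
    elseif name == GPT2
        return load_gpt2(path)                      # byte-level BPE vocab + merges
    else
        error("Unknown pretrained tokenizer: $name")
    end
end
```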
src/WordTokenizers.jl
Outdated
```diff
@@ -17,7 +17,7 @@ export poormans_tokenize, punctuation_space_tokenize,
        set_tokenizer, set_sentence_splitter,
        rev_tokenize, rev_detokenize,
        toktok_tokenize
-export ALBERT_V1, ALBERT_V2, load, tokenizer, sentence_from_tokens, ids_from_tokens
+export ALBERT_V1, ALBERT_V2, load, tokenizer, sentence_from_tokens, ids_from_tokens, GPT2, GPT2Tokenizer, tokenize, sentence_from_tokens_gpt2
```
Are all these exports needed? In particular, I'm afraid that the name GPT2 here will clash with the actual model whenever it is implemented.
Yeah, you're right. There might be a better alternative for this. It is consistent with ALBERT_V1 and goes with load(ALBERT_V1), so I did this. I just realised there's no need to export GPT2Tokenizer as well; I'll correct it.
It would be better to have common APIs for all the statistical tokenizers. For instance, sentence_from_tokens can be shared between ALBERT and GPT2.
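For example, multiple dispatch would let one generic function cover both tokenizers instead of a separate sentence_from_tokens_gpt2. The type names and the gpt2_detokenize helper below are illustrative, not the package's actual API:

```julia
# Illustrative only: dispatch on the tokenizer type so callers always use
# the same generic function, whatever the underlying model.
sentence_from_tokens(tokens, tkr::SentencePieceTokenizer) =
    String(strip(replace(join(tokens), "▁" => " ")))  # undo SentencePiece's word-boundary marker

sentence_from_tokens(tokens, tkr::GPT2Tokenizer) =
    gpt2_detokenize(tokens)                           # hypothetical byte-level BPE decoder
```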
@tejasvaidhyadev Can I make all the APIs follow the format f(tokens/text, tokenizer(spm/gpt2)) instead of f(tokenizer(spm/gpt2), tokens/text)? I feel it might be more intuitive for users this way.
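Assuming that proposed signature, the data-first order also composes naturally with broadcasting over many sentences (tokenizer held fixed via Ref):

```julia
sentences   = ["I love julia language", "GPT2 is a byte-level BPE tokenizer"]
token_lists = tokenize.(sentences, Ref(gpt2_tokenizer))           # broadcast over texts
id_lists    = ids_from_tokens.(token_lists, Ref(gpt2_tokenizer))  # tokenizer treated as a scalar
```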
I don't know why it is getting this build error on julia_version=1.1. @aviks @Ayushk4 @oxinabox, help needed.
Hello everyone,
This is a PR adding a GPT2 tokenizer to the pretrained tokenizers in WordTokenizers.jl. This might be helpful in the future when building an end-to-end pipeline on top of a GPT2 model in Julia.
Though I have added tests, suggestions/corrections would be helpful :)