
customizing chunking function #111

Open
DeoLeung opened this issue Dec 20, 2024 · 2 comments · May be fixed by #113

Comments

@DeoLeung

The get_chunks function is not exposed; it calls tiktoken and passes only the tokens into the customized chunking function.

I think the original text should be passed into the chunking function instead, so we could use a custom tokenizer based on the embedding model.
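A minimal sketch of what such a text-accepting chunking function could look like, assuming a HuggingFace tokenizer matched to the embedding model (the function name, signature, and model choice here are hypothetical, not the library's actual API):

```python
from transformers import AutoTokenizer

# Hypothetical sketch: the chunking function receives the raw text, so the
# embedding model's own tokenizer can be used instead of tiktoken.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

def chunk_by_custom_tokenizer(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows measured by the embedding tokenizer."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
    return chunks
```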

@rangehow
Collaborator

It should be added that the length of a chunk is sensitive to both the embedding model and the LLM (API), so it’s hard to say there’s a necessity to stick closely to the implementation of a specific tokenizer. However, providing an interface to measure chunk length would indeed be helpful for offering a degree of user customization. You could try implementing that.
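A rough sketch of such a length-measurement interface, assuming the token counter is injected as a callable (the names tiktoken_counter and the count_tokens parameter are illustrative additions, not existing API):

```python
import tiktoken
from typing import Callable

# Hypothetical interface: chunk length is measured by an injected callable,
# so users can swap in any tokenizer without changing the chunking logic.
TokenCounter = Callable[[str], int]

def tiktoken_counter(text: str) -> int:
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def get_chunks(text: str, max_len: int = 512,
               count_tokens: TokenCounter = tiktoken_counter) -> list[str]:
    # Greedily pack paragraphs into chunks of at most max_len tokens.
    # (A single oversized paragraph is kept whole in this sketch.)
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if count_tokens(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```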

@DeoLeung
Author

> It should be added that the length of a chunk is sensitive to both the embedding model and the LLM (API), so it’s hard to say there’s a necessity to stick closely to the implementation of a specific tokenizer. However, providing an interface to measure chunk length would indeed be helpful for offering a degree of user customization. You could try implementing that.

I did a quick PR; if you're happy with the change, I'll fix the examples and docs as well.

Further, I'm thinking that a dataclass or Pydantic model should be added for better input/output hints and validation, along the lines of the sketch below.
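For instance, a Pydantic model over the chunking function's output; the field names here are illustrative guesses, not the project's actual data model:

```python
from pydantic import BaseModel, Field

# Hypothetical output schema for a chunking function: validates types and
# ranges, and documents the expected shape for custom implementations.
class TextChunk(BaseModel):
    content: str
    tokens: int = Field(ge=0, description="Length under the configured token counter")
    chunk_order_index: int = Field(ge=0)
    full_doc_id: str
```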
