
customizing chunking function #111

Open
DeoLeung opened this issue Dec 20, 2024 · 2 comments · May be fixed by #113

Comments

@DeoLeung

The get_chunks function is not exposed; it calls tiktoken and passes only the tokens into the customized chunking function.

I think the original text should be passed into the chunking function instead, so we could use a custom tokenizer based on the embedding model.
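A minimal sketch of what such a text-accepting chunking function could look like, assuming a HuggingFace tokenizer matched to the embedding model (the function name, signature, and model choice here are hypothetical, not the library's actual API):

```python
from transformers import AutoTokenizer

# Hypothetical sketch: the chunking function receives the raw text, so the
# embedding model's own tokenizer can be used instead of tiktoken.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

def chunk_by_custom_tokenizer(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows measured by the embedding tokenizer."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
    return chunks
```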

@rangehow
Collaborator

It should be added that the length of a chunk is sensitive to both the embedding model and the LLM (API), so it’s hard to say there’s a necessity to stick closely to the implementation of a specific tokenizer. However, providing an interface to measure chunk length would indeed be helpful for offering a degree of user customization. You could try implementing that.
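A rough sketch of such a length-measurement interface, assuming the token counter is injected as a callable (the names tiktoken_counter and the count_tokens parameter are illustrative additions, not existing API):

```python
import tiktoken
from typing import Callable

# Hypothetical interface: chunk length is measured by an injected callable,
# so users can swap in any tokenizer without changing the chunking logic.
TokenCounter = Callable[[str], int]

def tiktoken_counter(text: str) -> int:
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def get_chunks(text: str, max_len: int = 512,
               count_tokens: TokenCounter = tiktoken_counter) -> list[str]:
    # Greedily pack paragraphs into chunks of at most max_len tokens.
    # (A single oversized paragraph is kept whole in this sketch.)
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if count_tokens(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```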

@DeoLeung
Author

> It should be added that the length of a chunk is sensitive to both the embedding model and the LLM (API), so it’s hard to say there’s a necessity to stick closely to the implementation of a specific tokenizer. However, providing an interface to measure chunk length would indeed be helpful for offering a degree of user customization. You could try implementing that.

I did a quick PR; if you're happy with the change, I'll fix the examples and docs as well.

Further, I'm thinking that a dataclass or Pydantic model should be added for better input/output hints and validation, along the lines of the sketch below.
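For instance, a Pydantic model over the chunking function's output; the field names here are illustrative guesses, not the project's actual data model:

```python
from pydantic import BaseModel, Field

# Hypothetical output schema for a chunking function: validates types and
# ranges, and documents the expected shape for custom implementations.
class TextChunk(BaseModel):
    content: str
    tokens: int = Field(ge=0, description="Length under the configured token counter")
    chunk_order_index: int = Field(ge=0)
    full_doc_id: str
```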
