It should be added that chunk length is sensitive to both the embedding model and the LLM (API), so there's no strict need to mirror the implementation of a specific tokenizer. However, providing an interface for measuring chunk length would indeed be helpful, since it gives users a degree of customization. You could try implementing that.
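A minimal sketch of what such an interface could look like; the `length_function` parameter and `chunk_by_length` helper are hypothetical names for illustration, not part of the current API:

```python
from typing import Callable, List

def chunk_by_length(
    text: str,
    max_length: int,
    length_function: Callable[[str], int] = len,  # default: character count
) -> List[str]:
    """Greedily group paragraphs into chunks whose measured length stays under max_length."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if length_function(candidate) <= max_length:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

# Token-based measurement via tiktoken, if that's what the user wants:
# import tiktoken
# enc = tiktoken.get_encoding("cl100k_base")
# chunks = chunk_by_length(text, 1024, length_function=lambda s: len(enc.encode(s)))
```

This way the library stays agnostic about what "length" means, and users can plug in whichever tokenizer matches their embedding model or LLM.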
I did a quick PR; if you're happy with the change, I'll fix the examples and docs as well.
Further, I'm thinking a dataclass or Pydantic model should be added to provide better input/output type hints and validation.
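A rough sketch of what that could look like; the field names here are illustrative and would need to match the project's actual chunking schema:

```python
from pydantic import BaseModel, Field

class ChunkingInput(BaseModel):
    # Illustrative fields -- the real schema depends on the project's API.
    content: str = Field(..., min_length=1, description="Raw text to be chunked")
    max_token_size: int = Field(1024, gt=0, description="Upper bound on chunk size")
    overlap_token_size: int = Field(128, ge=0, description="Overlap between adjacent chunks")

class Chunk(BaseModel):
    content: str
    tokens: int = Field(..., ge=0)
    chunk_order_index: int = Field(..., ge=0)
```

With this, malformed inputs (e.g. a negative `max_token_size`) fail fast with a clear validation error instead of surfacing later inside the chunking loop.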
The `get_chunks` function is not exposed; it calls tiktoken and passes only the tokens into the customized chunking function. I think the original text should be passed into the chunking function instead, so we could use a custom tokenizer based on the embedding model.