Support local embedding engine for michaelfeil/infinity as langchain community #17670
michaelfeil started this conversation in Ideas
Feature request
Infinity is a framework for fast embedding inference. It's a pure Python framework that uses techniques such as dynamic batching, flash-attn2, faster tokenization, and torch.compile to speed up inference.
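For context, here is a minimal sketch of driving the local engine from Python. The import path, the `AsyncEmbeddingEngine` constructor arguments, and the `engine.embed` call are assumptions based on the project README and may differ between Infinity versions:

```python
import asyncio

from infinity_emb import AsyncEmbeddingEngine  # assumed import path

# Assumed constructor signature; check the Infinity README for your version.
engine = AsyncEmbeddingEngine(
    model_name_or_path="BAAI/bge-small-en-v1.5",
    batch_size=64,   # dynamic batching collects concurrent requests up to this size
    engine="torch",  # torch backend, where fp16 / torch.compile apply
)

async def main() -> None:
    sentences = ["Embed this via Infinity.", "And this one too."]
    async with engine:  # starts the background batching loop
        embeddings, usage = await engine.embed(sentences=sentences)
    print(f"{len(embeddings)} embeddings, {usage} tokens processed")

asyncio.run(main())
```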
Motivation
I spend enough compute; let's save this planet some electricity (and time). :)
Flash-attention and torch.compile lead to a ~2x speedup, while async tokenization gives you another ~1.5x. And then there is forgetting to enable fp16.
In summary: with a plain `pip install sentence-transformers` setup that is neither fp16 nor properly batched (i.e., instead of splitting your 1M words into chunks of 32 sorted ascending by length, you send chunks as they come), you spend around 22x longer.
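To make the two usage patterns concrete, here is an illustrative sentence-transformers snippet (the 22x figure above is the author's measurement; the model name and corpus below are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
texts = ["an example sentence"] * 1000  # stand-in for the 1M-word corpus

# Naive baseline: fp32, tiny chunks sent one at a time as they come.
naive = [model.encode(texts[i : i + 4]) for i in range(0, len(texts), 4)]

# Batched + fp16: half-precision weights and one large encode() call,
# so the library can batch and length-sort the inputs internally.
model.half()  # fp16; effective on GPU, may be slow or unsupported on CPU
batched = model.encode(texts, batch_size=32)
```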
Proposal (If applicable)
Integrate this framework into the langchain community package.
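A sketch of what such an integration could look like, implementing the standard `Embeddings` interface from `langchain_core`. The class name is illustrative, and the Infinity calls repeat the assumptions from the sketch above; a real integration would also keep the engine running between calls instead of restarting it:

```python
import asyncio
from typing import List

from langchain_core.embeddings import Embeddings


class InfinityLocalEmbeddings(Embeddings):
    """Illustrative wrapper around a local Infinity engine (not the final API)."""

    def __init__(self, model: str = "BAAI/bge-small-en-v1.5", batch_size: int = 64):
        # Assumed Infinity API; see the project README for the exact names.
        from infinity_emb import AsyncEmbeddingEngine

        self._engine = AsyncEmbeddingEngine(
            model_name_or_path=model, batch_size=batch_size, engine="torch"
        )

    async def _aembed(self, texts: List[str]) -> List[List[float]]:
        # Restarting the engine per call is wasteful; done here only for brevity.
        async with self._engine:
            embeddings, _usage = await self._engine.embed(sentences=texts)
        return [list(map(float, e)) for e in embeddings]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return asyncio.run(self._aembed(texts))

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]
```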