Support local embedding engine for michaelfeil/infinity as langchain community #17670
michaelfeil started this conversation in Ideas
Feature request
Infinity is a framework for fast embedding inference. It's a pure Python framework that uses techniques such as dynamic batching, flash-attn2, faster tokenization, and torch.compile to speed up inference.
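For context, here is a minimal sketch of driving the local engine from Python. The import path, the `AsyncEmbeddingEngine` constructor arguments, and the `engine.embed` call are assumptions based on the project README and may differ between Infinity versions:

```python
import asyncio

from infinity_emb import AsyncEmbeddingEngine  # assumed import path

# Assumed constructor signature; check the Infinity README for your version.
engine = AsyncEmbeddingEngine(
    model_name_or_path="BAAI/bge-small-en-v1.5",
    batch_size=64,   # dynamic batching collects concurrent requests up to this size
    engine="torch",  # torch backend, where fp16 / torch.compile apply
)

async def main() -> None:
    sentences = ["Embed this via Infinity.", "And this one too."]
    async with engine:  # starts the background batching loop
        embeddings, usage = await engine.embed(sentences=sentences)
    print(f"{len(embeddings)} embeddings, {usage} tokens processed")

asyncio.run(main())
```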
Motivation
I spend enough compute; let's save this planet some electricity (and time). :)
Flash-attention and torch.compile lead to a ~2x speedup, while async tokenization gives you another ~1.5x. And then there is forgetting to enable fp16.
In summary: with a plain `pip install sentence-transformers` setup that is neither fp16 nor properly batched (i.e., instead of splitting your 1M words into chunks of 32 sorted ascending by length, you send chunks as they come), you spend around 22x longer.
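To make the two usage patterns concrete, here is an illustrative sentence-transformers snippet (the 22x figure above is the author's measurement; the model name and corpus below are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
texts = ["an example sentence"] * 1000  # stand-in for the 1M-word corpus

# Naive baseline: fp32, tiny chunks sent one at a time as they come.
naive = [model.encode(texts[i : i + 4]) for i in range(0, len(texts), 4)]

# Batched + fp16: half-precision weights and one large encode() call,
# so the library can batch and length-sort the inputs internally.
model.half()  # fp16; effective on GPU, may be slow or unsupported on CPU
batched = model.encode(texts, batch_size=32)
```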
Proposal (If applicable)
Integrate this framework into the langchain community package.
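A sketch of what such an integration could look like, implementing the standard `Embeddings` interface from `langchain_core`. The class name is illustrative, and the Infinity calls repeat the assumptions from the sketch above; a real integration would also keep the engine running between calls instead of restarting it:

```python
import asyncio
from typing import List

from langchain_core.embeddings import Embeddings


class InfinityLocalEmbeddings(Embeddings):
    """Illustrative wrapper around a local Infinity engine (not the final API)."""

    def __init__(self, model: str = "BAAI/bge-small-en-v1.5", batch_size: int = 64):
        # Assumed Infinity API; see the project README for the exact names.
        from infinity_emb import AsyncEmbeddingEngine

        self._engine = AsyncEmbeddingEngine(
            model_name_or_path=model, batch_size=batch_size, engine="torch"
        )

    async def _aembed(self, texts: List[str]) -> List[List[float]]:
        # Restarting the engine per call is wasteful; done here only for brevity.
        async with self._engine:
            embeddings, _usage = await self._engine.embed(sentences=texts)
        return [list(map(float, e)) for e in embeddings]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return asyncio.run(self._aembed(texts))

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]
```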