Improve: Uniform APIs across JS, Py, and Swift

ashvardanian committed Apr 23, 2024
1 parent 4c1ac18 commit 9bf5fe3

Showing 17 changed files with 564 additions and 121 deletions.
52 changes: 20 additions & 32 deletions README.md
@@ -51,13 +51,12 @@ With compact __custom pre-trained transformer models__, this can run anywhere fr

### Embedding Models

| Model | Parameters | Languages | Architecture |
| :--------------------------------------- | ---------: | --------: | -------------------------------------------: |
| [`uform-vl-english-large`][model-e-l] 🆕 | 365M | 1 | 6 text layers, ViT-L/14, 6 multimodal layers |
| [`uform-vl-english`][model-e] | 143M | 1 | 2 text layers, ViT-B/16, 2 multimodal layers |
| [`uform-vl-english-small`][model-e-s] 🆕 | 79M | 1 | 2 text layers, ViT-S/16, 2 multimodal layers |
| [`uform-vl-multilingual-v2`][model-m-v2] | 206M | 21 | 8 text layers, ViT-B/16, 4 multimodal layers |
| [`uform-vl-multilingual`][model-m] | 206M | 12 | 8 text layers, ViT-B/16, 4 multimodal layers |
| Model | Parameters | Languages | Architecture |
| :-------------------------------------------------- | ---------: | --------: | -------------------------------------------: |
| [`uform3-image-text-english-large`][model-e-l] 🆕 | 365M | 1 | 6 text layers, ViT-L/14, 6 multimodal layers |
| [`uform3-image-text-english-base`][model-e] | 143M | 1 | 2 text layers, ViT-B/16, 2 multimodal layers |
| [`uform3-image-text-english-small`][model-e-s] 🆕 | 79M | 1 | 2 text layers, ViT-S/16, 2 multimodal layers |
| [`uform3-image-text-multilingual-base`][model-m-v2] | 206M | 21 | 8 text layers, ViT-B/16, 4 multimodal layers |

[model-e-l]: https://huggingface.co/unum-cloud/uform-vl-english-large/
[model-e]: https://huggingface.co/unum-cloud/uform-vl-english/
@@ -307,34 +306,18 @@ prompt_len = inputs['input_ids'].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```

### Multimodal Chat
### Multimodal Chat in CLI

The generative models can be used for chat-like experiences, where the user can provide both text and images as input.
To use that feature, you can start with the following CLI command:
The generative models can be used for chat-like experiences in the command line.
For that, you can use the `uform-chat` CLI tool, which is available in the UForm package.

```bash
uform-chat --model unum-cloud/uform-gen-chat --image=zebra.jpg
uform-chat --model unum-cloud/uform-gen-chat \
--image="https://bit.ly/3tIVg9M" \
--device="cuda:0" \
--fp16
```

### Multi-GPU

To achieve higher throughput, you can launch UForm on multiple GPUs.
For that, pick the encoder of the model you want to run in parallel (`text_encoder` or `image_encoder`), and wrap it in `nn.DataParallel` (or `nn.DistributedDataParallel`).

```python
import uform

model, processor = uform.get_model('unum-cloud/uform-vl-english')
model_image = nn.DataParallel(model.image_encoder)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_image.to(device)

_, res = model_image(images, 0)
```

```bash
$ pip install uform
$ uform-chat --model unum-cloud/uform-gen2-dpo --image=zebra.jpg
$ uform-chat --model unum-cloud/uform-gen2-dpo \
> --image="https://bit.ly/3tIVg9M" \
> --device="cuda:0" \
> --fp16
```

## Evaluation
@@ -471,3 +454,8 @@ On Apple M2 Arm chips the energy efficiency of inference can exceed that of the
## License

All models come under the same license as the code - Apache 2.0.


TODO:

- [ ] Download the image if a URL is provided
63 changes: 60 additions & 3 deletions javascript/README.md
@@ -1,10 +1,67 @@
# UForm for JavaScript

The UForm multimodal AI SDK offers a simple way to integrate multimodal AI capabilities into your JavaScript applications.
Built around ONNX, the SDK is designed to work with most runtimes and almost any hardware.

## Installation

There are several ways to install the UForm JavaScript SDK from NPM.

```bash
pnpm add uform
npm add uform
yarn add uform
```

## Quick Start

### Embeddings

```js
import assert from 'assert';

import { getModel, Modality } from 'uform';
import { TextProcessor, TextEncoder, ImageEncoder, ImageProcessor } from 'uform';

const { configPath, modalityPaths, tokenizerPath } = await getModel({
modelId: 'unum-cloud/uform3-image-text-english-small',
modalities: [Modality.TextEncoder, Modality.ImageEncoder],
token: null, // Optional Hugging Face token for private models
saveDir: null, // Optional directory to save the model to
});

const textProcessor = new TextProcessor(configPath, tokenizerPath);
await textProcessor.init();
const processedTexts = await textProcessor.process("a small red panda in a zoo");

const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");
await textEncoder.dispose();

const imageProcessor = new ImageProcessor(configPath);
await imageProcessor.init();
const processedImages = await imageProcessor.process("path/to/image.png");

const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");
```

The `textOutput` and `imageOutput` contain `features` and `embeddings` properties, matching those in the Python SDK.
The embeddings can later be compared using cosine similarity or other distance metrics.
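
For a quick sanity check, the two embeddings can be compared directly in JavaScript. The `cosineSimilarity` helper below is hypothetical (not part of the SDK), and the `.cpuData` accessor mirrors the one used in the SDK's test suite.

```js
// Hypothetical helper: cosine similarity between two Float32Array vectors.
function cosineSimilarity(a, b) {
    let dot = 0.0, normA = 0.0, normB = 0.0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Each output holds a single vector, since one text and one image were processed.
const textVector = new Float32Array(textOutput.embeddings.cpuData);
const imageVector = new Float32Array(imageOutput.embeddings.cpuData);
console.log('Text-image similarity:', cosineSimilarity(textVector, imageVector));
```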

### Generative Models

Coming soon ...

## Technical Details

### Faster Search

Depending on the application, the embeddings can be down-cast to smaller numeric representations without losing much recall.
Independent of the quantization level, native JavaScript functionality may be too slow for large-scale search.
In such cases, consider using [USearch][github-usearch] or [SimSIMD][github-simsimd].
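
As a rough sketch of such down-casting, reusing the `textOutput` from the Quick Start above and assuming the embedding values roughly fit the [-1, 1] range, an `f32` embedding can be linearly scaled to `i8` in plain JavaScript before being handed off to a search engine:

```js
// Sketch: linear scaling of an f32 embedding into i8 (assumes values near [-1, 1]).
const f32Embedding = new Float32Array(textOutput.embeddings.cpuData);
const i8Embedding = Int8Array.from(
    f32Embedding,
    (x) => Math.max(-127, Math.min(127, Math.round(x * 127))),
);
```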

[github-usearch]: https://github.com/unum-cloud/usearch
[github-simsimd]: https://github.com/ashvardanian/simsimd
6 changes: 3 additions & 3 deletions javascript/encoders.mjs
@@ -3,7 +3,7 @@ import { InferenceSession, Tensor } from 'onnxruntime-node';
import { PreTrainedTokenizer } from '@xenova/transformers';
import sharp from 'sharp';

import { getCheckpoint, Modality } from "./hub.mjs";
import { getModel, Modality } from "./hub.mjs";

class TextProcessor {

@@ -66,7 +66,7 @@ class TextEncoder {
}
}

async forward(inputs) {
async encode(inputs) {
if (!this.session) {
throw new Error("Session is not initialized.");
}
@@ -191,7 +191,7 @@ class ImageEncoder {
}
}

async forward(images) {
async encode(images) {
if (!this.session) {
throw new Error("Session is not initialized.");
}
18 changes: 9 additions & 9 deletions javascript/encoders_test.js
@@ -4,7 +4,7 @@ import path from 'path';
import assert from 'assert';
import fetch from 'node-fetch';

import { getCheckpoint, Modality } from "./hub.mjs";
import { getModel, Modality } from "./hub.mjs";
import { TextProcessor, TextEncoder, ImageEncoder, ImageProcessor } from "./encoders.mjs";

// Check if the HuggingFace Hub API token is set in the environment variable.
@@ -18,7 +18,7 @@ if (!hf_token) {
}

async function tryGettingCheckpoint(modelId, modalities) {
const { configPath, modalityPaths, tokenizerPath } = await getCheckpoint(
const { configPath, modalityPaths, tokenizerPath } = await getModel(
modelId,
modalities,
hf_token,
@@ -60,7 +60,7 @@ async function testGetCheckpoint() {

async function tryTextEncoderForwardPass(modelId) {
const modalities = [Modality.TextEncoder];
const { configPath, modalityPaths, tokenizerPath } = await getCheckpoint(
const { configPath, modalityPaths, tokenizerPath } = await getModel(
modelId,
modalities,
hf_token,
@@ -73,15 +73,15 @@ async function tryTextEncoderForwardPass(modelId) {

const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.forward(processedTexts);
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");

await textEncoder.dispose();
}

async function tryImageEncoderForwardPass(modelId) {
const modalities = [Modality.ImageEncoder];
const { configPath, modalityPaths } = await getCheckpoint(
const { configPath, modalityPaths } = await getModel(
modelId,
modalities,
hf_token,
@@ -94,7 +94,7 @@ async function tryImageEncoderForwardPass(modelId) {

const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
const imageOutput = await imageEncoder.forward(processedImages);
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");

await imageEncoder.dispose();
@@ -135,7 +135,7 @@ async function fetchImage(url) {
async function tryCrossReferencingImageAndText(modelId) {

const modalities = [Modality.ImageEncoder, Modality.TextEncoder];
const { configPath, modalityPaths, tokenizerPath } = await getCheckpoint(
const { configPath, modalityPaths, tokenizerPath } = await getModel(
modelId,
modalities,
hf_token,
@@ -177,8 +177,8 @@ async function tryCrossReferencingImageAndText(modelId) {
const processedText = await textProcessor.process(text);
const processedImage = await imageProcessor.process(imageBuffer);

const textEmbedding = await textEncoder.forward(processedText);
const imageEmbedding = await imageEncoder.forward(processedImage);
const textEmbedding = await textEncoder.encode(processedText);
const imageEmbedding = await imageEncoder.encode(processedImage);

textEmbeddings.push(new Float32Array(textEmbedding.embeddings.cpuData));
imageEmbeddings.push(new Float32Array(imageEmbedding.embeddings.cpuData));
4 changes: 2 additions & 2 deletions javascript/hub.mjs
@@ -33,7 +33,7 @@ async function ensureDirectoryExists(dirPath) {
}
}

async function getCheckpoint(modelId, modalities, token = null, format = '.onnx', saveDir = './models') {
async function getModel(modelId, modalities, token = null, format = '.onnx', saveDir = './models') {
modalities = normalizeModalities(modalities);

const configNames = ['config.json'];
@@ -101,4 +101,4 @@ async function getCheckpoint(modelId, modalities, token = null, format = '.onnx'
return { configPath, modalityPaths, tokenizerPath };
}

export { getCheckpoint, Modality };
export { getModel, Modality };
124 changes: 124 additions & 0 deletions python/README.md
@@ -0,0 +1,124 @@
# UForm Python SDK

The UForm multimodal AI SDK offers a simple way to integrate multimodal AI capabilities into your Python applications.
The SDK doesn't require any deep learning knowledge, PyTorch, or CUDA installation, and can run on almost any hardware.

## Installation

There are several ways to install the UForm Python SDK, depending on the backend you want to use.
PyTorch is by far the heaviest, but the most capable.
ONNX is a lightweight alternative that can run on any CPU, and on some GPUs.

```bash
pip install "uform[torch]" # For PyTorch
pip install "uform[onnx]" # For ONNX on CPU
pip install "uform[onnx-gpu]" # For ONNX on GPU, available for some platforms
pip install "uform[torch,onnx]" # For PyTorch and ONNX Python tests
```

## Quick Start

### Embeddings

```py
from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

model_name = 'unum-cloud/uform3-image-text-english-small'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'

# Download the image
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

# The actual inference
image_data = processor_image(image)
text_data = processor_text(text)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
```
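
As a minimal sketch, assuming the PyTorch backend where `encode` returns `torch` tensors, the two embeddings above can be compared with cosine similarity:

```python
import numpy as np

# Hedged sketch: cosine similarity between the image and text embeddings above,
# assuming the PyTorch backend, where `encode` returns torch tensors.
image_vector = image_embedding.detach().cpu().numpy().flatten()
text_vector = text_embedding.detach().cpu().numpy().flatten()
similarity = np.dot(image_vector, text_vector) / (
    np.linalg.norm(image_vector) * np.linalg.norm(text_vector)
)
print(f'Cosine similarity: {similarity:.3f}')
```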

### Generative Models

## Technical Details

### Down-casting, Quantization, Matryoshka, and Slicing

Depending on the application, the embeddings can be down-cast to smaller numeric representations without losing much recall.
Switching from `f32` to `f16` is recommended in almost all cases, unless you are running on very old hardware without half-precision support.
Switching to `i8` with linear scaling is also possible, but the recall loss becomes noticeable on larger collections with millions of searchable entries.
Similarly, for higher-dimensional embeddings (512 or 768), a common strategy is to quantize them into single-bit representations for faster search.

```python
import numpy as np

f32_embedding: np.ndarray = model_text.encode(text_data, return_features=False).detach().cpu().numpy()
f16_embedding: np.ndarray = f32_embedding.astype(np.float16)
i8_embedding: np.ndarray = (f32_embedding * 127).astype(np.int8)
b1_embedding: np.ndarray = np.packbits((f32_embedding > 0).astype(np.uint8))
```

An alternative approach to quantization is to use Matryoshka embeddings, where the embeddings are sliced into smaller parts, and the search is performed in a hierarchical manner.

```python
import numpy as np

large_embedding: np.ndarray = model_text.encode(text_data, return_features=False).detach().cpu().numpy()
small_embedding: np.ndarray = large_embedding[:, :256]
tiny_embedding: np.ndarray = large_embedding[:, :64]
```

Both approaches are natively supported by the [USearch][github-usearch] vector-search engine and the [SimSIMD][github-simsimd] numerics library.
When dealing with small collections (up to millions of entries) and looking for low-latency cosine distance calculations, you can [achieve 5x-2500x performance improvement][report-simsimd] over Torch, NumPy, SciPy, and vanilla Python using SimSIMD.

```python
from simsimd import cosine, hamming

distance: float = cosine(f32_embedding, f32_embedding) # 32x SciPy performance on Apple M2 CPU
distance: float = cosine(f16_embedding, f16_embedding) # 79x SciPy performance on Apple M2 CPU
distance: float = cosine(i8_embedding, i8_embedding) # 133x SciPy performance on Apple M2 CPU
distance: float = hamming(b1_embedding, b1_embedding) # 17x SciPy performance on Apple M2 CPU
```

Similarly, when dealing with large collections (up to billions of entries per server) and looking for high-throughput search, you can [achieve 100x performance improvement][report-usearch] over FAISS and other vector-search solutions using USearch.
Here are a couple of examples:

```python
from usearch.index import Index

f32_index = Index(ndim=64, metric='cos', dtype='f32') # for Matryoshka embeddings
f16_index = Index(ndim=64, metric='cos', dtype='f16') # for Matryoshka embeddings
i8_index = Index(ndim=256, metric='cos', dtype='i8') # for quantized embeddings
b1_index = Index(ndim=768, metric='hamming', dtype='b1') # for binary embeddings
```

[github-usearch]: https://github.com/unum-cloud/usearch
[github-simsimd]: https://github.com/ashvardanian/simsimd
[report-usearch]: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search-with-intel
[report-simsimd]: https://ashvardanian.com/posts/python-c-assembly-comparison/

### Multi-GPU Parallelism

To achieve higher throughput, you can launch UForm on multiple GPUs.
For that, pick the encoder of the model you want to run in parallel and wrap it in `nn.DataParallel` (or `nn.DistributedDataParallel`).

```python
from torch import nn

from uform import get_model, Modality

processors, encoders = get_model('unum-cloud/uform-vl-english-small', backend='torch', device='gpu')

encoder_image = encoders[Modality.IMAGE_ENCODER]
encoder_image = nn.DataParallel(encoder_image)

# `images` is a pre-processed batch, as produced by the image processor in the Quick Start
_, res = encoder_image(images, 0)
```