Improve: Uniform APIs across JS, Py, and Swift

ashvardanian committed Apr 23, 2024
1 parent 4c1ac18 commit 9bf5fe3

Showing 17 changed files with 564 additions and 121 deletions.
52 changes: 20 additions & 32 deletions README.md
@@ -51,13 +51,12 @@ With compact __custom pre-trained transformer models__, this can run anywhere fr

### Embedding Models

| Model | Parameters | Languages | Architecture |
| :--------------------------------------- | ---------: | --------: | -------------------------------------------: |
| [`uform-vl-english-large`][model-e-l] 🆕 | 365M | 1 | 6 text layers, ViT-L/14, 6 multimodal layers |
| [`uform-vl-english`][model-e] | 143M | 1 | 2 text layers, ViT-B/16, 2 multimodal layers |
| [`uform-vl-english-small`][model-e-s] 🆕 | 79M | 1 | 2 text layers, ViT-S/16, 2 multimodal layers |
| [`uform-vl-multilingual-v2`][model-m-v2] | 206M | 21 | 8 text layers, ViT-B/16, 4 multimodal layers |
| [`uform-vl-multilingual`][model-m] | 206M | 12 | 8 text layers, ViT-B/16, 4 multimodal layers |
| Model | Parameters | Languages | Architecture |
| :-------------------------------------------------- | ---------: | --------: | -------------------------------------------: |
| [`uform3-image-text-english-large`][model-e-l] 🆕 | 365M | 1 | 6 text layers, ViT-L/14, 6 multimodal layers |
| [`uform3-image-text-english-base`][model-e] | 143M | 1 | 2 text layers, ViT-B/16, 2 multimodal layers |
| [`uform3-image-text-english-small`][model-e-s] 🆕 | 79M | 1 | 2 text layers, ViT-S/16, 2 multimodal layers |
| [`uform3-image-text-multilingual-base`][model-m-v2] | 206M | 21 | 8 text layers, ViT-B/16, 4 multimodal layers |

[model-e-l]: https://huggingface.co/unum-cloud/uform-vl-english-large/
[model-e]: https://huggingface.co/unum-cloud/uform-vl-english/
@@ -307,34 +306,18 @@ prompt_len = inputs['input_ids'].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```

### Multimodal Chat
### Multimodal Chat in CLI

The generative models can be used for chat-like experiences, where the user can provide both text and images as input.
To use that feature, you can start with the following CLI command:
The generative models can be used for chat-like experiences in the command line.
For that, you can use the `uform-chat` CLI tool, which is available in the UForm package.

```bash
uform-chat --model unum-cloud/uform-gen-chat --image=zebra.jpg
uform-chat --model unum-cloud/uform-gen-chat \
--image="https://bit.ly/3tIVg9M" \
--device="cuda:0" \
--fp16
```

### Multi-GPU

To achieve higher throughput, you can launch UForm on multiple GPUs.
For that, pick the encoder of the model you want to run in parallel (`text_encoder` or `image_encoder`), and wrap it in `nn.DataParallel` (or `nn.DistributedDataParallel`).

```python
import uform

model, processor = uform.get_model('unum-cloud/uform-vl-english')
model_image = nn.DataParallel(model.image_encoder)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_image.to(device)

_, res = model_image(images, 0)
```

```bash
$ pip install uform
$ uform-chat --model unum-cloud/uform-gen2-dpo --image=zebra.jpg
$ uform-chat --model unum-cloud/uform-gen2-dpo \
> --image="https://bit.ly/3tIVg9M" \
> --device="cuda:0" \
> --fp16
```

## Evaluation
@@ -471,3 +454,8 @@ On Apple M2 Arm chips the energy efficiency of inference can exceed that of the
## License

All models come under the same license as the code - Apache 2.0.


TODO:

- [ ] Download the image if a URL is provided
63 changes: 60 additions & 3 deletions javascript/README.md
@@ -1,10 +1,67 @@
# UForm for JavaScript

The UForm multimodal AI SDK offers a simple way to integrate multimodal AI capabilities into your JavaScript applications.
Built around ONNX, the SDK is designed to work with most runtimes and almost any hardware.

## Installation

There are several ways to install the UForm JavaScript SDK from NPM.

```bash
pnpm add uform
npm add uform
yarn add uform
```

## Quick Start

### Embeddings

```js
import assert from 'assert';

import { getModel, Modality } from 'uform';
import { TextProcessor, TextEncoder, ImageEncoder, ImageProcessor } from 'uform';

const { configPath, modalityPaths, tokenizerPath } = await getModel({
modelId: 'unum-cloud/uform3-image-text-english-small',
modalities: [Modality.TextEncoder, Modality.ImageEncoder],
token: null, // Optional Hugging Face token for private models
saveDir: null, // Optional directory to save the model to
});

const textProcessor = new TextProcessor(configPath, tokenizerPath);
await textProcessor.init();
const processedTexts = await textProcessor.process("a small red panda in a zoo");

const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");
await textEncoder.dispose();

const imageProcessor = new ImageProcessor(configPath);
await imageProcessor.init();
const processedImages = await imageProcessor.process("path/to/image.png");

const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");
```

The `textOutput` and `imageOutput` contain `features` and `embeddings` properties, matching those in the Python SDK.
The embeddings can later be compared using cosine similarity or other distance metrics.
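
For a quick sanity check, the two embeddings can be compared directly in JavaScript. The `cosineSimilarity` helper below is hypothetical (not part of the SDK), and the `.cpuData` accessor mirrors the one used in the SDK's test suite.

```js
// Hypothetical helper: cosine similarity between two Float32Array vectors.
function cosineSimilarity(a, b) {
    let dot = 0.0, normA = 0.0, normB = 0.0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Each output holds a single vector, since one text and one image were processed.
const textVector = new Float32Array(textOutput.embeddings.cpuData);
const imageVector = new Float32Array(imageOutput.embeddings.cpuData);
console.log('Text-image similarity:', cosineSimilarity(textVector, imageVector));
```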

### Generative Models

Coming soon ...

## Technical Details

### Faster Search

Depending on the application, the embeddings can be down-cast to smaller numeric representations without losing much recall.
Independent of the quantization level, native JavaScript functionality may be too slow for large-scale search.
In such cases, consider using [USearch][github-usearch] or [SimSIMD][github-simsimd].
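
As a rough sketch of such down-casting, reusing the `textOutput` from the Quick Start above and assuming the embedding values roughly fit the [-1, 1] range, an `f32` embedding can be linearly scaled to `i8` in plain JavaScript before being handed off to a search engine:

```js
// Sketch: linear scaling of an f32 embedding into i8 (assumes values near [-1, 1]).
const f32Embedding = new Float32Array(textOutput.embeddings.cpuData);
const i8Embedding = Int8Array.from(
    f32Embedding,
    (x) => Math.max(-127, Math.min(127, Math.round(x * 127))),
);
```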

[github-usearch]: https://github.com/unum-cloud/usearch
[github-simsimd]: https://github.com/ashvardanian/simsimd
6 changes: 3 additions & 3 deletions javascript/encoders.mjs
@@ -3,7 +3,7 @@ import { InferenceSession, Tensor } from 'onnxruntime-node';
import { PreTrainedTokenizer } from '@xenova/transformers';
import sharp from 'sharp';

import { getCheckpoint, Modality } from "./hub.mjs";
import { getModel, Modality } from "./hub.mjs";

class TextProcessor {

@@ -66,7 +66,7 @@ class TextEncoder {
}
}

async forward(inputs) {
async encode(inputs) {
if (!this.session) {
throw new Error("Session is not initialized.");
}
@@ -191,7 +191,7 @@ class ImageEncoder {
}
}

async forward(images) {
async encode(images) {
if (!this.session) {
throw new Error("Session is not initialized.");
}
18 changes: 9 additions & 9 deletions javascript/encoders_test.js
@@ -4,7 +4,7 @@ import path from 'path';
import assert from 'assert';
import fetch from 'node-fetch';

import { getCheckpoint, Modality } from "./hub.mjs";
import { getModel, Modality } from "./hub.mjs";
import { TextProcessor, TextEncoder, ImageEncoder, ImageProcessor } from "./encoders.mjs";

// Check if the HuggingFace Hub API token is set in the environment variable.
@@ -18,7 +18,7 @@ if (!hf_token) {
}

async function tryGettingCheckpoint(modelId, modalities) {
const { configPath, modalityPaths, tokenizerPath } = await getCheckpoint(
const { configPath, modalityPaths, tokenizerPath } = await getModel(
modelId,
modalities,
hf_token,
@@ -60,7 +60,7 @@ async function testGetCheckpoint() {

async function tryTextEncoderForwardPass(modelId) {
const modalities = [Modality.TextEncoder];
const { configPath, modalityPaths, tokenizerPath } = await getCheckpoint(
const { configPath, modalityPaths, tokenizerPath } = await getModel(
modelId,
modalities,
hf_token,
@@ -73,15 +73,15 @@ async function tryTextEncoderForwardPass(modelId) {

const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.forward(processedTexts);
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");

await textEncoder.dispose();
}

async function tryImageEncoderForwardPass(modelId) {
const modalities = [Modality.ImageEncoder];
const { configPath, modalityPaths } = await getCheckpoint(
const { configPath, modalityPaths } = await getModel(
modelId,
modalities,
hf_token,
@@ -94,7 +94,7 @@ async function tryImageEncoderForwardPass(modelId) {

const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
const imageOutput = await imageEncoder.forward(processedImages);
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");

await imageEncoder.dispose();
@@ -135,7 +135,7 @@ async function fetchImage(url) {
async function tryCrossReferencingImageAndText(modelId) {

const modalities = [Modality.ImageEncoder, Modality.TextEncoder];
const { configPath, modalityPaths, tokenizerPath } = await getCheckpoint(
const { configPath, modalityPaths, tokenizerPath } = await getModel(
modelId,
modalities,
hf_token,
@@ -177,8 +177,8 @@ async function tryCrossReferencingImageAndText(modelId) {
const processedText = await textProcessor.process(text);
const processedImage = await imageProcessor.process(imageBuffer);

const textEmbedding = await textEncoder.forward(processedText);
const imageEmbedding = await imageEncoder.forward(processedImage);
const textEmbedding = await textEncoder.encode(processedText);
const imageEmbedding = await imageEncoder.encode(processedImage);

textEmbeddings.push(new Float32Array(textEmbedding.embeddings.cpuData));
imageEmbeddings.push(new Float32Array(imageEmbedding.embeddings.cpuData));
4 changes: 2 additions & 2 deletions javascript/hub.mjs
@@ -33,7 +33,7 @@ async function ensureDirectoryExists(dirPath) {
}
}

async function getCheckpoint(modelId, modalities, token = null, format = '.onnx', saveDir = './models') {
async function getModel(modelId, modalities, token = null, format = '.onnx', saveDir = './models') {
modalities = normalizeModalities(modalities);

const configNames = ['config.json'];
@@ -101,4 +101,4 @@ async function getCheckpoint(modelId, modalities, token = null, format = '.onnx'
return { configPath, modalityPaths, tokenizerPath };
}

export { getCheckpoint, Modality };
export { getModel, Modality };
124 changes: 124 additions & 0 deletions python/README.md
@@ -0,0 +1,124 @@
# UForm Python SDK

The UForm multimodal AI SDK offers a simple way to integrate multimodal AI capabilities into your Python applications.
The SDK doesn't require any deep learning knowledge, PyTorch, or CUDA installation, and can run on almost any hardware.

## Installation

There are several ways to install the UForm Python SDK, depending on the backend you want to use.
PyTorch is by far the heaviest, but the most capable.
ONNX is a lightweight alternative that can run on any CPU, and on some GPUs.

```bash
pip install "uform[torch]" # For PyTorch
pip install "uform[onnx]" # For ONNX on CPU
pip install "uform[onnx-gpu]" # For ONNX on GPU, available for some platforms
pip install "uform[torch,onnx]" # For PyTorch and ONNX Python tests
```

## Quick Start

### Embeddings

```py
from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

model_name = 'unum-cloud/uform3-image-text-english-small'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'

# Download the image
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

# The actual inference
image_data = processor_image(image)
text_data = processor_text(text)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
```
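
As a minimal sketch, assuming the PyTorch backend where `encode` returns `torch` tensors, the two embeddings above can be compared with cosine similarity:

```python
import numpy as np

# Hedged sketch: cosine similarity between the image and text embeddings above,
# assuming the PyTorch backend, where `encode` returns torch tensors.
image_vector = image_embedding.detach().cpu().numpy().flatten()
text_vector = text_embedding.detach().cpu().numpy().flatten()
similarity = np.dot(image_vector, text_vector) / (
    np.linalg.norm(image_vector) * np.linalg.norm(text_vector)
)
print(f'Cosine similarity: {similarity:.3f}')
```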

### Generative Models

## Technical Details

### Down-casting, Quantization, Matryoshka, and Slicing

Depending on the application, the embeddings can be down-cast to smaller numeric representations without losing much recall.
Switching from `f32` to `f16` is recommended in almost all cases, unless you are running on very old hardware without half-precision support.
Switching to `i8` with linear scaling is also possible, but the recall loss becomes noticeable on larger collections with millions of searchable entries.
Similarly, for higher-dimensional embeddings (512 or 768), a common strategy is to quantize them into single-bit representations for faster search.

```python
import numpy as np

f32_embedding: np.ndarray = model_text.encode(text_data, return_features=False).detach().cpu().numpy()
f16_embedding: np.ndarray = f32_embedding.astype(np.float16)
i8_embedding: np.ndarray = (f32_embedding * 127).astype(np.int8)
b1_embedding: np.ndarray = np.packbits((f32_embedding > 0).astype(np.uint8))
```

An alternative approach to quantization is to use Matryoshka embeddings, where the embeddings are sliced into smaller parts, and the search is performed in a hierarchical manner.

```python
import numpy as np

large_embedding: np.ndarray = model_text.encode(text_data, return_features=False).detach().cpu().numpy()
small_embedding: np.ndarray = large_embedding[:, :256]
tiny_embedding: np.ndarray = large_embedding[:, :64]
```

Both approaches are natively supported by the [USearch][github-usearch] vector-search engine and the [SimSIMD][github-simsimd] numerics library.
When dealing with small collections (up to millions of entries) and looking for low-latency cosine distance calculations, you can [achieve 5x-2500x performance improvement][report-simsimd] over Torch, NumPy, SciPy, and vanilla Python using SimSIMD.

```python
from simsimd import cosine, hamming

distance: float = cosine(f32_embedding, f32_embedding) # 32x SciPy performance on Apple M2 CPU
distance: float = cosine(f16_embedding, f16_embedding) # 79x SciPy performance on Apple M2 CPU
distance: float = cosine(i8_embedding, i8_embedding) # 133x SciPy performance on Apple M2 CPU
distance: float = hamming(b1_embedding, b1_embedding) # 17x SciPy performance on Apple M2 CPU
```

Similarly, when dealing with large collections (up to billions of entries per server) and looking for high-throughput search, you can [achieve 100x performance improvement][report-usearch] over FAISS and other vector-search solutions using USearch.
Here are a couple of examples:

```python
from usearch.index import Index

f32_index = Index(ndim=64, metric='cos', dtype='f32') # for Matryoshka embeddings
f16_index = Index(ndim=64, metric='cos', dtype='f16') # for Matryoshka embeddings
i8_index = Index(ndim=256, metric='cos', dtype='i8') # for quantized embeddings
b1_index = Index(ndim=768, metric='hamming', dtype='b1') # for binary embeddings
```

[github-usearch]: https://github.com/unum-cloud/usearch
[github-simsimd]: https://github.com/ashvardanian/simsimd
[report-usearch]: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search-with-intel
[report-simsimd]: https://ashvardanian.com/posts/python-c-assembly-comparison/

### Multi-GPU Parallelism

To achieve higher throughput, you can launch UForm on multiple GPUs.
For that, pick the encoder of the model you want to run in parallel and wrap it in `nn.DataParallel` (or `nn.DistributedDataParallel`).

```python
from torch import nn

from uform import get_model, Modality

processors, encoders = get_model('unum-cloud/uform-vl-english-small', backend='torch', device='gpu')

encoder_image = encoders[Modality.IMAGE_ENCODER]
encoder_image = nn.DataParallel(encoder_image)

# `images` is a pre-processed batch, as produced by the image processor in the Quick Start
_, res = encoder_image(images, 0)
```