InferenceSlicer threaded implementation slower than obss/sahi #1695

Closed
1 of 2 tasks
iokarkan opened this issue Nov 28, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@iokarkan

iokarkan commented Nov 28, 2024

Search before asking

  • I have searched the Supervision issues and found no similar bug report.

Bug

I was trying to measure the performance boost of supervision's SAHI implementation in InferenceSlicer with multiple worker threads against the original obss/sahi implementation, and I put together the script below to compare the two.

In my test, supervision appears to be slower per iteration for 256x256 slices of a 1024x527 sample image, across various worker-thread counts. I am skipping a few warmup runs, and I am avoiding overlap in both cases.

I believe obss/sahi is single-threaded, so using worker threads should help.

Indicatively, for 4 worker threads, I get:

{'Implementation': ['obss/sahi', 'supervision'], 'Inference Time (s)': [0.4129594915053424, 1.240290361292222]}

As an aside, I'm also getting verbose inference output during the supervision run that I can't figure out how to disable, though it shouldn't play too big a role:

0: 416x640 1 tie, 1 vase, 56.9ms
Speed: 1.7ms preprocess, 56.9ms inference, 22.2ms postprocess per image at shape (1, 3, 416, 640)
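
For what it's worth, a minimal sketch of silencing that printout, assuming it comes from Ultralytics' per-call logging and that its verbose predict argument applies inside the slicer callback; the names mirror the reproduction script below:

import numpy as np
from ultralytics import YOLO
from supervision import Detections

model = YOLO("yolov8n.pt")

def quiet_callback(image_slice: np.ndarray) -> Detections:
    # verbose=False suppresses Ultralytics' per-call speed/summary printout
    result = model(image_slice, conf=0.25, iou=0.1, device=0, verbose=False)[0]
    return Detections.from_ultralytics(result)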

Environment

  • OS: Ubuntu 22.04
  • Python: 3.10.12
  • requirements:
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
contourpy==1.3.1
cycler==0.12.1
defusedxml==0.7.1
filelock==3.16.1
fire==0.7.0
fonttools==4.55.0
fsspec==2024.10.0
idna==3.10
imagecodecs==2024.9.22
imageio==2.36.0
Jinja2==3.1.4
kiwisolver==1.4.7
lazy_loader==0.4
MarkupSafe==3.0.2
matplotlib==3.9.2
mpmath==1.3.0
networkx==3.4.2
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
opencv-python==4.9.0.80
packaging==24.2
pandas==2.2.3
pillow==11.0.0
psutil==6.1.0
py-cpuinfo==9.0.0
pybboxes==0.1.6
pyparsing==3.2.0
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
requests==2.32.3
sahi==0.11.18
scikit-image==0.24.0
scipy==1.14.1
seaborn==0.13.2
shapely==2.0.6
six==1.16.0
supervision==0.25.0
sympy==1.13.1
termcolor==2.5.0
terminaltables==3.1.10
thop==0.1.1.post2209072238
tifffile==2024.9.20
torch==2.5.1
torchvision==0.20.1
tqdm==4.67.1
triton==3.1.0
typing_extensions==4.12.2
tzdata==2024.2
ultralytics==8.1.27
urllib3==2.2.3

Minimal Reproducible Example

from ultralytics import YOLO
from sahi.auto_model import AutoDetectionModel
from sahi.predict import get_sliced_prediction
from supervision import InferenceSlicer, Detections
import time
import cv2
import numpy as np

sample_image_path = "image.jpg"
model_path = "yolov8n.pt"

yolo_model = YOLO(model_path, verbose=False)


# Function to measure obss/sahi inference (sahi wraps the weights itself via
# AutoDetectionModel, so the YOLO instance passed in is not used here)
def run_obss_sahi_inference(image, model):
    detection_model = AutoDetectionModel.from_pretrained(
        model_type="yolov8", model_path=model_path, confidence_threshold=0.25
    )
    times = []
    for _ in range(20):
        start_time = time.time()
        result = get_sliced_prediction(
            image,
            detection_model=detection_model,
            postprocess_match_metric="IOU",
            postprocess_match_threshold=0.1,
            slice_height=256,
            slice_width=256,
            overlap_height_ratio=0.0,
            overlap_width_ratio=0.0,
        )
        end_time = time.time()
        times.append(end_time - start_time)
    # average over the remaining runs, skipping the first 3 as warmup
    return result, sum(times[3:]) / len(times[3:])


# Function to measure supervision inference
def run_supervision_inference(image, model):
    def callback(image_slice: np.ndarray) -> Detections:
        result = model(image_slice, conf=0.25, iou=0.1, device=0)[0]
        return Detections.from_ultralytics(result)

    inference_slicer = InferenceSlicer(
        callback=callback,
        slice_wh=(256, 256),
        overlap_wh=None,
        thread_workers=4,
    )
    times = []
    for _ in range(20):
        start_time = time.time()
        detections = inference_slicer(image)
        end_time = time.time()
        times.append(end_time - start_time)
    # average over the remaining runs, skipping the first 3 as warmup
    return detections, sum(times[3:]) / len(times[3:])


def main():
    img_array = cv2.imread(sample_image_path)
    obss_result, obss_time = run_obss_sahi_inference(img_array, yolo_model)
    supervision_result, supervision_time = run_supervision_inference(
        img_array, yolo_model
    )

    comparison_results = {
        "Implementation": ["obss/sahi", "supervision"],
        "Inference Time (s)": [obss_time, supervision_time],
    }

    print(comparison_results)


if __name__ == "__main__":
    main()

Additional

[image attachment]

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@iokarkan iokarkan added the bug Something isn't working label Nov 28, 2024
@LinasKo
Contributor

LinasKo commented Nov 28, 2024

Hi @iokarkan 👋

Thank you for the thorough report. The result doesn't surprise me, as threads behave unpredictably when vision models and GPU access are involved. We'll look into it, but it might take some time.

Meanwhile, if you're keen on speed, we have an implementation that runs inference in bulk on the GPU. It's not up to date, but if this is urgent, you might be able to adapt it with a custom InferenceSlicer class or some monkeypatching.

#1239

Again, thank you. It's wonderful to receive such a thorough report, with full reproduction steps.

@iokarkan
Author

Hi @LinasKo, thanks for the answer.

I'm interested in real-time scenarios, so a single image per iteration. However, if batching means sending all the component slices (and/or the original image, as SAHI does) to the GPU as one batch, it should be more performant. I'll take a look!
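
A rough sketch of what that could look like outside InferenceSlicer, assuming Ultralytics accepts a list of tiles as one batched call and using supervision's Detections.merge and with_nms to recombine the per-tile results (the batched_sliced_inference helper below is illustrative, not supervision's API or the #1239 implementation):

import cv2
import numpy as np
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

def batched_sliced_inference(image: np.ndarray, slice_wh=(256, 256)) -> sv.Detections:
    h, w = image.shape[:2]
    sw, sh = slice_wh
    tiles, offsets = [], []
    # Non-overlapping tiling; edge tiles may be smaller than slice_wh.
    for y in range(0, h, sh):
        for x in range(0, w, sw):
            tiles.append(image[y:y + sh, x:x + sw])
            offsets.append((x, y))
    # One batched Ultralytics call for all tiles (a list in, a list of Results out).
    results = model(tiles, conf=0.25, verbose=False)
    per_tile = []
    for result, (x, y) in zip(results, offsets):
        det = sv.Detections.from_ultralytics(result)
        det.xyxy = det.xyxy + np.array([x, y, x, y])  # shift boxes back to full-image coordinates
        per_tile.append(det)
    # Merge per-tile detections and suppress duplicates near tile borders.
    return sv.Detections.merge(per_tile).with_nms(threshold=0.1)

detections = batched_sliced_inference(cv2.imread("image.jpg"))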

@LinasKo
Contributor

LinasKo commented Dec 4, 2024

I'm closing this, as there's not much we can do besides batching, which is already covered by #1239 and related issues.

@LinasKo LinasKo closed this as completed Dec 4, 2024