InferenceSlicer threaded implementation slower than obss/sahi #1695

Closed
1 of 2 tasks
iokarkan opened this issue Nov 28, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@iokarkan

iokarkan commented Nov 28, 2024

Search before asking

  • I have searched the Supervision issues and found no similar bug report.

Bug

I was trying to measure the performance boost of supervision's SAHI implementation in InferenceSlicer with multiple worker threads against the original obss/sahi implementation, and I put together the script below to compare the two.

In my test, supervision appears to be slower per iteration for 256x256 slices of a 1024x527 sample image, across various worker-thread counts. I am skipping a few warmup runs, and I am avoiding overlap in both cases.

I believe obss/sahi is single-threaded, so using worker threads should help.

Indicatively, for 4 worker threads, I get:

{'Implementation': ['obss/sahi', 'supervision'], 'Inference Time (s)': [0.4129594915053424, 1.240290361292222]}

As an aside, I'm also getting verbose inference output during the supervision run that I can't figure out how to disable, though it shouldn't play too big a role:

0: 416x640 1 tie, 1 vase, 56.9ms
Speed: 1.7ms preprocess, 56.9ms inference, 22.2ms postprocess per image at shape (1, 3, 416, 640)
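
For what it's worth, a minimal sketch of silencing that printout, assuming it comes from Ultralytics' per-call logging and that its verbose predict argument applies inside the slicer callback; the names mirror the reproduction script below:

import numpy as np
from ultralytics import YOLO
from supervision import Detections

model = YOLO("yolov8n.pt")

def quiet_callback(image_slice: np.ndarray) -> Detections:
    # verbose=False suppresses Ultralytics' per-call speed/summary printout
    result = model(image_slice, conf=0.25, iou=0.1, device=0, verbose=False)[0]
    return Detections.from_ultralytics(result)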

Environment

  • OS: Ubuntu 22.04
  • Python: 3.10.12
  • requirements:
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
contourpy==1.3.1
cycler==0.12.1
defusedxml==0.7.1
filelock==3.16.1
fire==0.7.0
fonttools==4.55.0
fsspec==2024.10.0
idna==3.10
imagecodecs==2024.9.22
imageio==2.36.0
Jinja2==3.1.4
kiwisolver==1.4.7
lazy_loader==0.4
MarkupSafe==3.0.2
matplotlib==3.9.2
mpmath==1.3.0
networkx==3.4.2
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
opencv-python==4.9.0.80
packaging==24.2
pandas==2.2.3
pillow==11.0.0
psutil==6.1.0
py-cpuinfo==9.0.0
pybboxes==0.1.6
pyparsing==3.2.0
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
requests==2.32.3
sahi==0.11.18
scikit-image==0.24.0
scipy==1.14.1
seaborn==0.13.2
shapely==2.0.6
six==1.16.0
supervision==0.25.0
sympy==1.13.1
termcolor==2.5.0
terminaltables==3.1.10
thop==0.1.1.post2209072238
tifffile==2024.9.20
torch==2.5.1
torchvision==0.20.1
tqdm==4.67.1
triton==3.1.0
typing_extensions==4.12.2
tzdata==2024.2
ultralytics==8.1.27
urllib3==2.2.3

Minimal Reproducible Example

from ultralytics import YOLO
from sahi.auto_model import AutoDetectionModel
from sahi.predict import get_sliced_prediction
from supervision import InferenceSlicer, Detections
import time
import cv2
import numpy as np

sample_image_path = "image.jpg"
model_path = "yolov8n.pt"

yolo_model = YOLO(model_path, verbose=False)


# Function to measure obss/sahi inference (sahi wraps the weights itself via
# AutoDetectionModel, so the YOLO instance passed in is not used here)
def run_obss_sahi_inference(image, model):
    detection_model = AutoDetectionModel.from_pretrained(
        model_type="yolov8", model_path=model_path, confidence_threshold=0.25
    )
    times = []
    for _ in range(20):
        start_time = time.time()
        result = get_sliced_prediction(
            image,
            detection_model=detection_model,
            postprocess_match_metric="IOU",
            postprocess_match_threshold=0.1,
            slice_height=256,
            slice_width=256,
            overlap_height_ratio=0.0,
            overlap_width_ratio=0.0,
        )
        end_time = time.time()
        times.append(end_time - start_time)
    # average over the remaining runs, skipping the first 3 as warmup
    return result, sum(times[3:]) / len(times[3:])


# Function to measure supervision inference
def run_supervision_inference(image, model):
    def callback(image_slice: np.ndarray) -> Detections:
        result = model(image_slice, conf=0.25, iou=0.1, device=0)[0]
        return Detections.from_ultralytics(result)

    inference_slicer = InferenceSlicer(
        callback=callback,
        slice_wh=(256, 256),
        overlap_wh=None,
        thread_workers=4,
    )
    times = []
    for _ in range(20):
        start_time = time.time()
        detections = inference_slicer(image)
        end_time = time.time()
        times.append(end_time - start_time)
    # average over the remaining runs, skipping the first 3 as warmup
    return detections, sum(times[3:]) / len(times[3:])


def main():
    img_array = cv2.imread(sample_image_path)
    obss_result, obss_time = run_obss_sahi_inference(img_array, yolo_model)
    supervision_result, supervision_time = run_supervision_inference(
        img_array, yolo_model
    )

    comparison_results = {
        "Implementation": ["obss/sahi", "supervision"],
        "Inference Time (s)": [obss_time, supervision_time],
    }

    print(comparison_results)


if __name__ == "__main__":
    main()

Additional

[image attachment]

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@iokarkan iokarkan added the bug Something isn't working label Nov 28, 2024
@LinasKo
Contributor

LinasKo commented Nov 28, 2024

Hi @iokarkan 👋

Thank you for the thorough report. The result doesn't surprise me, as threads behave unpredictably when vision models and GPU access are involved. We'll look into it, but it might take some time.

Meanwhile, if you're keen on speed, we have an implementation that runs inference in bulk on the GPU. It's not up to date, but if this is urgent, you might be able to adapt it with a custom InferenceSlicer class or some monkeypatching.

#1239

Again, thank you. It's wonderful to receive such a thorough report, with full reproduction steps.

@iokarkan
Author

Hi @LinasKo, thanks for the answer.

I'm interested in real-time scenarios, so a single image per iteration. However, if batching means sending all the component slices (and/or the original image, as SAHI does) to the GPU as one batch, it should be more performant. I'll take a look!
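
A rough sketch of what that could look like outside InferenceSlicer, assuming Ultralytics accepts a list of tiles as one batched call and using supervision's Detections.merge and with_nms to recombine the per-tile results (the batched_sliced_inference helper below is illustrative, not supervision's API or the #1239 implementation):

import cv2
import numpy as np
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

def batched_sliced_inference(image: np.ndarray, slice_wh=(256, 256)) -> sv.Detections:
    h, w = image.shape[:2]
    sw, sh = slice_wh
    tiles, offsets = [], []
    # Non-overlapping tiling; edge tiles may be smaller than slice_wh.
    for y in range(0, h, sh):
        for x in range(0, w, sw):
            tiles.append(image[y:y + sh, x:x + sw])
            offsets.append((x, y))
    # One batched Ultralytics call for all tiles (a list in, a list of Results out).
    results = model(tiles, conf=0.25, verbose=False)
    per_tile = []
    for result, (x, y) in zip(results, offsets):
        det = sv.Detections.from_ultralytics(result)
        det.xyxy = det.xyxy + np.array([x, y, x, y])  # shift boxes back to full-image coordinates
        per_tile.append(det)
    # Merge per-tile detections and suppress duplicates near tile borders.
    return sv.Detections.merge(per_tile).with_nms(threshold=0.1)

detections = batched_sliced_inference(cv2.imread("image.jpg"))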

@LinasKo
Contributor

LinasKo commented Dec 4, 2024

I'm closing this, as there's not much we can do besides batching, which is already covered by #1239 and related issues.

@LinasKo LinasKo closed this as completed Dec 4, 2024