Optimized decoder for WebAssembly #4

thelamer · 2022-12-16T19:11:34Z

Feel free to simply close this issue if you are not interested, but we just implemented the QOI image format for VNC to deliver lossless remote desktops, using Rust WASM client-side, here:
https://github.com/kasmtech/noVNC/tree/master/core/decoders/qoi
Some docs here:
https://www.kasmweb.com/docs/latest/how_to/lossless.html

I have been wondering for some time whether SIMD optimizations were even possible on the server side. I tried out the stable branch with ssse3 and saw roughly ±10% in encoding speed vs rapid-qoi, depending on what image you feed it. It looks like offloading the hashing has some promise, especially once the AVX stuff is implemented.
Specifically, though, I am reaching out to ask whether you think decoding could be sped up in a web browser. The compiled blob linked earlier in noVNC is a modified version of this implementation:
https://github.com/lukeflima/qoi-viewer

This is all functional, but under high-load scenarios you need a pretty beefy client to maintain FPS at a gigabit. Even a small improvement on the WebAssembly side would have a large impact on the overall smoothness of desktop delivery. Everything we do for desktop delivery is open source, and these changes would be too, if possible.

Essentially, I am wondering if you would be interested in some side work to put together a highly optimized open source WASM QOI decoder that takes a Uint8Array as input and returns the pixel data for an "ImageData" as a Uint8ClampedArray, plus size information. We do 24-bit QOI without the alpha channel.
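One detail of that interface is worth spelling out: the browser's ImageData expects four bytes per pixel (RGBA), so a 24-bit QOI stream has to be expanded with an opaque alpha somewhere before it reaches the canvas. A minimal sketch of that step in plain Rust follows. The `DecodedFrame` struct and the `rgb_to_rgba_frame` name are hypothetical, invented for illustration, and the WASM binding layer is deliberately left out.

```rust
/// Hypothetical return type: the pieces a JS caller would need to build an
/// `ImageData` (RGBA bytes plus dimensions). Not part of any existing API.
pub struct DecodedFrame {
    pub width: u32,
    pub height: u32,
    /// 4 bytes per pixel, row-major, alpha always 255.
    pub rgba: Vec<u8>,
}

/// Expand tightly packed RGB pixels (3 bytes each) to RGBA with opaque alpha.
/// Returns `None` if the buffer length does not match the dimensions.
pub fn rgb_to_rgba_frame(width: u32, height: u32, rgb: &[u8]) -> Option<DecodedFrame> {
    let pixel_count = (width as usize).checked_mul(height as usize)?;
    if rgb.len() != pixel_count.checked_mul(3)? {
        return None;
    }
    let mut rgba = Vec::with_capacity(pixel_count.checked_mul(4)?);
    for px in rgb.chunks_exact(3) {
        rgba.extend_from_slice(&[px[0], px[1], px[2], 255]);
    }
    Some(DecodedFrame { width, height, rgba })
}
```

An optimized decoder would probably write RGBA directly while decoding rather than in a second pass; the separate function is only there to make the layout explicit.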

Borketh · 2022-12-28T21:41:42Z

Hi @thelamer !
WASM is a planned target for optimizations, but I haven't looked into it much yet. I was working on some further optimizations and restructuring on the ssse3 and x86-64 part in general, but I may have lost my work (currently trying to find it on a potentially borked disk image as we speak, oh boy). My road map was basically to work my way up the feature sets of x86 before moving on to ARM, and then potentially WASM.

I don't know all that much about the latter two platforms, but I went into x86 stuff without knowing anything either, so I can just learn the same way. What I am aware of, however, is that Rust only targets wasm32 at the moment (correct me if I'm wrong). Some of the optimizations (including the single-pixel hash function inspired by rapid-qoi) depend on being within a 64-bit integer, so those would have to be stripped. I also don't know how wide the range of SIMD options in WASM is. I assume that they have to be more general to make them platform-independent. Some of the instructions I would need to optimize easily may not exist, but I can definitely try. The potential is potentially there, I think (lol).
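To illustrate what "within a 64-bit integer" can mean for the pixel hash, here is a sketch of one way to compute the QOI index hash, (r*3 + g*5 + b*7 + a*11) % 64, with a single 64-bit multiply. It is an assumption about the general technique, not a copy of what hardqoi or rapid-qoi actually do, and the function names are made up for this example. As far as I know, the "32" in wasm32 refers to pointer width, and WebAssembly does define 64-bit integer arithmetic, so a trick like this may carry over; that is untested here.

```rust
/// Reference QOI index hash: (r*3 + g*5 + b*7 + a*11) % 64.
pub fn hash_reference(px: [u8; 4]) -> u8 {
    let [r, g, b, a] = px;
    ((r as u32 * 3 + g as u32 * 5 + b as u32 * 7 + a as u32 * 11) % 64) as u8
}

/// The same hash via one 64-bit multiply (illustrative sketch).
pub fn hash_mul64(px: [u8; 4]) -> u8 {
    let [r, g, b, a] = px;
    // Spread the channels out so that each one lines up with its weight
    // at bit 56 of the product: r*(3<<56), b<<16 * 7<<40, g<<40 * 5<<16,
    // a<<56 * 11. All other partial products either stay below bit 56
    // (their sum is too small to carry into it) or overflow past bit 63.
    let s = (r as u64) | ((b as u64) << 16) | ((g as u64) << 40) | ((a as u64) << 56);
    const WEIGHTS: u64 = 0x0300_0700_0005_000B; // 3<<56 | 7<<40 | 5<<16 | 11
    // The top byte is the weighted sum mod 256; masking gives it mod 64.
    ((s.wrapping_mul(WEIGHTS) >> 56) as u8) & 63
}
```

Both functions should agree for every pixel value, which makes the reference version a convenient oracle when testing a faster one.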

If you want, I can try WASM after I finish x86 (once I'm done with the base stuff and ssse3, the rest of the instructions won't take very long).

thelamer · 2023-01-11T23:02:05Z

@AstroFloof sorry for the delay, I missed this ping. Yes, SIMD instructions in WebAssembly are very limited.
I am not very low level; my programming experience has mostly revolved around my work at Linuxserver.io, building out web apps in my spare time. When it comes to making something new, that generally means plugging off-the-shelf components into each other, as was done with the existing WASM QOI decode logic and the noVNC project.

So in this case my focus is on any optimizations, even small ones, that could be made to the reference decoding implementation I used, since they translate directly to lower client-side CPU usage and higher FPS. I thought I would reach out to anyone trying to improve decoding/encoding for the QOI v1 spec, which is a very short list. Right now performance is pretty good on higher-end modern CPUs:

fpsdemo.mp4

An easy way to see this first hand would be to run:
docker run --rm -it --shm-size=512m -p 6901:6901 -e VNC_PW=password kasmweb/ubuntu-focal-desktop:1.12.0
Then open https://localhost:6901 and log in with:
user: kasm_user
pass: password
Then swap to lossless under settings > stream quality. (Use a Chromium-based browser for best results.)

From a development standpoint it would just involve building and swapping out the wasm blob and function names in:
https://github.com/kasmtech/noVNC/blob/master/core/decoders/qoi/decoder.js#L256-L277 and seeing if it can eke out any more FPS (inside the Docker image, at /usr/share/kasmvnc/www/core/decoders/qoi/).

If you set up a GitHub Sponsorship on your account, I would be happy to toss you some money just for looking into it. I'm interested in whether it is even possible.

Borketh · 2023-01-17T19:46:13Z

Hello again @thelamer !

I'm honoured that you'd consider sponsoring this project, and I would love to work on a WASM decoder (and later an encoder). I should warn you not to expect anything, however. I don't know anything about WASM yet (although I knew nothing about x86 before starting the project, so I will learn, of course), and I'm not sure how long it'll take me to make an MVP to start optimizing. Additionally, most if not all of the SIMD-related optimizations are focused on hashing every pixel before beginning the encoding process. I do know of techniques I've used to speed up decoding that I can attempt to use, though. Whatever the final product may be, I'll certainly do my best!
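As a rough picture of what "hashing every pixel before beginning the encoding process" could look like, here is a scalar sketch of a pre-hash pass. The `prehash` name is hypothetical and this is not taken from hardqoi. It leans on the fact that 64 divides 256, so the hash can be computed entirely in wrapping 8-bit arithmetic, which is the kind of per-byte work that SIMD lanes (or an auto-vectorizing compiler) tend to handle well.

```rust
/// Compute the QOI index hash for every RGBA pixel up front.
/// Because 64 divides 256, wrapping `u8` arithmetic followed by `& 63`
/// gives the same result as (r*3 + g*5 + b*7 + a*11) % 64.
pub fn prehash(pixels: &[[u8; 4]]) -> Vec<u8> {
    pixels
        .iter()
        .map(|&[r, g, b, a]| {
            r.wrapping_mul(3)
                .wrapping_add(g.wrapping_mul(5))
                .wrapping_add(b.wrapping_mul(7))
                .wrapping_add(a.wrapping_mul(11))
                & 63
        })
        .collect()
}
```

An encoder could then read the hash for pixel `i` from the returned vector instead of recomputing it inside the main loop.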

I intend to release a new version of the x86-64 encoder/decoder very soon, so I'll start after that.

Feyko · 2023-01-20T20:32:18Z

Hey @thelamer! I'm a friend of Floof's who introduced him to QOI and has followed his hardqoi development since.
At one point I wondered if a QOI GPU endec was possible, but I abandoned the idea after learning about the relatively high latency of GPU API initialisation. This, however, wouldn't be a problem if the API context is reused, as it could be for something like remote desktop streams. There are (many) other challenges with QOI on GPU, but it seems like something worth looking into.
I'll admit I got nerd-sniped and may try to revive the idea, though I am confused about why you'd use an image format like QOI instead of a video format, which gives higher compression.
Could you explain why you made that choice? I'm available on Discord (Feyko#7953) if you want a more interactive chat.

Feyko · 2023-01-23T20:33:00Z

Welp, quick update on this. Did some more investigating and I think QOI on GPU really is a dead-end :P

thelamer · 2023-05-10T20:47:04Z

@AstroFloof so I have been pumping the decode code through different AI language models, and some of them seem to think the hashing is not needed and that the same thing can be done more efficiently with an array. This is all Greek to me, but does this make any sense to you?

use std::io::Read;

fn decode_qoi(reader: impl Read) -> Result<Vec<u8>, std::io::Error> {
    let mut buf = vec![0; 16];
    reader.read_exact(&mut buf)?;

    let magic = u32::from_be_bytes(buf[0..4].to_vec());
    let width = u32::from_be_bytes(buf[4..8].to_vec());
    let height = u32::from_be_bytes(buf[8..12].to_vec());
    let channels = buf[12] as u8;
    let colorspace = buf[13] as u8;

    if magic != 0x716f6966 {
        return Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "Invalid QOI magic"));
    }

    if width == 0 || height == 0 || channels < 3 || channels > 4 || colorspace > 1 {
        return Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "Invalid QOI header"));
    }

    let mut pixels = vec![0; width * height * channels];

    let mut run = 0;
    let mut prev_color = [0; 4];
    for i in 0..width * height {
        if run > 0 {
            run -= 1;
            continue;
        }

        let op = reader.read_u8()?;
        match op {
            0xfe => {
                prev_color[0] = reader.read_u8()?;
                prev_color[1] = reader.read_u8()?;
                prev_color[2] = reader.read_u8()?;
            }
            0xff => {
                prev_color[0] = reader.read_u8()?;
                prev_color[1] = reader.read_u8()?;
                prev_color[2] = reader.read_u8()?;
                prev_color[3] = reader.read_u8()?;
            }
            0x00..=0x3f => {
                let index = op as usize;
                for j in 0..channels {
                    pixels[i * channels + j] = prev_color[j] + index;
                }
            }
            0x40..=0x7f => {
                let index = (op & 0x3f) as usize;
                for j in 0..channels {
                    pixels[i * channels + j] = prev_color[j] + ((op >> 4) & 0x03) - 2;
                }
            }
            0x80..=0xbf => {
                let value = (op & 0x3f) - 32;
                for j in 0..channels {
                    pixels[i * channels + j] = value;
                }
            }
            0xc0..=0xff => {
                let runs = op & 0x3f;
                run = runs;
            }
            _ => {
                return Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "Invalid QOI op"));
            }
        }
    }

    Ok(pixels)
}

Borketh · 2023-06-02T17:44:53Z

This doesn't make any sort of sense. LLMs just predict the next token and have no idea what they're doing.
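To make the objection concrete: in QOI the decoder keeps a 64-entry array of recently seen pixels, and the hash is how a pixel's slot in that array is chosen, so the array goes together with the hash rather than replacing it. The snippet above never maintains that array, treats the index op as a colour offset, and calls `read_u8`, which `std::io::Read` does not provide. Below is a minimal sketch of a decoder that follows the QOI v1 chunk layout as I understand it. It is unoptimized and is not hardqoi's implementation; the `decode_qoi` signature and the pixel-count cap are assumptions made for this example.

```rust
const HEADER_LEN: usize = 14; // "qoif", width, height, channels, colorspace
const END_MARKER_LEN: usize = 8; // seven 0x00 bytes followed by 0x01

/// Slot of a pixel in the 64-entry index array.
fn hash(px: [u8; 4]) -> usize {
    (px[0] as usize * 3 + px[1] as usize * 5 + px[2] as usize * 7 + px[3] as usize * 11) % 64
}

/// Decode a QOI byte stream. Returns (width, height, channels, pixels).
pub fn decode_qoi(data: &[u8]) -> Result<(u32, u32, u8, Vec<u8>), &'static str> {
    if data.len() < HEADER_LEN + END_MARKER_LEN || &data[0..4] != b"qoif" {
        return Err("invalid QOI header");
    }
    let width = u32::from_be_bytes([data[4], data[5], data[6], data[7]]);
    let height = u32::from_be_bytes([data[8], data[9], data[10], data[11]]);
    let channels = data[12];
    if width == 0 || height == 0 || !(3..=4).contains(&channels) || data[13] > 1 {
        return Err("invalid QOI header");
    }
    let pixel_count = (width as usize).checked_mul(height as usize).ok_or("image too large")?;
    if pixel_count > 400_000_000 {
        return Err("image too large"); // arbitrary safety cap for this sketch
    }
    let mut pixels = Vec::with_capacity(pixel_count * channels as usize);

    let chunks = &data[HEADER_LEN..data.len() - END_MARKER_LEN];
    let mut index = [[0u8; 4]; 64]; // previously seen pixels, addressed by hash
    let mut px = [0u8, 0, 0, 255]; // the stream starts from opaque black
    let mut pos = 0;
    let mut run = 0u8;

    for _ in 0..pixel_count {
        if run > 0 {
            run -= 1; // still repeating the previous pixel
        } else {
            let op = *chunks.get(pos).ok_or("truncated QOI data")?;
            pos += 1;
            match op {
                0xfe | 0xff => {
                    // Literal RGB (alpha unchanged) or literal RGBA.
                    let n = if op == 0xfe { 3 } else { 4 };
                    let src = chunks.get(pos..pos + n).ok_or("truncated QOI data")?;
                    px[..n].copy_from_slice(src);
                    pos += n;
                }
                // Index op: look the pixel up in the array.
                0x00..=0x3f => px = index[op as usize],
                0x40..=0x7f => {
                    // Diff op: 2-bit per-channel deltas with a bias of 2.
                    px[0] = px[0].wrapping_add((op >> 4) & 3).wrapping_sub(2);
                    px[1] = px[1].wrapping_add((op >> 2) & 3).wrapping_sub(2);
                    px[2] = px[2].wrapping_add(op & 3).wrapping_sub(2);
                }
                0x80..=0xbf => {
                    // Luma op: green delta (bias 32), then red and blue
                    // deltas relative to green (bias 8) in a second byte.
                    let b2 = *chunks.get(pos).ok_or("truncated QOI data")?;
                    pos += 1;
                    let dg = (op & 0x3f).wrapping_sub(32);
                    px[0] = px[0].wrapping_add(dg).wrapping_add(b2 >> 4).wrapping_sub(8);
                    px[1] = px[1].wrapping_add(dg);
                    px[2] = px[2].wrapping_add(dg).wrapping_add(b2 & 0x0f).wrapping_sub(8);
                }
                // Run op (0xc0..=0xfd): this pixel plus `op & 0x3f` repeats.
                _ => run = op & 0x3f,
            }
            // This is where the hash comes in: remember the pixel so a
            // later index op can refer back to it.
            index[hash(px)] = px;
        }
        pixels.extend_from_slice(&px[..channels as usize]);
    }
    Ok((width, height, channels, pixels))
}
```

Feeding it a tiny hand-built stream (a literal RGB pixel, a run, a diff, then an index op pointing back at the first pixel) is a quick way to check that the array and the hash stay in step.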

thelamer · 2023-06-02T19:49:26Z

> This doesn't make any sort of sense. LLMs just predict the next token and have no idea what they're doing.

Thanks for taking a look. I figured as much; I could not get this to run.

Borketh self-assigned this Jan 17, 2023

Borketh added this to the WASM milestone Jan 17, 2023
