I designed my own fast game streaming video codec – PyroWave


Almondsetat

VC-2 is an intra-only, wavelet-based, ultra-low-latency codec developed by the BBC years ago for exactly this purpose. It is royalty-free, and currently the only implementations are in ffmpeg and in the official BBC repository, both CPU-based. I am planning to make a CUDA-accelerated version for my master's thesis, since the Vulkan implementations made at GSoC last year still suck quite a bit. I would suggest people look into this codec.

_kb

Definitely a neat codec! You can get COTS hardware en/decoders that use it via https://atlona.com/omnistream-av-over-ip/.

averne_

Do you mind going in some detail as to why they suck? Not a dig, just genuinely curious.

Almondsetat

95% GPU usage but only x2 faster than the reference SIMD encoder/decoder

actionfromafar

What I wonder is, how do you get the video frames to be compressed from the video card into the encoder?

The only frame capture APIs I know of take the image from the GPU to CPU RAM; then you can put it back into the GPU for encoding.

Are there APIs which can sidestep the "load to CPU RAM" part?

Or is it implied, that a game streaming codec has to be implemented with custom GPU drivers?

Almondsetat

Some capture cards (Blackmagic comes to mind) have worked together with NVIDIA to expose DMA access. This way video frames are automatically transferred from the card to the GPU memory bypassing the RAM and CPU. I think all GPU manufacturers expose APIs to do this, but it's not that common in consumer products.

Const-me

> Are there APIs which can sidestep the "load to CPU RAM" part?

On Windows, that API is Desktop Duplication. The API delivers D3D11 textures, usually in BGRA8_UNORM format. When HDR is enabled, you need a slightly different API method, which can deliver HDR frames in RGBA16_FLOAT pixel format.

LtdJorge

On Linux you should look into GStreamer and dmabuf.

oplav

In your experience, how does VC-2 compare to JPEG XS from a quality perspective? The JPEG XS resources I’ve seen say JPEG XS has higher visual quality, but curious what it’s like in practice.

Almondsetat

JPEG-XS is an almost direct successor to VC-2. They use the same techniques, and if you read JPEG-XS's whitepaper they explicitly cite VC-2 as an inspiration and a target to surpass. JPEG-XS is an improvement, there is no doubt about that, but unfortunately they decided to patent it for all uses. In both cases, the publicly available software implementations are very few and CPU-based, and the ones that aren't are implemented in hardware inside business AV solutions.

sippeangelo

I know next to nothing about video encoding, but I feel like there should be so much low hanging fruit when it comes to videogame streaming if the encoder just cooperated with the game engine even slightly. Things like motion prediction would be free since most rendering engines already have a dedicated buffer just for that for its own rendering, for example. But there's probably some nasty patent hampering innovation there, so might as well forget it!

torginus

'Motion vectors' in H.264 are a weird bit twiddling/image compression hack and have nothing to do with actual motion vectors.

- In a 3d game, a motion vector is the difference between the position of an object in 3d space from the previous to the current frame

- In H.264, the 'motion vector' is basically saying - copy this rectangular chunk of pixels from some point from some arbitrary previous frame and then encode the difference between the reference pixels and the copy with JPEG-like techniques (DCT et al)

This block copying is why H.264 video devolves into a mess of squares once the bandwidth craps out.
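
To make the "copy a chunk, then encode the difference" description concrete, here is a minimal Python sketch of block-based motion compensation. It is only an illustration under simplifying assumptions: whole-pixel vectors, one fixed block size, and a raw residual with no transform or quantization, none of which matches real H.264. All function names are hypothetical.

```python
# Illustrative sketch only: real H.264 uses sub-pixel vectors, variable block
# sizes, and a DCT-like transform plus quantization on the residual.

def get_block(frame, top, left, size):
    """Return a size x size block of pixels as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def encode_block(current, reference, top, left, size, mv):
    """Copy the displaced block from the reference frame and return the residual."""
    dy, dx = mv
    predicted = get_block(reference, top + dy, left + dx, size)
    actual = get_block(current, top, left, size)
    return [[a - p for a, p in zip(arow, prow)]
            for arow, prow in zip(actual, predicted)]


def decode_block(reference, top, left, size, mv, residual):
    """Rebuild the block from the same copy plus the transmitted residual."""
    dy, dx = mv
    predicted = get_block(reference, top + dy, left + dx, size)
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(predicted, residual)]


# A 4x6 reference frame, and a current frame whose content moved one pixel right.
reference = [[10 * r + c for c in range(6)] for r in range(4)]
current = [[row[0]] + row[:-1] for row in reference]

mv = (0, -1)  # the block at (1, 2) in `current` came from one pixel to the left
residual = encode_block(current, reference, top=1, left=2, size=2, mv=mv)
restored = decode_block(reference, top=1, left=2, size=2, mv=mv, residual=residual)
```

When the vector points at a good match, the residual is all zeros and costs almost nothing to code; when it does not, the residual carries the error, and that is what gets starved when bandwidth runs out.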

pornel

Motion vectors in video codecs are an equivalent of a 2D projection of 3D motion vectors.

In typical video encoding motion compensation of course isn't derived from real 3D motion vectors, it's merely a heuristic based on optical flow and a bag of tricks, but in principle the actual game's motion vectors could be used to guide video's motion compensation. This is especially true when we're talking about a custom codec, and not reusing the H.264 bitstream format.

Referencing previous frames doesn't add latency, and limiting motion to just displacement of the previous frame would be computationally relatively simple. You'd need some keyframes or gradual refresh to avoid "datamoshing" look persisting on packet loss.

However, the challenge is in encoding the motion precisely enough to make it useful. If it's not aligned with sub-pixel precision it may make textures blurrier and make movement look wobbly almost like PS1 games. It's hard to fix that by encoding the diff, because the diff ends up having high frequencies that don't survive compression. Motion compensation also should be encoded with sharp boundaries between objects, as otherwise it causes shimmering around edges.

CyberDildonics

> Motion vectors in video codecs are an equivalent of a 2D projection of 3D motion vectors.

3D motion vectors always get projected to 2D anyway. They also aren't used for moving blocks of pixels around, they are floating point values that get used along with a depth map to re-rasterize an image with motion blur.

robterrell

Isn't the use of the H.264 motion vector to save bits when there is a camera pan? A pan is a case where every pixel in the frame will change, but maybe doesn't have to.

superjan

Yes, or when a character moves across the screen. They are quite fine grained. However, when the decoder reads the motion vectors from the bitstream, it is typically not supposed to attach meaning to them: they could point to a patch that is not the same patch in the previous scene, but looks similar enough to serve as a starting point.

ChadNauseam

I think you're right. Suppose the connection to the game streaming service adds two frames of latency, and the player is playing an FPS. One thing game engines could do is provide the game UI and the "3D world view" as separate framebuffers. Then, when moving the mouse on the client, the software could translate the 3D world view instantly for the next two frames that came from the server but are from before the user having moved their mouse.

VR games already do something like this, so that when a game runs at below the maximum FPS of the VR headset, it can still respond to your head movements. It's not perfect because there's no parallax and it can't show anything for the region that was previously outside of your field of view, but it still makes a huge difference. (Of course, it's more important for VR because without doing this, any lag spike in a game would instantly induce motion sickness in the player. And if they wanted to, parallax could be faked using a depth map)
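
A rough sketch of that client-side correction, assuming the crudest possible model: the already-received world view is simply shifted by a pixel offset derived from the mouse movement, and pixels that were outside the old field of view are left empty. The `reproject` function and its parameters are hypothetical, and real VR reprojection rotates the view rather than translating pixels.

```python
def reproject(frame, dx, dy, fill=None):
    """Shift a frame by (dx, dy) pixels; pixels with no source data get `fill`."""
    height, width = len(frame), len(frame[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            sx, sy = x - dx, y - dy  # where this output pixel came from
            if 0 <= sx < width and 0 <= sy < height:
                out[y][x] = frame[sy][sx]
    return out


# A stale 2x3 "world view" frame that arrived before the latest mouse movement.
frame = [[1, 2, 3],
         [4, 5, 6]]

# Shift the image one pixel to the right to approximate the new view direction.
shifted = reproject(frame, dx=1, dy=0)
```

The left column of `shifted` holds `None`: that is the region the comment says cannot be shown, because the server never sent pixels for it.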

rowanG077

You can do parallax if you use the depth buffer.

WantonQuantum

A simple thing to start with would be akin to Sensor Assisted Video Encoding where phone accelerometers and digital compasses are used to give hints to video encoding: https://ieeexplore.ieee.org/document/5711656

Also, for 2d games a simple sideways scrolling game could give very accurate motion vectors for the background and large foreground linearly moving objects.

I'm surprised at the number of people disagreeing with your idea here. I think HN has a lot of "if I can't see how it can be done then it can't be done" people.

Edit: Also any 2d graphical overlays like HUDs, maps, scores, subtitles, menus, etc could be sent as 2d compressed data, which could enable better compression for that data - for example much sharper pixel perfect encoding for simple shapes.

derf_

> I think HN has a lot of "if I can't see how it can be done then it can't be done" people.

No, HN has, "This has been thought of a thousand times before and it's not actually that good of an idea," people.

The motion search in a video encoder is highly optimized. Take your side-scroller as an example. If several of your neighboring blocks have the same MV, that is the first candidate your search is going to check, and if the match is good, you will not check any others. The check itself has specialized CPU instructions to accelerate it. If the bulk of the screen really has the same motion, the entire search will take a tiny fraction of the encoding time, even in a low-latency, real-time scenario. Even if you reduce that to zero, you will barely notice.
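
As a toy illustration of that early exit, the Python sketch below checks the neighbors' vectors first and only falls back to an exhaustive search if none of them matches well enough. It is not how any particular encoder is written: the function names, the `good_enough` threshold, and the tiny search range are assumptions made for the example.

```python
def sad(current, reference, top, left, size, mv):
    """Sum of absolute differences between a block and its displaced reference."""
    dy, dx = mv
    return sum(
        abs(current[top + y][left + x] - reference[top + dy + y][left + dx + x])
        for y in range(size)
        for x in range(size)
    )


def search_block(current, reference, top, left, size, neighbor_mvs,
                 search_range=2, good_enough=0):
    """Return (best_mv, best_cost, candidates_checked) for one block."""
    height, width = len(reference), len(reference[0])
    best_mv, best_cost, checked = None, None, 0

    def consider(mv):
        nonlocal best_mv, best_cost, checked
        dy, dx = mv
        # Skip vectors that would read outside the reference frame.
        if not (0 <= top + dy <= height - size and 0 <= left + dx <= width - size):
            return
        checked += 1
        cost = sad(current, reference, top, left, size, mv)
        if best_cost is None or cost < best_cost:
            best_mv, best_cost = mv, cost

    # 1. Try the vectors of neighboring blocks first; stop on a good match.
    for mv in neighbor_mvs:
        consider(mv)
        if best_cost is not None and best_cost <= good_enough:
            return best_mv, best_cost, checked

    # 2. Otherwise fall back to an exhaustive search of the window.
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            consider((dy, dx))
    return best_mv, best_cost, checked


# Side-scroller style test data: the whole frame moved one pixel to the right.
reference = [[16 * r + c * c for c in range(8)] for r in range(8)]
current = [[row[0]] + row[:-1] for row in reference]

hinted = search_block(current, reference, 2, 3, 2, neighbor_mvs=[(0, -1)])
unhinted = search_block(current, reference, 2, 3, 2, neighbor_mvs=[])
```

Both calls find the same vector; the hinted one stops after a single candidate, which is the case described above where uniform motion makes the search nearly free.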

On the other end of the spectrum, consider a modern 3D engine. There will be many things not describable by block-based motion of the underlying geometry: shadows, occlusions, reflections, clouds or transparency, shader effects, water or atmospheric effects, etc. Even if you could track the "real" motion through all of that, the best MV to use for compression does not need to match the real motion (which might be very expensive to code, while something "close enough" could be much cheaper, as just one possible reason), it might come from any number of frames (not necessarily the most recent), etc., so you still need to do a search, and it's not obvious the real motion is much better as a starting point than the heuristics an encoder already uses, where they even differ.

All of that said, some encoder APIs do allow providing motion hints [0], you will find research papers and theses on the topic, and of course, patents. That the technique is not more widespread is not because no one ever tried to make it work.

[0] https://docs.nvidia.com/video-technologies/video-codec-sdk/1... as the first random result of a simple search.

WantonQuantum

> If several of your neighboring blocks have the same MV

I think we’re mostly agreeing here. Finding the MVs in any block takes time. Time that can be saved by hints about the direction of motion. Sure, once some motion vectors are found then other blocks benefit by initially assuming similar vectors. To speed things up why not give the hints right away if they’re known a priori?

mikepurvis

I’ve wondered about this as well, like most clients should be capable of still doing a bit of compositing. Like if you sent billboard renders of background objects at lower fidelity/frequency than foreground characters, updated hud objects with priority and using codecs that prioritize clarity, etc.

It was always shocking to me that Stadia was literally making their own games in house and somehow the end result was still just a streamed video and the latency gains were supposed to come from edge deployed gpus and a wifi-connected controller.

Then again, maybe they tried some of this stuff and the gains weren't worth it relative to battle-tested video codecs.

toast0

For 2d sprite games, OMG yes, you could provide some very accurate motion vectors to the encoder. For 3d rendered games, I'm not so sure. The rendering engine has (or could have) motion vectors for the 3d objects, but you'd have to translate them to the 2d world the encoder works in; I don't know if it's reasonable to do that ... or if it would help the encoder enough to justify.

sudosysgen

Schemes like DLSS already do provide 2D motion vectors, it's not necessarily a crazy ask.

markisus

The ultimate compression is to send just the user inputs and reconstitute the game state on the other end.

w-ll

The issue is the "reconstitute the game state on the other end" when it comes to at least how I travel.

I haven't in a while, but I used to use https://parsec.app/ on a cheap intel Air to do my STO dailies on vacation. It sends inputs, but gets a compressed stream. I'm curious whether there is an open-source equivalent.

Zardoz84

Good old DooM save demos are essentially this.
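
The idea can be sketched in a few lines of Python, assuming the simulation is fully deterministic (no unseeded randomness, no timing dependence). The toy `step` function and its command names are hypothetical and stand in for a real game's tick logic.

```python
def step(state, command):
    """Advance a toy game state (x, y, tick) by one input command."""
    x, y, tick = state
    moves = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}
    dx, dy = moves.get(command, (0, 0))  # unknown commands act as "wait"
    return (x + dx, y + dy, tick + 1)


def run(initial_state, commands):
    """Replay a list of recorded inputs from a known initial state."""
    state = initial_state
    for command in commands:
        state = step(state, command)
    return state


# Only this list crosses the network (or is stored in the demo file).
recorded_inputs = ["right", "right", "down", "wait", "left"]

server_state = run((0, 0, 0), recorded_inputs)
client_state = run((0, 0, 0), recorded_inputs)
```

Both sides end in the same state because they ran the same code on the same inputs. The catch is that the receiver needs the full game and the hardware to run it.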

cma

> Things like motion prediction would be free since most rendering engines already have a dedicated buffer just for that for its own rendering, for example.

Doesn't work for translucency and shader animation. The latter can be made to work if the shader can also calculate motion vectors.

WithinReason

Instead of motion vectors you probably want to send RGBD (+depth) so the client can compute its own motion vectors based on input, depth, and camera parameters. You get instant response to user input this way, but you need to in-paint disocclusions somehow.

_kb

This is a really nice walkthrough of matching trade offs to acceptable distortions for a known signal type. Even if you’re selecting rather than designing a codec, it’s a great process to follow.

For those interested in the ultra low latency space (where you’re willing to trade a bit of bandwidth to gain quality and minimise latency), VSF have a pretty good wrap up of other common options and what they each optimise for: https://static.vsf.tv/download/technical_recommendations/VSF...

keketi

Have an LLM transcribe what is happening in the game into a few sentences per frame, transfer the text over network and have another LLM reconstruct the frame from the text. It won't be fast, it's going to be lossy, but compression ratio is insane and it's got all the right buzzwords.

jameshart

Frame 1:

You are standing in an open field west of a white house, with a boarded front door. There is a small mailbox here.

nusl

They did this :P

https://www.youtube.com/watch?v=ZpCrBBj6AWE

Eduard

(user input: mouse delta: (-20, -8))

Frame 2:

A few blades of grass sway gently in the breeze. The camera begins to drift slightly, as if under player control — a faint ambient sound begins: wind and birds.

taneq

Ah, this explains why there are clowns under the bed and creepy children staring at me from the forest.

Y_Y

kill jester

cyclotron3k

Send the descriptions via the blockchain so there's an immutable record

poglet

Maybe one day we'll even reach the point where the game can run locally on the end user's machine.

foota

You've got my attention

raphman

Very cool - That's nearly exactly what I need for a research project.

FWIW, there's also the non-free JPEG-XS standard [1] which also claims very low latency [2] and might be a safer choice for commercial projects, given that there is a patent pool around it.

[1] https://www.jpegxs.com/

[2] https://ds.jpeg.org/whitepapers/jpeg-xs-whitepaper.pdf

jamesfmilne

JPEG-XS is great for low latency, but it uses more bandwidth. We're using it for low-latency image streaming for film/TV post production:

https://www.filmlight.ltd.uk/store/press_releases/filmlight-...

We currently use the IntoPIX CUDA encoder/decoder implementation, and SRT for the low-level transport.

You can definitely achieve end-to-end latencies <16ms over decent networks.

We have customers deploying their machines in data centres and using them in their post-production facilities in the centre of town, usually over a 10GbE link. But I've had others using 1GbE links between countries, running at higher compression ratios.

indolering

A patent pool doesn't make you safer: it's just a patent troll charging you to cross the bridge. They are not offering insurance against more patent trolls blackmailing you after you cross the bridge.

raphman

While I am personally opposed to software patents, I'd argue that the JPEG XS patent holders [1] are not 'patent trolls' in any meaningful sense of the word.

While I have no personal experience on that topic, I'd assume that a codec with a patent pool is a safer bet for a commercial project. Key aspects being protected by patents makes it less likely that some random patent troll or competitor extorts you with some nonsense patent. Also, using e.g., JPEG XS instead of e.g., pyrowave also ensures that you won't be extorted by the JPEG XS patent holders.

One may call this a protection racket - but under the current system, it may make economic sense to pay for a license instead of risking expensive lawsuits.

[1] https://www.jpegxspool.com/

rcxdude

>Key aspects being protected by patents makes it less likely that some random patent troll or competitor extorts you with some nonsense patent

Does it? how? Patents can overlap, for example. Unless there's some indemnity or insurance for fighting patent lawsuits as part of the pool, it's a protection only against those patent holders, not other trolls.

Thaxll

There is the creator of VLC that is working on something similar, very cutting edge.

https://streaminglearningcenter.com/codecs/an-interview-with...

Ultra low latency for streaming.

https://www.youtube.com/watch?v=0RvosCplkCc

torginus

Having worked in the space, I'd have to say hardware encoders and H.264 are pretty dang good - NVENC works with very little latency (if you tell it to, and disable the features that increase it, such as multiple frame prediction, B-frames).

The two things that increase latency are more advanced processing algorithms, giving the encoder more stuff to do, and schemes that require waiting multiple frames. If you go disable those, the encoder can pretty much start working on your frame the nanosecond the GPU stops rendering to it, and have it encoded in <10ms.

Wowfunhappy

> have it encoded in <10ms.

For context, OP achieved 0.13 ms with his codec.

pjc50

"0.13 ms on a RX 9070 XT on RADV."

"interesting data point is that transferring a 4K RGBA8 image over the PCI-e bus is far slower than compressing it on the GPU like this, followed by copying over the compressed payload."

"200mbit/s at 60 fps"

It's certainly a very different set of tradeoffs, using a lot more bandwidth.
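
For a sense of scale, here is the arithmetic behind those numbers as a small Python sketch. It rests on assumptions the quotes do not state: that "4K" means 3840x2160, that "200mbit/s" means 200 * 10^6 bits per second, and that the bitrate figure applies to the same 4K RGBA8 frames.

```python
WIDTH, HEIGHT = 3840, 2160            # assumption: "4K" is 3840x2160
BYTES_PER_PIXEL = 4                   # RGBA8: four 8-bit channels
FPS = 60
STREAM_BITS_PER_SECOND = 200_000_000  # assumption: "200mbit/s" = 200 * 10**6 bit/s

# Size of one uncompressed frame and the raw bitrate at 60 fps.
raw_frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
raw_bits_per_second = raw_frame_bytes * 8 * FPS

# Byte budget per compressed frame, and the implied compression ratio
# if the bitrate really does refer to 4K RGBA8 input.
compressed_frame_bytes = STREAM_BITS_PER_SECOND / 8 / FPS
ratio = raw_frame_bytes / compressed_frame_bytes

print(f"raw frame:        {raw_frame_bytes / 1e6:.1f} MB")
print(f"raw bitrate:      {raw_bits_per_second / 1e9:.1f} Gbit/s")
print(f"compressed frame: {compressed_frame_bytes / 1e3:.0f} kB")
print(f"implied ratio:    {ratio:.0f}:1")
```

Under those assumptions a raw frame is about 33 MB against a budget of roughly 417 kB per compressed frame, which is consistent with the point that copying the compressed payload over PCIe is much cheaper than copying the raw image.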

torginus

I don't have the timings right now but you can go significantly below 10ms.

There's a tradeoff between quality and encoding time - for example, if you want your motion vector reference to go back 4 frames, instead of 2, then the encoder will take longer to run, and you get better quality at no extra bitrate, but more runtime.

If your key-to-screen latency has an irreducible 50-60ms part of rendering, processing, data transfer, decoding and display, then the extra 10ms is just roughly 17-20% more latency, but you have to find the correct tradeoff for yourself.

your_challenger

But isn't the OP talking about local network while Jean-Baptiste Kempf is talking about the internet?

dishsoap

10ms is quite long in this context.

RobRivera

>10 ms

Do not shame this dojo.

latchkey

Sadly appears to be unavailable.

Cadwhisker

This CODEC uses the same base algorithm as HTJ2K (High-Throughput JPEG 2000).

If the author is reading this, it would be very interesting to read about the differences between this method and HTJ2K.

Fidelix

Unbelievable... Good job mate.

Can't wait until one day this gets into Moonlight or something like it.

cpeth

Exactly what I was thinking. Wish I had the time and expertise to give adding support for this codec myself a go. Streaming Clair Obscur over my LAN via Sunshine / Moonlight is exactly my use-case and the latency could definitely be better.

CharlesW

> Given how niche and esoteric this codec is, it’s hard to find any actual competing codecs to compare against.

It'd be interesting to see benchmarks against H.264/AVC (see example "zero‑latency" ffmpeg settings below) and JPEG XS.

  -c:v libx264 -preset ultrafast -tune zerolatency \
  -x264-params "keyint=1:min-keyint=1:scenecut=0:rc-lookahead=0" \
  -bf 0 -b:v 8M -maxrate 8M -bufsize 1M

freshtake

If you're focused solely on local network streaming, you can throw most of the features of modern codecs out the window. The trade-off is bandwidth, but if the network can support 100 Mbps, you can get remarkably low latency with relatively little processing.

For example, Microsoft's DXT codec lacks most modern features (no entropy coding, motion comp, deblocking, etc.), but delivers roughly 4x to 8x compression and is hardware decodable (saving on decoding and presentation latency).
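
The 4x to 8x range follows from the fixed block sizes of the DXT formats: DXT1 stores each 4x4 pixel block in 8 bytes and DXT5 stores it in 16 bytes. A quick Python check, where the `compression_ratio` helper is just a hypothetical name for the division:

```python
BLOCK_PIXELS = 4 * 4  # DXT formats compress fixed 4x4 pixel blocks


def compression_ratio(uncompressed_bytes_per_pixel, block_bytes):
    """Uncompressed size of one 4x4 block divided by its compressed size."""
    return BLOCK_PIXELS * uncompressed_bytes_per_pixel / block_bytes


dxt1_vs_rgb24 = compression_ratio(3, 8)    # DXT1: 8 bytes per block
dxt1_vs_rgba32 = compression_ratio(4, 8)
dxt5_vs_rgba32 = compression_ratio(4, 16)  # DXT5: 16 bytes per block

print(dxt1_vs_rgb24, dxt1_vs_rgba32, dxt5_vs_rgba32)  # 6.0 8.0 4.0
```

Because every block compresses to the same size regardless of content, the ratio is constant, which is part of why the format is so easy to decode in hardware.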

Of course, once you've tuned the end to end capture-encode-transmit-decode-display loop to sub 10 ms, you then have to contend with the 30-100 ms of video processing latency introduced by the display :-)

kookamamie

Not bad. The closest competition would be NDI from NewTek, now Vizrt. It targets similar bitrate and latency ranges.

monster_truck

Looks like NDI without any of the conveniences.

You're doing something wrong if nvenc is any slower, the llhp preset should be all you need.
