Saturday, November 12, 2016

The Components of Motion-to-Photon Latency in Virtual Reality Systems

As I wrote in an earlier blog post [EC], Virtual Reality (VR) requires a low motion-to-photon latency, which is the time it takes for user movement to be fully reflected on the display. Low latency is critical to delivering an engaging and comfortable VR experience. In real life, the motion-to-photon latency is essentially zero, since our sensory and motor systems are tightly coupled [Giz]. Most people agree that if the motion-to-photon latency is below 20ms, the lag is no longer perceptible [Oculus]. The latency that the Oculus Rift Development Kits have achieved (I did not find any figures for the consumer version of the Rift) is typically in the range of 30ms to 50ms, including time for sensing, data arrival over USB, sensor fusion, game simulation, rendering, and video output. Sony's PlayStation VR achieves a latency of less than 18ms [PS]. Thus, it appears that in the case of local rendering of VR content, the motion-to-photon latency problem has been solved, at least for most users. I am writing "most users" since research has shown that sensitivity to lag varies widely: the most sensitive persons can notice lags of 3.2ms, whereas the least sensitive can accept hundreds of milliseconds [Giz].

What about live streaming of VR video? Most solutions that stream VR video appear to operate by streaming the full 360 degree spherical video to the end user, whose computer extracts the viewport and displays it on the HMD (Head-Mounted Display). This is, for instance, how VR live streaming on Microsoft Azure Media Services works [MS]. Another alternative, which Facebook seems to be using, is not to stream the entire 360 degree video but slightly more than the visible field-of-view (FOV), which makes it possible for the user's device to react to head movement locally [FB].

When streaming the full spherical video to the user, motion-to-photon latency is less of a problem, since the user's device can account for head movement locally by presenting the subset of the 360 degree image that is within the user's FOV. The downside of streaming the full spherical video is that it wastes bandwidth, since most of the content is not within the user's FOV. Also, if only the viewport were streamed to the user, a higher resolution for the FOV could be achieved. Thus, an attractive option would be for the remote server to stream only the exact viewport to the user. This, however, might not be entirely realistic if a 20ms motion-to-photon latency is desired. As an example, the Vahana VR live VR streaming solution from VideoStitch has an approximate delay of 300ms between a camera and an SDI TV connected to the SDI output of Vahana VR [VS]. Note that this delay does not include the delivery of the VR video over a network or CDN; if that were included, the total end-to-end delay would be approximately 5-30s depending on the cache configuration of the CDN [VS].
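To see why full-sphere streaming is wasteful, here is a rough back-of-the-envelope sketch in Python. The function and the 90x90 degree FOV are hypothetical, the projection is assumed to be a simple equirectangular mapping, and projection distortion is ignored, so the numbers are only illustrative:

```python
# Rough estimate of how much of an equirectangular 360 frame falls within
# a typical HMD field of view. The FOV values are assumptions for illustration.

def viewport_fraction(fov_h_deg: float = 90.0, fov_v_deg: float = 90.0) -> float:
    """Fraction of a full equirectangular frame covered by the viewport."""
    return (fov_h_deg / 360.0) * (fov_v_deg / 180.0)

if __name__ == "__main__":
    frac = viewport_fraction()
    print(f"Viewport covers roughly {frac:.1%} of the full frame;")
    print(f"roughly {1.0 - frac:.1%} of the streamed pixels fall outside the FOV.")
```

With these assumptions, only about 12.5% of the transmitted pixels end up inside the viewport, which is why viewport-only streaming is so attractive from a bandwidth point of view.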

So what are all the components of the end-to-end latency when streaming live video from a remote camera to a VR headset? These appear to include at least the following:

  • Camera latency – this is the latency from scene capture to the start of the video raster. According to [DP], the latency can be 5ms for a 1080p video frame.
  • Image capture – the latency taken to ingest a video frame [DP]. Digital cameras use either CCD (Charge-Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) image sensors to convert what the lens sees into a digital format and copy it to memory. The capture frequency of the sensor defines how many exposures the sensor delivers per time unit, that is, how many frames it can capture per second [Axis]. For instance, a capture rate of 60fps means that the sensor captures one frame every 16.7ms, that is, the capture latency is 16.7ms (this kind of frame-interval and buffer arithmetic is illustrated in the sketch after this list).
  • Image enhancements – once the raw image has been captured, each frame goes through a pipeline of enhancement processing such as de-interlacing, scaling, and image rotation. Each of these steps adds latency. The higher the resolution, the more pixels the processor needs to process. However, the increase in processing time for a higher resolution can be balanced by a faster processing unit in high-resolution cameras. According to [Cast], the capture post-processing latency is less than 0.5ms for 1080p30 video in a carefully designed low-latency video system that uses hardware codecs.
  • Transfer delay – occurs when the frame needs to be sent from the camera over an interface. For 1080p video, this delay can be 10ms for a PCIe transfer [DP]; transmission over a USB interface may require 7ms per frame [Eberlein].
  • Stitching – the process of combining multiple images with overlapping fields of view into a single panoramic image. According to VideoStitch, their Vahana VR live streaming solution takes approximately 25ms to 35ms to stitch a single image [VS]. Nokia's OZO Live (real-time 360 stitching software) adds a 30-frame (1 second) delay by default for 4K@30fps video [Nokia].
  • Encoding – this is about accessing the captured picture from memory, encoding it, and providing the encoded picture in memory. More advanced compression algorithms produce higher latency. However, although H.264, for instance, is more advanced than MJPEG (which compresses each video frame separately as a JPEG image), the difference in encoding latency is only a few microseconds. According to [Cast], the latency introduced by video coding, not including buffering, can be as low as 0.5ms for 1080p30 video in a low-latency video system. Based on [Axis], it can take 1ms for a camera to scale and encode a 1080p image.
  • Buffering – the encoder uses an input buffer and an output buffer that both contribute delay [Tiwari]. The input buffer is filled with data that is then sent to the codec for processing; some H.264 coding configurations use an input buffer of one frame. The output buffer is needed since the encoder generates a variable amount of encoded bitstream for each frame: if transmission is done at a constant or regulated bit rate, the bits need to be stored in an encoder output buffer. Encoded data is placed in the output buffer, and its content is consumed once it is full. The size of the output buffer can vary from a number of frames (e.g., more than 30) to a sub-frame (e.g., ¼ frame, that is, 8.3ms for 30fps video), meaning that the latency introduced by the buffer can vary from milliseconds up to one second [Cast].
  • Packetization & send – here the encoded picture is packetized and the packetized data is sent over the network. According to [Cast], network processing such as RTP/UDP/IP encapsulation can take as little as 0.01ms or less for 1080p30 video in a low-latency video system.
  • Network delay – transmission of the packets over the best-effort Internet adds latency, jitter, and packet loss. Each hop a packet traverses in the Internet introduces propagation delay, transmission delay, processing delay, and queuing delay [HPBN]. As for propagation delay, assuming a single-hop fiber-optic cable between New York and San Francisco, the propagation delay for a packet over that link would be about 21ms [HPBN]. The transmission delay is the amount of time required to push all of the packet's bits onto the link, which is a function of the packet's length and the data rate of the link. Processing delay is the amount of time required to process the packet header, check for bit-level errors, and determine the packet's destination. Queuing delay is the time the packet spends waiting in the router's queue until it can be processed.
  • Receiving and de-packetization – in this step, the packetized data is received over the network, de-packetized, and provided in memory.
  • Decoding – the encoded picture is accessed from memory, decoded, and the decoded picture is provided in memory. The decoding latency depends on what hardware decoder support is present in the graphics card. It is typically faster to decode in hardware than in software due to latency overheads related to memory transfers and task-level management from the operating system. In a low-latency video system, the decompression delay could be as low as 0.5ms [Axis]. However, to ensure that the decoder does not "starve" and that a uniform frame rate can be presented to the user, a playout buffer is used to compensate for variations introduced by the network. This buffer contributes to the latency on the client side. The decoder buffer is the dominant latency contributor in most video streaming applications [D&R]. The latency added by the decoder buffer could vary from a number of frames (e.g., more than 30) to a sub-frame (e.g., ¼ frame) [Axis]. In VoIP and video conferencing systems, the decoder buffer (a.k.a. jitter buffer) can introduce 30-100ms of latency (many systems use an adaptive jitter buffer). More information about jitter buffers is available in this blog post.
  • Display delay – includes copying the decoded picture from memory, encoding and serializing the stream of pixel color information (for a 1920x1080 picture, each pixel requires 30 bits of data: 8 bits each for red, green, and blue plus the bits required by TMDS (Transition Minimized Differential Signaling) [Syn], a technology used for transmitting high-speed serial data in HDMI), sending it via TMDS to the monitor, decoding the signal at the receiving device, and displaying the picture (frame/field) on the monitor [SE]. Based on [Cast], display pre-processing can take around 0.5ms. The display refresh interval also adds to the delay: for typical computer monitors it is around 14-15ms [Axis], while special gaming monitors can have a refresh interval of 4-5ms.
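To make a few of the figures in the list concrete, here is a minimal sketch of the underlying arithmetic. The helper functions are my own, and the frame rates, buffer sizes, and route length are illustrative assumptions taken from the items above, not measurements:

```python
# Illustrative calculations for a few of the latency components listed above.
# All input values are assumptions taken from the list, not measurements.

SPEED_OF_LIGHT_KM_PER_MS = 300.0   # ~300 km per millisecond in vacuum
FIBER_REFRACTIVE_INDEX = 1.5       # light travels roughly c/1.5 inside optical fiber

def frame_interval_ms(fps: float) -> float:
    """Capture (or refresh) latency of a single frame at the given frame rate."""
    return 1000.0 / fps

def buffer_latency_ms(buffer_size_frames: float, fps: float) -> float:
    """Latency added by an encoder or decoder buffer expressed in frames."""
    return buffer_size_frames * frame_interval_ms(fps)

def fiber_propagation_ms(distance_km: float) -> float:
    """One-way propagation delay over a direct fiber-optic link."""
    return distance_km / (SPEED_OF_LIGHT_KM_PER_MS / FIBER_REFRACTIVE_INDEX)

if __name__ == "__main__":
    print(f"Frame interval at 60fps:        {frame_interval_ms(60):.1f} ms")       # ~16.7 ms
    print(f"1/4-frame buffer at 30fps:      {buffer_latency_ms(0.25, 30):.1f} ms")  # ~8.3 ms
    print(f"30-frame buffer at 30fps:       {buffer_latency_ms(30, 30):.0f} ms")    # ~1000 ms
    # Assumed route length of ~4,148 km between New York and San Francisco [HPBN]
    print(f"New York-San Francisco (fiber): {fiber_propagation_ms(4148):.0f} ms")   # ~21 ms
```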
So how large of a motion-to-photon latency will we end up with if we add up all of the components above (assuming 1080p60fps video)?
  • Camera: 5ms
  • Capture: 16.7ms
  • Image enhancements: 0.5ms
  • Transfer: 10ms
  • Stitching: 25ms
  • Encoding: 1ms
  • Buffering at encoder: 16.7ms (full frame buffering)
  • Packetization & send: 0.01ms
  • Network delay: 10ms (assuming an LTE network)
  • Buffering at decoder: 30ms (assuming low level of jitter in the network)
  • Decoding: 0.5ms
  • Display pre-processing: 0.5ms
  • Display refresh rate: 16.7ms
  • Total: 132.6ms (which still excludes, e.g., operating system overhead; a quick check of the sum is sketched below)
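As a quick sanity check of the arithmetic, the budget above can be summed with a few lines of Python (the values are the assumed figures from the list, not measurements):

```python
# Sum the assumed latency budget from the list above (values in milliseconds).
latency_budget_ms = {
    "camera": 5.0,
    "capture": 16.7,
    "image enhancements": 0.5,
    "transfer": 10.0,
    "stitching": 25.0,
    "encoding": 1.0,
    "buffering at encoder": 16.7,
    "packetization & send": 0.01,
    "network delay": 10.0,
    "buffering at decoder": 30.0,
    "decoding": 0.5,
    "display pre-processing": 0.5,
    "display refresh": 16.7,
}

total = sum(latency_budget_ms.values())
print(f"Total end-to-end latency: {total:.1f} ms")  # ~132.6 ms, excluding OS overhead
```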
One real-life example of end-to-end latency is VideoStitch's Vahana VR live software, which has a delay of 300ms between a camera and a TV connected through SDI (i.e., not covering all of the steps listed above) [VS].

Note that not all of the components above are relevant in a scenario where the user is for instance streaming 360 video from a remote live concert to his or her VR headset, even if the viewport is created by a server in the cloud. In that scenario, it is important to minimize the latency associated with the feedback loop between the sensors on the VR headset capturing head movement and the viewport generated by the remote server reflecting the movement. The latency between the remote server and the remote cameras is typically less important. However, low end-to-end latency between the remote camera and the VR headset can be critical in a use case such as remote control of a drone that is flying at high speed.

[Axis] Latency in live network video surveillance, http://www.axis.com/files/whitepaper/wp_latency_live_netvid_63380_external_en_1504_lo.pdf

[Cast] Video Streaming with Near-Zero Latency Using Altera Arria V FPGAs and Video and Image Processing Suite Plus the Right Encoder, https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-cast-low-latency.pdf

[D&R] Understanding - and Reducing - Latency in Video Compression Systems, http://www.design-reuse.com/articles/33005/understanding-latency-in-video-compression-systems.html

[DP] Live Capture with Parallel Processing, https://www.datapath.co.uk/tbd/whitepapers/datapath_low_latency.pdf

[Eberlein] Understanding video latency, http://www.vision-systems.com/content/dam/VSD/solutionsinvision/Resources/Sensoray_video-latency_article_FINAL.pdf

[EC] Nokia Believes VR and AR Will Require A Highly Distributed Network, http://edge-of-cloud.blogspot.fi/2016/10/nokia-believes-vr-and-ar-will-require.html

[FB] Next-generation video encoding techniques for 360 video and VR, https://code.facebook.com/posts/1126354007399553/next-generation-video-encoding-techniques-for-360-video-and-vr/

[Giz] The Neuroscience of Why Virtual Reality Still Sucks, http://gizmodo.com/the-neuroscience-of-why-vr-still-sucks-1691909123

[HPBN] Primer on Latency and Bandwidth, https://hpbn.co/primer-on-latency-and-bandwidth/

[MS] Virtual reality live streaming on Azure Media Services, https://azure.microsoft.com/en-us/blog/live-virtual-reality-streaming/

[Nokia] How much delay does OZO Live add? How can I adjust the delay of OZO Live? https://support.ozo.nokia.com/s/article/ka058000000U1uTAAS/How-much-delay-does-LIVE-add-How-can-I-adjust-the-delay-of-LIVE

[Oculus] The Latent Power of Prediction, https://developer3.oculus.com/blog/the-latent-power-of-prediction/

[PS] PlayStation VR: The Ultimate FAQ, http://blog.us.playstation.com/2016/10/03/playstation-vr-the-ultimate-faq/

[SE] How does a GPU/CPU communicate with a standard display output? (HDMI/DVI/etc), http://electronics.stackexchange.com/questions/102695/how-does-a-gpu-cpu-communicate-with-a-standard-display-output-hdmi-dvi-etc

[Syn] Understanding HDMI 2.0: Enabling the Ultra-High Definition Experience, https://www.synopsys.com/Company/Publications/SynopsysInsight/Pages/Art3-hdmi-2.0-IssQ3-13.aspx?cmp=Insight-I3-2013-Art3

[Tiwari] Buffer-constrained rate control for low bitrate dual-frame video coding, http://code.ucsd.edu/pcosman/TiwariM_icip08.pdf

[VS] What is the latency to stream a 360 video? http://support.video-stitch.com/hc/en-us/articles/203657056-What-is-the-latency-to-stream-a-360-video-
