Saturday, November 19, 2016

What Are Ideal Refresh and Frame Rates for VR?

Refresh rate refers to the number of times a complete image is drawn on the screen per second [VE]. It is different from frame rate - the refresh rate may include the repeated drawing of identical frames, whereas frame rate measures how often a video source can feed a new (non-repeated) frame to a display [WP2].

The refresh rate of a VR headset and the frame rate of the rendered content are among the factors that can cause motion sickness (a.k.a. cybersickness or simulator sickness) if they are too low. There are several reasons for this, including at least flicker and sensory mismatch.


Flicker

The human eye perceives a stable image without flicker artifacts when a display updates at a sufficiently fast rate, called the flicker fusion rate [Nature]. The critical flicker fusion rate is defined as the rate at which human perception can no longer distinguish modulated light from a stable field. This rate varies with intensity and contrast, with the fastest luminance variation a person can detect typically falling in the range of 50-90Hz. According to standards for display ergonomics, a refresh rate of 72Hz for computer displays is sufficient to avoid flicker completely. Sensitivity to flicker differs between the fovea (a small, central pit composed of closely packed cones in the eye, located in the center of the macula lutea of the retina [WP3]) and peripheral vision (i.e., the edges of the field of view). According to [LaViola], a refresh rate of 30Hz is usually enough to remove perceived flicker from the fovea. However, the human eye is most sensitive to flicker in its peripheral vision, meaning that the periphery requires higher refresh rates.

If the refresh rate of a display is too low, the user can perceive flicker of the entire virtual environment (a.k.a. judder), which can lead to motion sickness [WP1]. The brain can begin to experience judder (especially in the periphery) if the frame rate drops below 65fps [Nature], 75fps [MTG] or 90fps, depending on which source one believes. According to Oculus, 75Hz panels are fast enough that the majority of users will not perceive any noticeable flicker [Oculus]. The reason different sources cite different values might be that the number of frames per second detectable by humans varies considerably from one person to another [LI]. Furthermore, people can also be trained to perceive visual information more quickly. A recent MIT study found that participants can identify images seen for as little as 13ms, which corresponds to about 77fps [MIT]. A U.S. Air Force study has shown that fighter pilots can identify planes seen for as little as 4.55ms, which would mean about 220fps [AMO].
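
As a sanity check on these figures, the conversion from a minimum perceivable exposure time to an equivalent frame rate is just a reciprocal. A minimal sketch in Python (the helper name is mine; the exposure times are the ones quoted above):

```python
def exposure_to_fps(exposure_ms: float) -> float:
    """Convert a minimum perceivable exposure time (in ms) into the
    equivalent frame rate (frames per second)."""
    return 1000.0 / exposure_ms

print(exposure_to_fps(13.0))   # MIT study: ~77 fps
print(exposure_to_fps(4.55))   # Air Force study: ~220 fps
```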

Recent studies [Nature] have shown that sensitivity to flicker drops to zero near 65Hz, but only when the modulated light source is spatially uniform. When the modulated light source contains a spatial high-frequency edge (in the simplest case, an image that is bright on the left half of the frame and black on the right and then inverted; natural images are a more complex example, since they contain many edges), all test subjects studied in [Nature] saw flicker artifacts at over 200Hz, and several of them reported visible flicker artifacts at over 800Hz. For the median viewer, flicker artifacts disappear only above 500Hz, meaning that the median viewer can distinguish between modulated light and a stable field at up to 500Hz. This is most likely due to unconscious rapid eye movements across high-frequency edges in the displayed image. The implication of these findings is that modern display designs which use complex spatio-temporal coding need to update much faster than conventional TVs.

It should be noted that the impact of flicker differs between a TV, a computer monitor, and the displays of a VR headset. Low refresh rates in computer monitors, which are viewed up close, produce a more noticeable screen flicker because the display fills a larger proportion of a person’s field of vision than a TV screen that is typically viewed from a distance [VE]. VR headsets are even more challenging than computer monitors, since they fill an even larger portion of the user’s field of view, or even all of it.

Despite their many advantages for VR (e.g., low persistence), OLED displays produce some amount of flicker, similar to CRT displays [Oculus]. When an LED-backlit monitor is set to maximum brightness, its LEDs are typically glowing at full strength [FP]. If the brightness is reduced, the LEDs need to emit less light. This is achieved by inserting small pauses during which the LEDs turn off for a very short time. The same happens with CCFL (Cold Cathode Fluorescent Lamp) backlit LCD displays, but CCFLs have a much longer afterglow than LEDs, which turn off almost instantly. Thus, CCFLs are easier on the eyes than LEDs. Most LED-backlit monitors use Pulse Width Modulation (PWM) with a frequency ranging from 90Hz to 400Hz.
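
To make the PWM dimming concrete, here is a small illustration (my own sketch, not taken from [FP]) of how a brightness setting maps to on/off times within each PWM cycle, assuming the duty cycle scales linearly with brightness:

```python
def pwm_on_off_ms(pwm_freq_hz: float, brightness: float) -> tuple:
    """Return (on_time_ms, off_time_ms) of one PWM cycle for a backlight
    dimmed to the given brightness (0.0-1.0), assuming a linear duty cycle."""
    period_ms = 1000.0 / pwm_freq_hz
    on_ms = period_ms * brightness
    return on_ms, period_ms - on_ms

# A 180 Hz PWM backlight at 50% brightness is dark for ~2.8 ms of every
# ~5.6 ms cycle -- pauses long enough for some users to notice as flicker.
print(pwm_on_off_ms(180.0, 0.5))
```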


Sensory Conflicts

Cybersickness is believed to occur primarily as a result of conflicts between three sensory systems: visual, vestibular and proprioceptive [UQO]. The vestibular system is a complex sensory system in the inner ear that provides the sense of balance and spatial orientation. Proprioceptive information comes from receptors in the muscles, joints, and bones and is processed primarily in the cerebellum [CS]. In normal situations, the information coming from the visual, vestibular, and proprioceptive systems is in agreement.

The first example of a conflict involves errors in the vestibulo-ocular reflex caused by the lag between head movements and the corresponding updates on the display. In this case, the eyes perceive movement that is out of sync by a few milliseconds with what the vestibular system perceives. Here, the solution is high frame rate rendering, possibly together with other techniques such as high-precision low-latency tracking and low-persistence displays (low persistence works by illuminating the display only briefly and then turning it off until the next frame is ready, in contrast to keeping it illuminated continuously from one frame to the next [RV]). These technologies can reduce motion sickness since they minimize the mismatch between a user’s visual perception of the virtual environment and the response of their vestibular system [Fernandes]. As an example, OLED displays have low persistence, which significantly reduces blur during head movement. Oculus claims that the low-persistence display of the Rift eliminates motion blur and judder almost completely.
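
A back-of-the-envelope way to see why low persistence reduces blur during head movement is to estimate how far the eyes sweep while a single frame stays lit (the head speed and persistence values below are my own example numbers):

```python
def retinal_smear_deg(head_speed_deg_per_s: float, persistence_ms: float) -> float:
    """Angular distance (in degrees) swept by the eyes while one frame
    remains illuminated, i.e., the apparent smear of a world-fixed object."""
    return head_speed_deg_per_s * (persistence_ms / 1000.0)

# A moderate 100 deg/s head turn:
print(retinal_smear_deg(100.0, 11.1))  # full-persistence 90 Hz frame: ~1.1 degrees of smear
print(retinal_smear_deg(100.0, 2.0))   # ~2 ms low-persistence flash: ~0.2 degrees
```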

The second, and perhaps more challenging, example is VR users who do not, or cannot, physically move the same way they move virtually. In this case, a high frame rate, high-precision low-latency tracking and low-persistence displays do not help. What happens in such situations is that while the eyes indicate that the person has moved, the vestibular and proprioceptive systems indicate that the person has not. One example is when the character is walking in the virtual environment while the player is in reality standing still or sitting. This is the inverse of what happens when a person reads in a moving vehicle: there, the eyes perceive no movement while the vestibular and proprioceptive systems do.

There is at least one potential solution to the mismatch that arises when the visual system perceives motion while the vestibular and proprioceptive systems do not. The Mayo Clinic, a nonprofit medical practice and research group, has patented a technology called Galvanic Vestibular Stimulation (GVS), which synchronizes the inner ear with what a person is viewing [Forbes]. GVS uses strategically placed electrodes to trick the inner ear into perceiving motion [TC]. The electrodes send specific electrical signals to nerves [Samsung]. Four such electrodes are used: one behind each ear, one on the forehead, and one at the nape of the neck. The electrodes are then all linked in real time so that any movement in the visual field launches a synchronized GVS command. Besides vMocion, the company that has licensed the Mayo Clinic technology [TC], Samsung has also been experimenting with GVS. It recently showed off a special project it has been working on, called the Entrim 4D headphones [Samsung].


Conclusion

So how high a refresh rate and frame rate should a VR system offer? Oculus Rift and HTC Vive have a 90Hz refresh rate and a frame rate of 90fps, whereas PlayStation VR can provide up to 120Hz and 120fps [PS]. According to AMD’s Radeon Technologies Group, in order for VR to reach true immersion that one cannot tell apart from the real world, 240Hz and 240fps are required [TT]. However, even this may not be enough, as indicated by the above-mentioned study in which some subjects detected flicker artifacts even above 800Hz.

Nvidia has debuted a working version of a 1700Hz display [DT]. Combined with low-latency input, the display can maintain a stable image even when zoomed in on with a microscope and shaken vigorously. Doing the same on a 90Hz display such as the ones used by today’s VR headsets would result in a lot of blurring. A 90Hz display refreshes the image every 11.1ms, whereas Nvidia’s display does so every 0.59ms. Thus, a 1700Hz display in practice eliminates any lag from the display, which would help bring the motion-to-photon latency down. Nvidia believes that while a motion-to-photon latency of less than 20ms is generally considered good enough for VR, things get better towards 10ms, and there are even measurable benefits at a 1ms latency [RV].
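
The frame periods quoted above follow directly from the refresh rates; a tiny helper (mine) reproduces them:

```python
def frame_period_ms(refresh_hz: float) -> float:
    """Time between display refreshes, in milliseconds."""
    return 1000.0 / refresh_hz

for hz in (90, 120, 240, 1700):
    print(f"{hz:>5} Hz -> {frame_period_ms(hz):.2f} ms per refresh")
# 90 Hz -> 11.11 ms and 1700 Hz -> 0.59 ms, matching the figures above
```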

Thus, to eliminate motion sickness for even the most sensitive users and enable 1ms-level motion-to-photon latency, a (close to) zero-latency display operating at 1700Hz would definitely be beneficial. However, such a display would also need a GPU that can feed it frames (ideally, something like 16K video for each eye [EC]) at 1700fps, which is something we will not be seeing any time soon.


References

[AMO] Human Eye Frames Per Second, http://amo.net/NT/02-21-01FPS.html

[CS] Tactile, Vestibular and Proprioceptive Senses, http://cherringtonsawers.com/tactile-vestibular-and-proprioceptive-senses.html

[DT] Nvidia’s Prototype 1,700Hz Display Could Unlock Frame Rates for Future VR, http://www.digitaltrends.com/virtual-reality/nvidia-1700hz-vr-display/

[EC] What Kind of a Resolution Is Needed to Deliver Perfect VR?, http://edge-of-cloud.blogspot.fi/2016/11/what-kind-of-resolution-is-needed-to.html

[Fernandes] Combating VR Sickness through Subtle Dynamic Field-Of-View Modification, http://www.cs.columbia.edu/2016/combating-vr-sickness/images/combating-vr-sickness.pdf

[FP] LED Monitors can cause headaches due to flicker, http://www.flatpanelshd.com/focus.php?subaction=showfull&id=1362457985

[Forbes] Mayo Clinic May Have Just Solved One Of Virtual Reality's Biggest Problems, http://www.forbes.com/sites/jasonevangelho/2016/03/30/mayo-clinic-may-have-just-solved-one-of-virtual-realitys-biggest-problems

[GS] Virtual reality overcoming the barrier of motion sickness, http://www.glitchstudios.co/post-1/

[LaViola] A Discussion of Cybersickness in Virtual Environments, http://www.eecs.ucf.edu/~jjl/pubs/cybersick.pdf

[LI] Does FPS Matter? Decide for Yourself, http://blog.logicalincrements.com/2015/04/does-fps-matter-decide-for-yourself/

[MIT] In the blink of an eye, http://news.mit.edu/2014/in-the-blink-of-an-eye-0116

[MTG] Physics and Frame Rate: Beating motion sickness in VR, http://mtechgames.com/2015/12/09/physics-and-frame-rate-beating-motion-sickness-in-vr/

[Nature] Humans perceive flicker artifacts at 500 Hz, http://www.nature.com/articles/srep07861

[Oculus] Simulator Sickness, https://developer3.oculus.com/documentation/intro-vr/latest/concepts/bp_app_simulator_sickness/

[PS] PlayStation VR: The Ultimate FAQ, http://blog.us.playstation.com/2016/10/03/playstation-vr-the-ultimate-faq/

[RV] NVIDIA Demonstrates Experimental “Zero Latency” Display Running at 1,700Hz, http://www.roadtovr.com/nvidia-demonstrates-experimental-zero-latency-display-running-at-17000hz/

[Samsung] Samsung to Unveil Hum On!, Waffle and Entrim 4D Experimental C-Lab Projects at SXSW 2016, https://news.samsung.com/global/samsung-to-unveil-hum-on-waffle-and-entrim-4d-experimental-c-lab-projects-at-sxsw-2016

[TC] vMocion looks to end motion sickness in virtual reality by tricking your brain, https://techcrunch.com/2016/03/30/vmocion-looks-to-end-motion-sickness-in-virtual-reality-by-tricking-your-brain

[TT] AMD's graphics boss says VR needs 16K at 240Hz for 'true immersion', http://www.tweaktown.com/news/49693/amds-graphics-boss-vr-needs-16k-240hz-true-immersion/index.html

[UQO] Cybersickess, http://w3.uqo.ca/cyberpsy/en/cyberma_en.htm

[VE] Computer Monitors and Digital Televisions - Visual Sensitivity from Vestibular Disorders Affects Choice of Display, http://vestibular.org/sites/default/files/page_files/Computer%20Monitors%20and%20Digital%20Televisions_0.pdf

[WP1] Virtual reality sickness, https://en.wikipedia.org/wiki/Virtual_reality_sickness

[WP2] Refresh rate, https://en.wikipedia.org/wiki/Refresh_rate

[WP3] Fovea centralis, https://en.wikipedia.org/wiki/Fovea_centralis

Sunday, November 13, 2016

What Kind of a Resolution Is Needed to Deliver Perfect VR?

A big challenge for VR headsets is that when we wear the headset, our eyes are very close to the display and view it through a pair of lenses. This makes individual pixels distinguishable even with today’s high-end VR headsets. Insufficient resolution reduces the immersive experience and makes, for example, reading small text problematic.

According to AMD’s Radeon Technologies Group, in order for VR to reach true immersion that one cannot tell apart from the real world, 16K screens (probably 16K per eye) and 240fps are required [TT]. Elsewhere, Palmer Luckey, the founder of Oculus, has said that to get to the point where the user cannot see pixels, about 8K per eye would be needed [AT]. 16K is 15360x8640 pixels, whereas 8K is 7680x4320.

At least Sony has already released a smartphone, the Xperia Z5 Premium, that has a 4K display [TR1]. Samsung has also shown off a 4K smartphone display. Furthermore, Samsung is working on a new type of display that will have 11K resolution with 2250 pixels per inch [PL]. 11K could mean something like 10560x5940 pixels. Samsung expects to have a prototype ready by 2018, and the screen could be available in smartphones in 2019.

But rather than resolution, a more suitable notion for VR environments is pixels per degree. If the human eye were a digital camera, it would have 60 pixels/degree at the fovea, which is the part of the retina where visual acuity is highest [Sensics]. For VR goggles, the pixel density can be calculated by dividing the number of pixels in a horizontal display line by the horizontal FOV (Field of View) provided by the headset. The HTC Vive has 1080 horizontal pixels per eye (i.e., 2160 in total) and a FOV of 110 degrees, resulting in a pixel density of about 19.6 pixels/degree (2160/110). This is still pretty far from 60 pixels/degree. For human eyes, the combined horizontal visual field is 180-200 degrees [WP1, WP2]. Thus, a VR headset offering a wide 200-degree FOV at 60 pixels/degree would require on the order of 12,000 horizontal pixels in total, or roughly 6,000 per eye. This means that Samsung’s coming 11K display (10560 horizontal pixels) might actually get close to 60 pixels/degree, which feels quite promising.
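
The pixels-per-degree figures above can be reproduced with a one-line calculation. A minimal sketch using the same simple definition as the text (horizontal pixels divided by horizontal FOV, ignoring lens distortion and the overlap between the eyes):

```python
def pixels_per_degree(horizontal_pixels: int, horizontal_fov_deg: float) -> float:
    """Approximate angular pixel density, ignoring lens distortion."""
    return horizontal_pixels / horizontal_fov_deg

print(pixels_per_degree(2160, 110))    # HTC Vive, both eyes combined: ~19.6
print(pixels_per_degree(12000, 200))   # ~12,000 horizontal pixels over 200 degrees: 60.0
print(pixels_per_degree(10560, 200))   # Samsung's planned 11K display: ~52.8
```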

A wide FOV is crucial for an immersive VR experience: it avoids a tunnel-vision effect and allows for peripheral vision and a good sense of the user’s visual surroundings in the virtual environment [TV]. Some examples of the horizontal FOVs (in degrees) of different VR headsets are as follows [VT]:

  • Google Cardboard: 90
  • Samsung Gear VR: 96
  • HTC Vive: 110
  • Oculus Rift: 110
  • VR Union Claire: 170
  • StarVR: 210

StarVR, which is still under development, has the widest FOV among VR headsets [VT]. 210 degrees offers a FOV beyond the human eye’s peripheral vision. According to some early reviews, the 210-degree FOV is an eye-opener in the virtual world and makes other headsets (e.g., the Rift and the Vive with their 110-degree FOV) feel like binoculars [RV]. Achieving a wide FOV requires a complex set of lenses. StarVR combines normal and Fresnel elements, which, as a downside, brings challenges for clarity. StarVR has a per-eye resolution of 2560x1440, but the extra pixels (compared to competitors) are obviously stretched over the wider FOV.

Besides the horizontal FOV, the vertical FOV also has an impact on the VR experience, allowing the eyes to dart up and down more naturally without encountering the dark edges of the headset [TV]. The HTC Vive has a better vertical FOV than the Oculus Rift since it uses custom screens that are oriented vertically.

But what would be the ideal FOV for a VR headset? Even a VR headset with a 180-degree FOV will not match how we experience the physical world. This is because the human eyes can see up to a 270-degree FOV if the eyes are fully rotated [VT]. Thus, even with a 180-degree FOV, one might still experience a tunnel-vision effect (which is probably the reason StarVR went for 210 degrees).

An additional aspect to consider is that a lower FOV helps with motion sickness [WP3, WS]. Decreasing the FOV results in fewer visual cues, and the brain does not register as wide a discrepancy between sensory input from the eyes and the vestibular system in the inner ear (this mismatch is what causes motion sickness in VR). The challenge is naturally that reducing the FOV makes the VR experience less immersive by strengthening the tunnel-vision effect. Fortunately, research is ongoing to solve this issue. As an example, researchers from Columbia University have developed an approach that involves masking the user’s view to minimize motion sickness symptoms [NA]. The approach adjusts the visual range on the fly to reduce motion sickness. When the player is in motion in the virtual environment, the system partially obscures each eye with a soft-edged, circular, virtual cutout, reducing the FOV to minimize motion sickness. When there is less action, the FOV is increased again.
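
A minimal sketch of the idea behind such dynamic FOV restriction, assuming a simple linear mapping from the player’s virtual locomotion speed to the rendered FOV (the thresholds and FOV limits are illustrative values of my own, not those used in the Columbia study):

```python
def restricted_fov_deg(virtual_speed_m_s: float,
                       full_fov: float = 110.0,
                       min_fov: float = 80.0,
                       max_speed: float = 5.0) -> float:
    """Shrink the rendered FOV as virtual locomotion speed increases,
    interpolating linearly between full_fov (at rest) and min_fov
    (at or above max_speed)."""
    t = min(max(virtual_speed_m_s / max_speed, 0.0), 1.0)
    return full_fov - t * (full_fov - min_fov)

# Standing still -> 110 degrees, walking at 2 m/s -> 98, sprinting -> 80.
for speed in (0.0, 2.0, 5.0):
    print(speed, restricted_fov_deg(speed))
```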

Increased resolution, FOV, and pixels per degree of course require more GPU power. Since the Sony Xperia Z5 Premium smartphone already comes with a 4K display, one could assume that someone packing two of those displays into a VR headset should not be that far off. Assuming that such a headset had at least the same 90Hz refresh rate as the Rift and the Vive, we would need a GPU capable of running 4K@90fps on both displays, or one GPU per display. Nvidia’s latest Titan X desktop GPU, which is supposed to be the most powerful consumer graphics card on the market [TR2], has demonstrated performance of 4K@81fps (and should, according to the specs, be capable of 8K@60Hz output, that is, 7680x4320@60Hz at the maximum) [NV]. Therefore, it would seem possible to run 2x4K at around 80Hz with two Titan X GPUs. The only problem is that two Titan X GPUs cost $2400.
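
To put these numbers into perspective, a back-of-the-envelope comparison of raw pixel throughput (my own rough arithmetic, ignoring stereo rendering overhead and the oversampling needed for lens distortion correction):

```python
def gpixels_per_second(width: int, height: int, fps: float, displays: int = 1) -> float:
    """Raw pixel throughput in gigapixels per second."""
    return width * height * fps * displays / 1e9

print(gpixels_per_second(3840, 2160, 81))      # Titan X demo, one 4K display at 81fps: ~0.67
print(gpixels_per_second(3840, 2160, 90, 2))   # two 4K displays at 90fps: ~1.49
print(gpixels_per_second(15360, 8640, 240))    # AMD's 16K@240fps target: ~31.9
```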

So, to conclude: 4K smartphone displays are already here, 11K displays might be coming to smartphones in 2019, and today’s top consumer graphics cards can already output 8K@60Hz. Thus, by the time Samsung’s 11K displays are out, it does not seem unrealistic to expect consumer graphics cards that can deliver 11K@120fps. So perhaps by 2019, we could reach an over 200-degree FOV with 60 pixels/degree and 120fps, that is, VR that we cannot tell apart from the real world. However, it feels like 16K@240fps, which is AMD’s goal, is still further off in the future.

However, VR hardware makers are looking into ways to reduce the required processing power. One phenomenon that can help is foveal vision: when we look at an object in the distance, the focal point of our gaze is in focus, while the scene around the object is blurred. In VR, this principle can be used to fully render only the specific area where the user is looking, leaving the rest of the scene at a far lower resolution [MT]. This has the potential to bring large performance gains and reduce the load on the GPU. At least the Fove VR headset and Nvidia are applying this trick to VR.
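
A rough sketch of why foveated rendering saves so much work: only a small foveal region is shaded at full resolution, while the periphery is shaded at a reduced scale (the region size and downscale factor below are my own illustrative assumptions, not Fove’s or Nvidia’s actual parameters):

```python
def foveated_workload_fraction(fovea_area_fraction: float, peripheral_scale: float) -> float:
    """Fraction of the full-resolution shading work that remains when the fovea
    is rendered at full resolution and the periphery at a reduced linear scale."""
    peripheral_area = 1.0 - fovea_area_fraction
    return fovea_area_fraction + peripheral_area * peripheral_scale ** 2

# Example: the foveal region covers 5% of the image and the periphery is
# rendered at 1/4 linear resolution -> only ~11% of the shading work remains.
print(foveated_workload_fraction(0.05, 0.25))
```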

[AT] Virtual Perfection: Why 8K resolution per eye isn’t enough for perfect VR, http://arstechnica.com/gaming/2013/09/virtual-perfection-why-8k-resolution-per-eye-isnt-enough-for-perfect-vr/

[MT] Nvidia’s Eye-Tracking Tech Could Revolutionize Virtual Reality, https://www.technologyreview.com/s/601941/nvidias-eye-tracking-tech-could-revolutionize-virtual-reality/

[NA] Restricting field of view to reduce motion sickness in VR, http://newatlas.com/columbia-university-vr-motion-sickness/43855/

[NV] Nvidia Titan X, http://www.geforce.com/hardware/10series/titan-x-pascal

[PL] Forget about 4K and even 8K, Samsung is making 11K displays, http://www.pocket-lint.com/news/134580-forget-about-4k-and-even-8k-samsung-is-making-11k-displays

[RV] Hands-on: The New and Improved StarVR Prototype Will Give You Field-of-View Envy, http://www.roadtovr.com/starvr-headset-hands-on-field-of-view-e3-2016/

[SC] Are 4K 144hz gaming monitors coming out this year?, http://steamcommunity.com/discussions/forum/11/357288572138893130/

[Sensics] Understanding Pixel Density and Eye-Limiting Resolution, http://sensics.com/understanding-pixel-density-and-eye-limiting-resolution/

[TR1] Samsung teases its 4K, VR-ready phone display, http://www.techradar.com/news/phone-and-communications/mobile-phones/samsung-teases-its-4k-vr-ready-phone-display-1322341

[TR2] The 10 best graphics cards of 2016, http://www.techradar.com/news/computing-components/graphics-cards/best-graphics-cards-1291458

[TT] AMD's graphics boss says VR needs 16K at 240Hz for 'true immersion', http://www.tweaktown.com/news/49693/amds-graphics-boss-vr-needs-16k-240hz-true-immersion/index.html

[TV] 4 Most Important Tech Specs When Shopping For VR, https://topvr.co.uk/shopping/4-important-tech-specs-shopping-vr/

[VT] Comparison Chart of FOV (Field of View) of VR Headsets, http://www.virtualrealitytimes.com/2015/05/24/chart-fov-field-of-view-vr-headsets/

[WP1] Human Eye, https://en.wikipedia.org/wiki/Human_eye

[WP2] Field of View, https://en.wikipedia.org/wiki/Field_of_view

[WP3] Virtual Reality Sickness, https://en.wikipedia.org/wiki/Virtual_reality_sickness

[WS] 8 ways to prevent HTC Vive motion sickness, http://filmora.wondershare.com/virtual-reality/8-ways-to-prevent-htc-vive-motion-sickness.html

Saturday, November 12, 2016

The Components of Motion-to-Photon Latency in Virtual Reality Systems

As I wrote in an earlier blog post [EC], Virtual Reality (VR) requires a low motion-to-photon latency, which is the time needed for user movement to be fully reflected on the display. Low latency is critical to delivering an engaging and comfortable VR experience. In real life, the motion-to-photon latency is essentially zero since our sensory and motor systems are tightly coupled [Giz]. Most people agree that if the motion-to-photon latency is below 20ms, the lag is no longer perceptible [Oculus]. The latency that the Oculus Rift Development Kits have achieved (I did not find any figures for the consumer version of the Rift) is typically in the range of 30ms to 50ms, including time for sensing, data arrival over USB, sensor fusion, game simulation, rendering, and video output. Sony’s PlayStation VR achieves a latency of less than 18ms [PS]. Thus, it appears that in the case of local rendering of VR content, the motion-to-photon latency problem has been solved at least for most users. I write "most users" since research has shown that sensitivity to lag varies wildly: the most sensitive persons can notice lags of 3.2ms, whereas the least sensitive can accept hundreds of milliseconds [Giz].

What about live streaming of VR video? Most solutions that stream VR video appear to stream the full 360-degree spherical video to the end user, whose computer extracts the viewport and displays it on the HMD (Head-Mounted Display). This is how, for instance, VR live streaming on Microsoft Azure Media Services works [MS]. Another alternative, which Facebook seems to be using, is to stream not the entire 360-degree video but slightly more than the visible field of view (FOV), which makes it possible for the user’s device to react to head movement locally [FB].

When streaming the full spherical video to the user, motion-to-photon latency is less of a problem since the user’s device can account for head movement locally by presenting the subset of the 360-degree image that is within the user’s FOV. The downside of streaming the full spherical video is that it wastes bandwidth, since most of the content is not within the user’s FOV. Also, if only the viewport were streamed to the user, a higher resolution for the FOV could be achieved. Thus, an attractive option would be for the remote server to stream only the exact viewport to the user. This, however, might not be entirely realistic if a 20ms motion-to-photon latency is desired. As an example, the Vahana VR live streaming solution from VideoStitch has an approximate delay of 300ms between a camera and an SDI TV connected to the SDI output of Vahana VR [VS]. Note that this delay does not include the delivery of the VR video over a network or CDN; if that were included, the total end-to-end delay would be approximately 5-30s depending on the cache configuration of the CDN [VS].
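
To get a feel for how much bandwidth full-sphere streaming wastes, consider what fraction of an equirectangular 360x180-degree frame a typical viewport covers. A very rough sketch (it ignores projection distortion and the extra margin a real system such as Facebook’s would stream):

```python
def viewport_fraction(h_fov_deg: float, v_fov_deg: float) -> float:
    """Approximate fraction of an equirectangular 360x180-degree frame
    covered by a rectangular viewport, ignoring projection distortion."""
    return (h_fov_deg / 360.0) * (v_fov_deg / 180.0)

# A 110x110-degree viewport covers only ~19% of the full frame, so streaming
# the whole sphere spends roughly 80% of the bits outside the user's FOV.
print(viewport_fraction(110, 110))
```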

So what are all the components of the end-to-end latency when streaming live video from a remote camera to a VR headset? These appear to include at least the following:

  • Camera latency - this is the latency from scene capture to the start of the video raster. According to [DP], this latency can be 5ms for a 1080p video frame.
  • Image capture – the latency taken to ingest a video frame [DP]. Digital cameras use either CCD (Charge-Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) image sensors to convert what the lens sees into a digital format and copy it to memory. The capture frequency of the sensor defines how many exposures the sensor delivers per time unit, that is, how many frames it can capture per second [Axis]. For instance, a capture rate of 60fps means that the sensor captures one frame every 16.7ms, that is, the capture latency is 16.7ms.
  • Image enhancements – once the raw image has been captured, each frame goes through a pipeline of enhancement processing such as de-interlacing, scaling and image rotation. Each of these steps adds latency. The higher the resolution, the more pixels the processor needs to handle. However, the increase in processing time for a higher resolution can be balanced by a faster processing unit in high-resolution cameras. According to [Cast], the capture post-processing latency is less than 0.50ms for 1080p30 video in a carefully designed low-latency video system that uses hardware codecs.
  • Transfer delay – occurs when the frame needs to be sent from the camera over an interface. For 1080p video, this delay can be 10ms for a PCIe transfer [DP]. Transmission over a USB interface may require 7ms per frame [Eberlein].
  • Stitching, which is the process of combining multiple images with overlapping fields of view to produce a panorama image. According to VideoStitch, their Vahana VR live streaming solution takes approximately between 25ms and 35ms to stitch a single image [VS]. Nokia’s OZO Live (real-time 360 stitching software) adds a 30-frame (1 second) delay by default (for 4K@30fps video) [Nokia].
  • Encoding – this is about accessing the captured picture from memory, encoding it, and providing the encoded picture in memory. More advanced compression algorithms produce a higher latency. However, while for instance H.264 is more advanced than MJPEG (which compresses each video frame separately as a JPEG image), the difference in latency during encoding is only a few microseconds. According to [Cast], the latency introduced by video coding, not including buffering, can be as low as 0.5ms for 1080p30 video in a low-latency video system. Based on [Axis], it can take 1ms for a camera to scale and encode a 1080p image.
  • Buffering – the encoder uses an input buffer and an output buffer that contribute delay [Tiwari]. The input buffer is filled with data which is sent to the codec for processing. Some coding formats for H.264 use an input buffer of one frame. The output buffer is needed since the encoder generates a variable amount of encoded bitstream for each frame. If transmission is done at a constant or regulated bit rate, bits need to be stored in an encoder output buffer. Encoded data is placed in the output buffer, whose content is consumed once it is full. The size of the output buffer can vary from a number of frames (e.g., more than 30) to a sub-frame (e.g., ¼ frame, that is, 8.3ms for 30fps video), meaning that the latency introduced by the buffer can vary from milliseconds up to one second [Cast].
  • Packetization & send – here the encoded picture is packetized and the packetized data is sent over the network. According to [Cast], network processing such as RTP/UDP/IP encapsulation can take as little as 0.01ms or less for 1080p30 video in a low-latency video system.
  • Network delay – transmission of the packets over the best-effort Internet adds latency, jitter, and packet loss. Each hop that a packet traverses in the Internet introduces propagation delay, transmission delay, processing delay, and queuing delay [HPBN]. For propagation delay, assuming a single-hop fiber-optic cable between New York and San Francisco, a packet would take about 21ms to traverse that link [HPBN]. The transmission delay is the amount of time required to push all the packet’s bits into the link, which is a function of the packet’s length and the data rate of the link. Processing delay is the amount of time required to process the packet header, check for bit-level errors, and determine the packet’s destination. Queuing delay is the time the packet spends waiting in the router’s queue until it can be processed.
  • Receiving and de-packetization – in this step, the packetized data is received over the network, de-packetized, and provided in memory.
  • Decoding – the encoded picture is accessed from memory, decoded, and the decoded picture is provided in memory. The decoding latency depends on what hardware decoder support is present in the graphics card. It is typically faster to decode in hardware than in software due to latency overheads related to memory transfers and task-level management from the operating system. In a low-latency video system, the decompression delay could be as low as 0.5ms [Axis]. However, to ensure that the decoder does not "starve" and that a uniform frame rate can be presented to the user, a playout buffer is used to compensate for variations introduced by the network. This buffer contributes to the latency on the client side. The decoder buffer is the dominant latency contributor in most video streaming applications [D&R]. The latency added by the decoder buffer could vary from a number of frames (e.g., more than 30) to a sub-frame (e.g., ¼ frame) [Axis]. In VoIP and video conferencing systems, the decoder buffer (a.k.a. jitter buffer) can introduce 30-100ms of latency (many systems use an adaptive jitter buffer). More information about jitter buffers is available in a separate blog post.
  • Display delay – includes copying the decoded picture from memory, encoding and serializing the stream of pixel color information (for a 1920x1080 picture, each pixel requires 30 bits on the wire: 8 bits each for red, green and blue, plus the extra bits added by TMDS (Transition Minimized Differential Signaling) [Syn], the technology used for transmitting high-speed serial data in HDMI), sending it via TMDS to the monitor, decoding the signal at the receiving device, and displaying the picture (frame/field) on the monitor [SE]. Based on [Cast], display pre-processing can take around 0.5ms. The display refresh interval also adds to the delay: for a typical computer monitor it is around 14-15ms [Axis], whereas special gaming monitors can get down to 4-5ms.

So how large a motion-to-photon latency do we end up with if we add up all of the components above (assuming 1080p60 video)? A rough budget, also summed up in the sketch after the list, looks as follows:

  • Camera: 5ms
  • Capture: 16.7ms
  • Image enhancements: 0.5ms
  • Transfer: 10ms
  • Stitching: 25ms
  • Encoding: 1ms
  • Buffering at encoder: 16.7ms (full frame buffering)
  • Packetization & send: 0.01ms
  • Network delay: 10ms (assuming an LTE network)
  • Buffering at decoder: 30ms (assuming low level of jitter in the network)
  • Decoding: 0.5ms
  • Display pre-processing: 0.5ms
  • Display refresh rate: 16.7ms
  • Total: ~132.6ms (which still excludes, e.g., operating system overhead)
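
The same budget as a small script, so that individual assumptions are easy to tweak (the values are the example figures collected above; a real system would of course differ):

```python
# Rough motion-to-photon budget for live-streamed 1080p60 VR video, in milliseconds.
latency_budget_ms = {
    "camera": 5.0,
    "capture (one 60fps frame)": 16.7,
    "image enhancements": 0.5,
    "transfer (PCIe)": 10.0,
    "stitching": 25.0,
    "encoding": 1.0,
    "encoder buffering (one frame)": 16.7,
    "packetization & send": 0.01,
    "network (LTE)": 10.0,
    "decoder/jitter buffer": 30.0,
    "decoding": 0.5,
    "display pre-processing": 0.5,
    "display refresh": 16.7,
}

print(f"Total: {sum(latency_budget_ms.values()):.2f} ms")  # ~132.6 ms, excluding e.g. OS overhead
```
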
One real-life example of end-to-end latency is VideoStitch’s Vahana VR live streaming software, which, as noted above, has a delay of approximately 300ms between the camera and a TV connected through SDI (i.e., covering only some of the steps listed above) [VS].

Note that not all of the components above are relevant in a scenario where the user is for instance streaming 360 video from a remote live concert to his or her VR headset, even if the viewport is created by a server in the cloud. In that scenario, it is important to minimize the latency associated with the feedback loop between the sensors on the VR headset capturing head movement and the viewport generated by the remote server reflecting the movement. The latency between the remote server and the remote cameras is typically less important. However, low end-to-end latency between the remote camera and the VR headset can be critical in a use case such as remote control of a drone that is flying at high speed.

[Axis] Latency in live network video surveillance, http://www.axis.com/files/whitepaper/wp_latency_live_netvid_63380_external_en_1504_lo.pdf

[Cast] Video Streaming with Near-Zero Latency Using Altera Arria V FPGAs and Video and Image Processing Suite Plus the Right Encoder, https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-cast-low-latency.pdf

[D&R] Understanding - and Reducing - Latency in Video Compression Systems, http://www.design-reuse.com/articles/33005/understanding-latency-in-video-compression-systems.html

[DP] Live Capture with Parallel Processing, https://www.datapath.co.uk/tbd/whitepapers/datapath_low_latency.pdf

[Eberlein] Understanding video latency, http://www.vision-systems.com/content/dam/VSD/solutionsinvision/Resources/Sensoray_video-latency_article_FINAL.pdf

[EC] Nokia Believes VR and AR Will Require A Highly Distributed Network, http://edge-of-cloud.blogspot.fi/2016/10/nokia-believes-vr-and-ar-will-require.html

[FB] Next-generation video encoding techniques for 360 video and VR, https://code.facebook.com/posts/1126354007399553/next-generation-video-encoding-techniques-for-360-video-and-vr/

[Giz] The Neuroscience of Why Virtual Reality Still Sucks, http://gizmodo.com/the-neuroscience-of-why-vr-still-sucks-1691909123

[HPBN] Primer on Latency and Bandwidth, https://hpbn.co/primer-on-latency-and-bandwidth/

[MS] Virtual reality live streaming on Azure Media Services, https://azure.microsoft.com/en-us/blog/live-virtual-reality-streaming/

[Nokia] How much delay does OZO Live add? How can I adjust the delay of OZO Live? https://support.ozo.nokia.com/s/article/ka058000000U1uTAAS/How-much-delay-does-LIVE-add-How-can-I-adjust-the-delay-of-LIVE

[Oculus] The Latent Power of Prediction, https://developer3.oculus.com/blog/the-latent-power-of-prediction/

[PS] PlayStation VR: The Ultimate FAQ, http://blog.us.playstation.com/2016/10/03/playstation-vr-the-ultimate-faq/

[SE] How does a GPU/CPU communicate with a standard display output? (HDMI/DVI/etc), http://electronics.stackexchange.com/questions/102695/how-does-a-gpu-cpu-communicate-with-a-standard-display-output-hdmi-dvi-etc

[Syn] Understanding HDMI 2.0: Enabling the Ultra-High Definition Experience, https://www.synopsys.com/Company/Publications/SynopsysInsight/Pages/Art3-hdmi-2.0-IssQ3-13.aspx?cmp=Insight-I3-2013-Art3

[Tiwari] Buffer-contrained rate control for low bitrate dual-frame video coding, http://code.ucsd.edu/pcosman/TiwariM_icip08.pdf

[VS] What is the latency to stream a 360 video? http://support.video-stitch.com/hc/en-us/articles/203657056-What-is-the-latency-to-stream-a-360-video-