Okay, let's imagine it's the future and you've got your fancy light field display. What are you going to watch? Right now, some of the best content available is built using computer-generated techniques, but if you want to see live action or just something that's not a video game, you need something different. Since there are currently no native light field camera systems, light field capture today relies on computer-vision-enabled technologies to fill the gap. Usually, those systems start with big arrays of cameras or sensors, and the techniques diverge from there.
But first, we need to understand the process of creating content for a light field display. What makes light field displays especially cool is also what makes them especially challenging: you're creating an image so dense it's virtually indistinguishable from the real world. That's a lot of discrete light rays you have to recreate: every possible object point from every possible angle, given certain viewing conditions. It's not "infinite" because it can be quantized or constrained, but for high fidelity light field displays you're going to need tens of billions of rays of light. That's a lot of rays!
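To put a rough number on that, here's a back-of-the-envelope sketch. The view counts and per-view resolution below are assumptions picked purely to illustrate the scale, not the spec of any particular display:

```python
# Back-of-the-envelope ray budget for a light field display.
# Every number here is an assumption chosen only to illustrate the scale.
horizontal_views = 100            # distinct viewing directions, left to right
vertical_views = 50               # distinct viewing directions, top to bottom
pixels_per_view = 3840 * 2160     # a 4K image for each viewing direction

total_rays = horizontal_views * vertical_views * pixels_per_view
print(f"{total_rays:,} rays")     # 41,472,000,000 -> tens of billions
```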
Now, you practically can't capture that many pixels. Even if you wanted to assemble a ten-gigapixel camera array—maybe a couple thousand high-end cameras—you simply couldn't pack them tightly enough together to get every single light ray you care about. There would be gaps between cameras, because a lot of the lens isn't actually collecting light; it's blocking out non-image-forming light. Even packing cameras side-to-side throws away more than 99% of the light you care about. How you recover those 99 rays out of 100 — or more likely, the 999,999,999 rays out of a billion — determines the strengths and weaknesses of the approach.
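As a toy illustration of why so much light lands in the gaps, here's a rough calculation assuming a regular grid of cameras. The pitch and aperture numbers are made up, but the conclusion holds for any realistic packing:

```python
import math

# Toy model: cameras on a regular grid, each gathering light only through
# its clear aperture. The pitch and aperture numbers are assumptions.
camera_pitch_mm = 60.0      # spacing between adjacent camera centers
aperture_diameter_mm = 5.0  # effective light-gathering opening per camera

cell_area = camera_pitch_mm ** 2
aperture_area = math.pi * (aperture_diameter_mm / 2) ** 2
captured = aperture_area / cell_area

print(f"captured: {captured:.2%}, missed: {1 - captured:.2%}")
# ~0.55% captured, ~99.45% missed -- those missing rays are what the
# rest of the pipeline has to simulate or interpolate.
```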
For my money, you first face a philosophical choice: are you trying to capture and recreate the light rays from reality (e.g. lots of cameras), or are you trying to synthesize realistic looking light rays (e.g. ray tracing)? Both choices are valid, and both lead to wildly different technical options. If you're willing to simulate reality, you can take a lot of shortcuts: the images coming off your cameras are raw material for a sophisticated simulation of reality, and you can employ tools from the animation industry to simulate everything from hair to subsurface light scattering in skin. To be fair, this is more or less what high-end, visual effects-heavy movies already do today: the output of the camera is material for digital artists and cutting-edge software to create the images the storytellers want. Let's call these synthetic light fields for now.
The second option is to try to recreate each light ray as faithfully as possible, even the ones you didn't manage to sample directly. Light field capture then becomes a signal-processing problem: what do we know about light fields as a signal, and how can we recreate the signal itself as photographically as possible? This means that the light field has the lighting conditions baked into the recreated rays, just like any other photographic capture. [Full disclosure: this is how Visby approaches the problem; while both approaches are valid, I'm very biased toward the "capture-and-playback" approach instead of "simulate reality."] In contrast to the first synthetic option, let’s call these interpolated light fields for now.
The leading candidate on the synthetic side of the fence is often referred to as “volumetric video.” In this approach, you use a huge array of cameras (typically 50 to 100+) to capture the desired surfaces of a performer and use a variety of computer vision techniques to understand the “shape” of the scene. The resulting shape is encoded as either polygons or point clouds and mapped with color (an RGB texture), which can then be played back and manipulated in a 3D engine. In this case, the native light field is lost as it is mapped into a single texture, and every 2D-sampled light ray is simulated from the resulting shape + color construct. In other words, any light ray you want to play back is cast against the shape, and whatever 2D color value is there becomes the color and intensity of your light ray. At the low end, this can be done with fewer cameras and may look less accurate, but high-end operators can produce very realistic imagery.
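A minimal sketch of that playback step might look like the following, assuming the reconstruction pipeline has already produced a triangle mesh, per-vertex texture coordinates, and an RGB texture (all names and data structures here are hypothetical, not any particular engine's API): each requested light ray is intersected with the shape, and the color at the hit point is the ray's color.

```python
import numpy as np

def intersect_triangle(origin, direction, v0, v1, v2):
    """Moller-Trumbore ray/triangle intersection.
    Returns (t, u, v) for a hit (distance plus barycentric coords), else None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < 1e-9:
        return None                      # ray is parallel to the triangle
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return (t, u, v) if t > 1e-9 else None

def shade_ray(origin, direction, vertices, uvs, triangles, texture):
    """Color one requested light ray from the reconstructed shape + texture:
    find the nearest triangle the ray hits, interpolate its texture
    coordinates, and sample the RGB texture there."""
    best = None
    for tri in triangles:
        hit = intersect_triangle(origin, direction,
                                 vertices[tri[0]], vertices[tri[1]], vertices[tri[2]])
        if hit is not None and (best is None or hit[0] < best[0]):
            best = (hit[0], hit[1], hit[2], tri)
    if best is None:
        return np.zeros(3)               # the ray missed the subject entirely
    _, u, v, tri = best
    uv = (1 - u - v) * uvs[tri[0]] + u * uvs[tri[1]] + v * uvs[tri[2]]
    h, w, _ = texture.shape
    x = min(int(uv[0] * (w - 1)), w - 1)
    y = min(int(uv[1] * (h - 1)), h - 1)
    return texture[y, x]
```

A real volumetric pipeline uses acceleration structures and a GPU rasterizer or ray tracer rather than a Python loop, but the principle is the same: the ray's color comes entirely from the baked shape + color construct.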
This approach can be further augmented with dynamic lighting, more complex materials, or a variety of more sophisticated techniques that improve the fidelity. In the extreme case, you don't just capture a series of images; you also attempt to separate the reflectance of the object (reflections, refractions, and specular highlights from the diffuse lighting, etc.). This allows you to infer certain material properties and better simulate images, because you understand how the light rays move across the surfaces of the object. This is the approach taken by the researchers behind the various Light Stages from ICT at USC and OTOY. Note that this is the pinnacle of the “simulate reality” approach, as it gives you high-quality raw material for the imaging engine to use to synthesize light rays.
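This is not the actual Light Stage pipeline, just a toy sketch of why separating reflectance helps: once you have a diffuse albedo and a surface normal for a point, you can re-simulate how it looks under a new light. The model here is a simple Lambertian-plus-Blinn-Phong approximation with made-up values:

```python
import numpy as np

def relight_point(albedo, normal, light_dir, view_dir,
                  specular_strength=0.3, shininess=32.0):
    """Toy re-lighting of one surface point whose reflectance has been
    separated into a diffuse albedo plus a simple specular lobe
    (Lambert diffuse + Blinn-Phong specular)."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)

    diffuse = albedo * max(np.dot(n, l), 0.0)

    h = l + v                             # half-vector for the specular term
    h = h / np.linalg.norm(h)
    specular = specular_strength * max(np.dot(n, h), 0.0) ** shininess

    return np.clip(diffuse + specular, 0.0, 1.0)

# The same captured point, re-rendered under a new (made-up) light direction.
print(relight_point(albedo=np.array([0.8, 0.6, 0.5]),
                    normal=np.array([0.0, 0.0, 1.0]),
                    light_dir=np.array([0.3, 0.3, 1.0]),
                    view_dir=np.array([0.0, 0.0, 1.0])))
```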
This methodology contrasts with photographic light field capture, which uses a similar or potentially greater number of cameras but takes a different path in processing. For native light field capture, the goal is to capture enough light rays that you can interpolate the missing rays effectively. This approach was demonstrated by Lytro as well as other companies that built VR camera systems dense enough that the interpolation was relatively constrained: when you've got cameras a few centimeters apart you have a pretty strong idea of what those in-between rays probably looked like. (That's not to say it's an easy problem! Just not impossible.) This is also the approach taken by Visby today, though our camera arrays are much more widely spaced than the arrays used by Lytro and some others. After capture, a light field system typically retains most or all of the light ray samples as part of the data set and conveys those light rays down to the client device. Rather than using an integrated shape + color to infer light rays, a light field codec uses the original light rays (or some data proxy for them) to build up each novel view. In theory, this means higher data rates, but in the service of more photographically sampled images.
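As a rough sketch of the interpolation idea (not Visby's or Lytro's actual method), suppose the capture is a regular grid of cameras stored as a single array; a missing ray can be estimated by blending the nearest captured cameras. The data layout below is a deliberately naive assumption, and real interpolation also has to account for parallax and scene depth, which is the hard part. This only shows the blending skeleton:

```python
import numpy as np

def estimate_ray(light_field, s, t, u, v):
    """Estimate a ray that fell between cameras in a regular capture grid.

    light_field: array of shape (S, T, H, W, 3) -- an S x T grid of cameras,
    each holding an H x W RGB image (a naive, uncompressed layout for clarity).
    (s, t) is the fractional camera position the desired ray passes through,
    (u, v) the fractional pixel position. Here we simply blend the four
    surrounding cameras bilinearly at the nearest pixel.
    """
    S, T, H, W, _ = light_field.shape
    s0, t0 = int(np.floor(s)), int(np.floor(t))
    s1, t1 = min(s0 + 1, S - 1), min(t0 + 1, T - 1)
    ws, wt = s - s0, t - t0

    x = min(max(int(round(u)), 0), W - 1)   # nearest pixel in each view
    y = min(max(int(round(v)), 0), H - 1)

    c00 = light_field[s0, t0, y, x].astype(float)
    c10 = light_field[s1, t0, y, x].astype(float)
    c01 = light_field[s0, t1, y, x].astype(float)
    c11 = light_field[s1, t1, y, x].astype(float)

    top = (1 - ws) * c00 + ws * c10         # blend along the s axis...
    bottom = (1 - ws) * c01 + ws * c11
    return (1 - wt) * top + wt * bottom     # ...then along the t axis
```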
In practice, the two approaches have a lot in common and are often not clearly differentiated. Much of the research in the domain blends techniques from both sides, and the work to compress and deliver these images borrows from many of the same technologies. In the future, we're going to need both: we'll have VFX-heavy Hollywood-style productions, and we'll have more natively photographed light fields for recording and playback. There is still a lot of research and development required on both sides of the process, but with recent innovations, including Light Field Lab’s work on holographic displays, there are a lot of exciting things heading your way. So stay tuned!