NeRFs & Gaussian Splatting
1. Inverse Rendering
Traditional rendering is slow, & creating 3D assets is hard. Inverse Rendering takes rendered images & tries to reconstruct the 3D scene that produced them.
The input is many images of a static scene, with unchanging lighting. The output is a function that can render the scene from any viewpoint.
1.1 Optimization with SGD
$$\theta^* = \arg\min_\theta \sum_i \mathcal{L}\big(R(\theta, c_i),\, I_i\big)$$

Where:
- $\{(I_i, c_i)\}$ are the posed photographs of the scene (we know the camera pose $c_i$ for each image $I_i$).
- $R$ is the rendering function parameterized by $\theta$.

We want to minimize the difference between the rendered image and the actual image across all camera positions.
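The optimization loop can be sketched with a toy stand-in renderer; `render` here is a made-up linear function (a real renderer would be a full volumetric renderer), but the loop structure is the same:

```python
import numpy as np

def render(theta, cam):
    # Toy "renderer": a 2x2 image that depends linearly on the
    # parameters and a scalar camera value (stand-in for a real renderer).
    return theta * cam

def loss_and_grad(theta, cams, images):
    # L2 photometric loss summed over all posed images, with its
    # analytic gradient w.r.t. theta (easy here because the toy
    # renderer is linear in theta).
    loss, grad = 0.0, np.zeros_like(theta)
    for cam, img in zip(cams, images):
        diff = render(theta, cam) - img
        loss += np.sum(diff ** 2)
        grad += 2.0 * cam * diff
    return loss, grad

# Ground-truth scene parameters and three "posed photographs".
theta_true = np.array([[1.0, 2.0], [3.0, 4.0]])
cams = [0.5, 1.0, 2.0]
images = [render(theta_true, c) for c in cams]

theta = np.zeros((2, 2))          # initial conditions matter!
for _ in range(200):              # plain gradient descent
    loss, grad = loss_and_grad(theta, cams, images)
    theta -= 0.05 * grad
```

With a differentiable renderer, the same loop recovers the scene parameters from the images alone.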
1.2 Initial Conditions
The initial conditions are very important for reaching a good local minimum.
1.3 Volume Rendering
The functions for volumetric rendering happen to be extremely convenient for gradient descent optimization. We can simplify them here by ignoring scattering, assuming all particles either absorb or emit light:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,c(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$

Where:
- $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is a point inside the volume along the ray from the camera to the scene.
- $\mathbf{o}$ is the camera position.
- $\mathbf{d}$ is the direction of the ray from the camera to the scene.
- $c(\mathbf{r}(t), \mathbf{d})$ is the emitted radiance from a point in direction $\mathbf{d}$.
- $\sigma(\mathbf{r}(t))$ is the density at the point.
- $T(t)$ is the transmittance from $t_n$ to $t$, which accounts for how much light is absorbed along the path.

The directional component $\mathbf{d}$ is what lets the representation capture view-dependent effects such as specular highlights.
To approximate this integral, we take $N$ samples along the ray and sum their contributions (numerical quadrature):

$$\hat{C} = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i$$

Where:
- $T_i = \prod_{j=1}^{i-1}(1 - \alpha_j)$ is the transmittance up to the $i$-th sample.
- $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ is the segment opacity for the $i$-th sample, where $\delta_i$ is the distance between samples.
- $c_i$ is the color emitted from the $i$-th sample.
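The quadrature above can be written directly in numpy; the densities, spacings, and colors below are made-up sample values:

```python
import numpy as np

def composite(sigmas, deltas, colors):
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity of segment i.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i = prod_{j<i} (1 - alpha_j): transmittance up to sample i
    # (exclusive cumulative product, T_1 = 1).
    T = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = T * alphas
    # Weighted sum of per-sample colors gives the pixel color.
    return weights @ colors, weights

sigmas = np.array([0.0, 0.5, 3.0, 0.1])       # densities at the samples
deltas = np.full(4, 0.25)                     # spacing between samples
colors = np.array([[1, 0, 0], [0, 1, 0],      # per-sample RGB
                   [0, 0, 1], [1, 1, 1]], dtype=float)
pixel, weights = composite(sigmas, deltas, colors)
```

Note the weights sum to at most 1: whatever transmittance remains after the last sample corresponds to light from the background.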
2. 3D Representations
An explicit representation defines each point directly (e.g. a mesh or point cloud), while an implicit representation defines the surface as the set of points satisfying a condition:

$$S = \{\, \mathbf{x} \in \mathbb{R}^3 : f(\mathbf{x}) = 0 \,\}$$

with, for example, $f(\mathbf{x}) = \|\mathbf{x}\|^2 - r^2$ for a sphere of radius $r$.
A constructive solid geometry represents surfaces by boolean operations on primitives.
A neural network can represent a continuous function of the entire scene.
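As a sketch of implicit surfaces and CSG, signed distance functions can be combined with min/max; the function names here are my own:

```python
import numpy as np

def sphere_sdf(p, center, radius):
    # Signed distance: negative inside, zero on the surface, positive outside.
    return np.linalg.norm(p - center) - radius

def union(d1, d2):        # CSG union: inside either primitive
    return min(d1, d2)

def intersection(d1, d2): # CSG intersection: inside both
    return max(d1, d2)

def subtraction(d1, d2):  # CSG difference: in d1 but not in d2
    return max(d1, -d2)

p = np.array([0.0, 0.0, 0.0])
a = sphere_sdf(p, np.array([0.0, 0.0, 0.5]), 1.0)   # p is inside this sphere
b = sphere_sdf(p, np.array([0.0, 0.0, 3.0]), 1.0)   # p is outside this one
```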
3. NeRFs
We have a volumetric cloud, and a set of posed images. To represent the volume, we use a Neural Radiance Field (NeRF), which is a fully connected neural network that takes in a 3D position and a viewing direction, and outputs the density and emitted radiance at that point.
Positional Encoding is used to help the network learn high-frequency details.
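A minimal sketch of NeRF-style positional encoding, mapping each coordinate to sines and cosines at exponentially increasing frequencies:

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    # gamma(x) = (sin(2^0 pi x), ..., sin(2^{L-1} pi x),
    #             cos(2^0 pi x), ..., cos(2^{L-1} pi x)),
    # applied per coordinate; the high-frequency terms let the MLP
    # fit fine detail it would otherwise smooth over.
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = np.outer(freqs, x)                  # shape (L, D)
    return np.concatenate([np.sin(angles), np.cos(angles)]).ravel()

x = np.array([0.1, -0.3, 0.7])                   # a 3D position
gamma = positional_encoding(x, num_freqs=4)      # 2 * L * D = 24 values
```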
3.1 Representations
- Gaussians: fewer primitives than voxels; better view-dependence than point clouds; fast splat rendering (faster than ray tracing).
- Voxel grids: regular voxel grid; easy to implement & access, but high memory requirements & limited by Nyquist.
- NeRF: continuous scene fn; captures fine detail + view-dep; slow train/render (complex scenes).
- Hash Grids: sparse voxel grid with hash-based indexing; fast access, reduced memory; good for large scenes, but may struggle with fine details.
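The hash-grid idea can be sketched as follows; the three hash primes are the ones used by Instant-NGP, everything else is simplified:

```python
import numpy as np

def hash_index(voxel, table_size):
    # Spatial hash: XOR the integer voxel coordinates multiplied by
    # large primes, wrapped to 64 bits, then reduced modulo the table
    # size. Hash collisions are tolerated: colliding voxels share a
    # feature vector, and training sorts out which details matter.
    primes = (1, 2654435761, 805459861)
    h = 0
    for coord, prime in zip(voxel, primes):
        h ^= (coord * prime) & 0xFFFFFFFFFFFFFFFF
    return h % table_size

table_size = 2 ** 14
features = np.zeros((table_size, 2))   # 2 learned features per table entry
idx = hash_index((12, 5, 9), table_size)
```

Memory is bounded by the table size rather than by the full voxel-grid resolution, which is what makes dense-looking grids affordable for large scenes.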
4. Gaussian Splatting
Primitive-based representations use rasterization instead of ray tracing. They use a point cloud (unstructured geometry). A point cloud is a multi-chart manifold, meaning it can represent complex surfaces without needing a single continuous parameterization. It is invariant to permutations, so we can sort points in screen space. It also supports a view-dependent representation, allowing better handling of view-dependent effects like specular highlights.
4.1 Surface Splatting
- Consider oriented points (surfels) as discrete samples of a texture function on a surface.
- A Gaussian reconstruction kernel is used to recover a continuous signal.
- This is then sampled in screen space.
- The points are scaled with camera distance so that objects have no holes.
- Splats whose normals are slanted relative to the view project as ellipses, so we can reproduce good edges.
- Each sample can be processed in parallel.
4.2 Volume Splatting
Instead of flat surfels, we can use oriented 3D ellipsoids as primitives, which can represent volumetric data. This allows for better handling of complex scenes with varying levels of detail, and can capture view-dependent effects more effectively than surface splatting.
To blend points in screen space, use alpha blending (as this is differentiable). This allows us to give each point an opacity value, which can be used to create smooth transitions between points and capture fine details in the scene.
To render:
- Splat: compute the shape of the Gaussian after projection. The center is projected as before, but the transformation of the shape (covariance matrix) must be approximated with a first-order Taylor expansion of the projection so that it is affine (Gaussians are closed under affine transformations).
- Sort: globally sort the points by depth.
- Blend: alpha composite.
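The sort and blend steps can be sketched as follows, assuming the splat step has already produced per-Gaussian depth, opacity, and color at a given pixel (all made-up values here):

```python
import numpy as np

def blend_pixel(depths, alphas, colors):
    # Sort: order the Gaussians front to back by depth.
    order = np.argsort(depths)
    # Blend: front-to-back alpha compositing, accumulating the
    # remaining transmittance T.
    pixel, T = np.zeros(3), 1.0
    for i in order:
        pixel += T * alphas[i] * colors[i]
        T *= 1.0 - alphas[i]
        if T < 1e-4:          # early exit once the pixel is nearly opaque
            break
    return pixel, T

depths = np.array([2.0, 0.5, 1.0])
alphas = np.array([0.8, 0.5, 0.25])
colors = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
pixel, T = blend_pixel(depths, alphas, colors)
```

Every step is differentiable (away from the discrete sort order), which is what lets the whole pipeline be optimized with gradient descent.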
To optimise a covariance matrix, reparameterize it with a rotation matrix $R$ and a scaling matrix $S$, which are easier to optimize than the covariance matrix directly:

$$\Sigma = R S S^T R^T.$$
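A sketch of this reparameterization, in 2D for brevity: build $\Sigma$ from a rotation angle and per-axis scales, and check that the result is a valid covariance matrix:

```python
import numpy as np

def covariance_from_rot_scale(theta, scales):
    # R: 2D rotation; S: diagonal scaling. Sigma = R S S^T R^T is
    # symmetric positive semi-definite by construction, so the
    # optimizer can update theta and scales freely without ever
    # producing an invalid covariance.
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    S = np.diag(scales)
    return R @ S @ S.T @ R.T

Sigma = covariance_from_rot_scale(np.pi / 6, np.array([2.0, 0.5]))
```

The eigenvalues of $\Sigma$ are exactly the squared scales, and its eigenvectors are the rotated axes, so the parameters stay interpretable (in 3D the rotation is usually stored as a quaternion).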
To splat:
- Given a point cloud $\{p_k\}$ and a point $q$ on the surface.
- Create a local parameterization for the neighbours of $q$.
- Each 3D point $p_k$ is associated to a local 2D coordinate $u_k$.
- A continuous surface function is then $f_c(u) = \sum_k w_k\, r_k(u - u_k)$, where $w_k$ is a weight and $r_k$ is a Gaussian reconstruction kernel.
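The continuous surface function above can be sketched in the local 2D parameterization, using isotropic Gaussian kernels; the coordinates, weights, and bandwidth below are made up:

```python
import numpy as np

def surface_function(u, u_k, w_k, sigma=0.5):
    # f_c(u) = sum_k w_k * r_k(u - u_k), with r_k an (unnormalized)
    # Gaussian reconstruction kernel centered at the local 2D coord u_k.
    d2 = np.sum((u_k - u) ** 2, axis=1)
    r = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.sum(w_k * r)

u_k = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # local 2D coords
w_k = np.array([1.0, 0.5, 0.25])                       # per-point weights
val = surface_function(np.array([0.0, 0.0]), u_k, w_k)
```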