June 28, 2025

MUHAMMAD GHIFARY

Introduction

Imagine walking around your living room and capturing just a handful of photos. Could you magically reconstruct it in 3D, spin it on your phone, and view it from any angle you choose?

That’s the magic of novel view synthesis — a frontier of vision and graphics R&D focused on generating realistic, previously unseen perspectives from existing images. It’s what powers immersive VR walkthroughs, cinematic camera relighting, and virtual twins of real-world scenes.

But with only a few snapshots, how can we handle missing corners, occlusions, or complex lighting? That’s where Neural Radiance Fields (NeRF) steps in (Mildenhall et al., ECCV 2020). It reframes the problem by modeling a scene as a continuous volumetric function from 5D inputs, i.e., point location $(x, y, z) \in \mathbb{R}^3$ and viewing direction $(\theta, \phi) \in \mathbb{R}^2$, to an RGB color ($\mathbf{c} \in \mathbb{R}^3$) and a volume density $(\sigma \in \mathbb{R})$, then using differentiable volume rendering to paint brand-new views.

Historically, novel view synthesis was performed through hole-filled meshes or patchwork image blending. NeRF skips those steps by learning geometry, light interaction, and view-dependent effects in a single neural network, producing photo-realistic novel views from sparse inputs: no stitching, no meshing, just neural rendering magic.

This article walks through the technical details of a basic NeRF implementation with Keras 3.

Novel View Synthesis with NeRF

Before delving into the implementation details, let’s formulate the novel view synthesis problem and see how NeRF tries to solve it. Novel view synthesis is the task of generating images of a scene from a new point of view, given only pictures taken from other points of view.

Source: https://paperswithcode.com/task/novel-view-synthesis

Initially, we have a set of source images $\{ \mathbf{I}_i \}_{i=1}^N$ and their corresponding camera poses $\{ \mathbf{P}_i \}_{i=1}^N$. The goal is to predict the novel view image $\mathbf{I}$ from a new target pose $\mathbf{P}$.

Formally, novel view synthesis attempts to construct a function $F$ leveraging a dataset of image views $\mathcal{D} = \{ (\mathbf{I}_i, \mathbf{P}_i) \}_{i=1}^N$ such that

$$ \mathbf{I}(u, v) = F(\mathbf{P}(u, v); \mathcal{D}) $$

where $(u, v)$ are pixel coordinates and $\mathbf{I}$ is the image of the scene rendered from pose $\mathbf{P}$. Early methods factorize this into explicit geometry reconstruction (e.g., depth maps or meshes) followed by image-based warping and blending.
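
In practice, $F$ is evaluated per pixel: the target pose $\mathbf{P}$ and pixel coordinates $(u, v)$ together define a camera ray through the scene. Below is a minimal, illustrative sketch of this mapping, assuming a pinhole camera with a hypothetical focal length `focal` and a 4x4 camera-to-world matrix `pose_c2w` (the names and axis conventions are ours, not from a specific library or dataset):

```python
import numpy as np

def get_rays(height, width, focal, pose_c2w):
    """Map every pixel (u, v) to a ray origin and direction in world space.

    Assumes a pinhole camera and a 4x4 camera-to-world matrix `pose_c2w`
    with the camera looking down the -z axis; conventions are illustrative.
    """
    # Pixel grid: u runs along image width, v along image height.
    u, v = np.meshgrid(np.arange(width, dtype=np.float32),
                       np.arange(height, dtype=np.float32))

    # Ray directions in the camera frame.
    dirs = np.stack([(u - 0.5 * width) / focal,
                     -(v - 0.5 * height) / focal,
                     -np.ones_like(u)], axis=-1)             # (H, W, 3)

    # Rotate directions into the world frame; all rays share one origin.
    rays_d = dirs @ pose_c2w[:3, :3].T                       # (H, W, 3)
    rays_o = np.broadcast_to(pose_c2w[:3, 3], rays_d.shape)  # (H, W, 3)
    return rays_o, rays_d
```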

NeRF frames this problem as modeling a scene with a 5D continuous function $f_\omega: \mathbb{R}^5 \rightarrow \mathbb{R}^4$, which is a trainable neural network with parameters $\omega$.

The 5D input space comprises a point location sampled along the camera ray (which is essentially a straight line through the scene), $\mathbf{r} = (x, y, z)$, and the ray’s viewing angle $\mathbf{d} = (\theta, \phi)$. In practice, the 2D viewing angle is converted into a 3D unit direction vector $\mathbf{d} = (d_x, d_y, d_z)$.
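
As a small illustration, this angle-to-vector conversion can be written with the standard spherical-to-Cartesian mapping (axis conventions vary across implementations, and NeRF code frequently just normalizes the ray direction from the earlier sketch instead):

```python
import numpy as np

def angles_to_direction(theta, phi):
    """Convert a viewing angle (theta, phi) into a 3D unit direction vector.

    Uses the standard spherical-to-Cartesian convention: theta is the polar
    angle measured from the z-axis, phi is the azimuth in the x-y plane.
    """
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], dtype=np.float32)
```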

The 4D output space contains the volume density $\sigma \in \mathbb{R}$ and the view-dependent RGB color $\mathbf{c} \in \mathbb{R}^3$.
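
Putting the pieces together, a simplified Keras 3 sketch of $f_\omega$ might look like the following. It takes a 3D position and a 3D unit view direction (so a 6D raw input once the angles are converted) and outputs $(\mathbf{c}, \sigma)$; the positional encoding and skip connection of the full NeRF architecture are omitted here, and the layer sizes are illustrative:

```python
import keras
from keras import layers

def build_nerf_mlp(num_layers=8, units=256):
    """Simplified f_w: (x, y, z, d_x, d_y, d_z) -> (r, g, b, sigma).

    The positional encoding and skip connection of the original NeRF
    architecture are omitted for brevity; layer sizes are illustrative.
    """
    xyz = keras.Input(shape=(3,), name="position")
    view_dir = keras.Input(shape=(3,), name="view_direction")

    # Backbone MLP over the 3D position.
    h = xyz
    for _ in range(num_layers):
        h = layers.Dense(units, activation="relu")(h)

    # Density depends on the position only.
    sigma = layers.Dense(1, activation="relu", name="sigma")(h)

    # Color additionally depends on the viewing direction.
    h_color = layers.Concatenate()([h, view_dir])
    h_color = layers.Dense(units // 2, activation="relu")(h_color)
    rgb = layers.Dense(3, activation="sigmoid", name="rgb")(h_color)

    return keras.Model(inputs=[xyz, view_dir], outputs=[rgb, sigma])

model = build_nerf_mlp()
model.summary()
```

The split of the two heads reflects NeRF’s design: density is a property of the geometry and depends only on position, while color may also vary with the viewing direction to capture specular, view-dependent effects.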