[WIP] Opening the Latent Diffusion Models “Black-Box” through JAX from Scratch

May X, 2026

MUHAMMAD GHIFARY

Latent Diffusion Models (LDMs) have redefined the state of the art in visual generative AI, powering tools from Stable Diffusion (Rombach et al. 2022) to specialized research pipelines. However, their internal mechanics, namely the move from pixel space to a latent manifold and the iterative denoising loop, often remain a “black box” to many practitioners.

For educational purposes, I created the ldmax (Latent Diffusion Models with jAX) code repository, which provides a from-scratch implementation using JAX and Flax NNX.

This article demystifies these components through the lens of that from-scratch implementation. We explore the theoretical foundations of DDPM and DDIM, the architectural shift toward Diffusion Transformers (Peebles and Xie 2023), and the practical efficiencies of scaling research on Google Cloud TPUs.

Introduction

Generative modeling is fundamentally an exercise in probability density estimation. For high-dimensional visual data, this was historically a monumental challenge. Diffusion models solved this by breaking the generative task into hundreds of tiny, manageable steps. Instead of asking a model to “draw an image”, we ask it to “remove a tiny bit of noise” gradually.

In this article, we will dissect the three primary pillars of a modern LDM: the Iterative Denoising Pipeline (DDIM), the Variational Autoencoder (VAE), and the Diffusion Transformer (DiT).

Background: From Pixels to Noise and Back

Before we dive into the mathematical mechanics, let’s consider a physical analogy. Imagine a sculptor standing before a block of marble. Deep inside that block is a statue, and the sculptor’s job is to remove the excess stone to reveal the form.

Diffusion models work in a similar way. They don’t start with a blank canvas and draw lines. Instead, they start with a “block” of random, chaotic noise — like the static on an old TV. The model has been trained to “see” a structure (like a face or a landscape) within that chaos. Over a series of many tiny steps, the model identifies a bit of noise that doesn’t belong and removes it. By repeating this process, it slowly “sculpts” a clear image out of total randomness.

Illustration of Reverse Diffusion, generated by Gemini

This transformation from chaos to clarity is what we call Reverse Diffusion. To achieve this, we first have to teach the model what “noise” looks like by taking real images and slowly destroying them with static — this is the Forward Process.

DDPM: Stochastic Foundation

Denoising Diffusion Probabilistic Models (DDPM) establish a generative framework based on two complementary processes: Forward Diffusion, which adds noise according to a fixed schedule, and Reverse Diffusion (denoising), which a model learns in order to remove that noise.

Forward Process (Markov Chain)

The forward process $q$ incrementally adds Gaussian noise to the initial data $x_0$ over $T$ timesteps. Each step is defined by a variance schedule $\beta_t$:

$$ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t \mathbf{I}) $$
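To make the transition above concrete, here is a minimal JAX sketch of a single forward step and of the full chain. The linear schedule ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ with $T = 1000$) follows the original DDPM paper and is an assumption here, as are the helper names `forward_step` and `forward_chain`; the ldmax repository may use different settings.

```python
import jax
import jax.numpy as jnp

# Assumption: linear variance schedule from the DDPM paper, T = 1000.
T = 1000
betas = jnp.linspace(1e-4, 0.02, T)


def forward_step(key, x_prev, beta_t):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = jax.random.normal(key, x_prev.shape)
    return jnp.sqrt(1.0 - beta_t) * x_prev + jnp.sqrt(beta_t) * noise


def forward_chain(key, x0, betas):
    """Run the whole Markov chain x_0 -> x_T, one noisy step at a time."""
    def body(x, inputs):
        step_key, beta_t = inputs
        return forward_step(step_key, x, beta_t), None

    # One independent PRNG key per timestep.
    keys = jax.random.split(key, betas.shape[0])
    x_T, _ = jax.lax.scan(body, x0, (keys, betas))
    return x_T


# Toy "image" batch: 4 images of 8x8 pixels with 3 channels.
x0 = jnp.ones((4, 8, 8, 3))
x_T = forward_chain(jax.random.PRNGKey(0), x0, betas)
```

With this schedule, almost nothing of `x0` survives after all $T$ steps, so `x_T` should be close to a sample of pure standard Gaussian noise.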

A key mathematical property of DDPM is that we can sample $x_t$ at any arbitrary timestep $t$ directly from $x_0$ without iterating through the entire chain. By defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, we use the reparameterization trick:

$$ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$
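The closed-form expression can be sketched in a few lines of JAX. This is an illustrative sketch rather than the ldmax code: the linear schedule and the helper name `q_sample` are assumptions, and timesteps are taken to be 1-indexed integers in $[1, T]$ to match the notation above.

```python
import jax
import jax.numpy as jnp

# Assumption: linear variance schedule from the DDPM paper, T = 1000.
T = 1000
betas = jnp.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
# alpha_bars[t - 1] holds \bar{alpha}_t, the running product of alphas.
alpha_bars = jnp.cumprod(alphas)


def q_sample(key, x0, t):
    """Sample x_t directly from x_0 for a batch of integer timesteps t in [1, T]."""
    eps = jax.random.normal(key, x0.shape)
    a_bar = alpha_bars[t - 1]
    # Reshape (batch,) -> (batch, 1, 1, ...) so it broadcasts over image dims.
    a_bar = a_bar.reshape((-1,) + (1,) * (x0.ndim - 1))
    x_t = jnp.sqrt(a_bar) * x0 + jnp.sqrt(1.0 - a_bar) * eps
    return x_t, eps


x0 = jnp.ones((4, 8, 8, 3))            # toy batch of 4 "images"
t = jnp.array([1, 250, 500, 1000])     # a different timestep per image
x_t, eps = q_sample(jax.random.PRNGKey(0), x0, t)
```

Each image in the batch gets its own timestep, so a single call covers noise levels ranging from almost clean ($t = 1$) to almost pure noise ($t = T$). The function also returns `eps`, since the sampled noise is typically needed later as a training target.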