Optimizing 3D Gaussian Splatting on TPU with JAX

April 30, 2026

MUHAMMAD GHIFARY

3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art technique for real-time radiance field rendering, offering a compelling alternative to neural volume rendering methods like NeRF. In my previous attempt, I implemented the 3DGS algorithm purely in JAX. However, that version was a naive implementation that did not fully harness JAX’s performance on accelerators, so there is still plenty of room to improve runtime.

This article discusses how to further optimize 3DGS across multiple Tensor Processing Units (TPUs). In general, the strategy involves restructuring the rasterization implementation and exploiting batched data parallelism. The jax-gs project does this by reformulating 3DGS within the JAX framework and leveraging XLA (Accelerated Linear Algebra) to compile the entire training and rendering pipeline into highly optimized machine code.

This transition from a dynamic, CUDA-centric model to a static-shape, JIT-compiled architecture allows jax-gs to exploit the massive parallel processing power of TPUs while maintaining numerical stability and structural consistency. The codebase is also research-friendly, since it benefits from JAX’s composable transformations (e.g., vmap, pmap, grad).
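To make the idea of composable transformations concrete, here is a minimal sketch that is not taken from jax-gs. It assumes a toy pinhole projection (the hypothetical project_points, with a camera given as [fx, fy, cx, cy]) in place of a real rasterizer, and shows how vmap, grad, and jit stack on top of a plain function. All array shapes are fixed up front, which is what lets XLA compile the step once and reuse it.

```python
import jax
import jax.numpy as jnp


def project_points(means, camera):
    """Project N points of shape (N, 3) to pixels with a toy pinhole camera.

    `camera` is a hypothetical 4-vector [fx, fy, cx, cy]; a real 3DGS
    pipeline would also carry extrinsics, covariances, opacities, etc.
    """
    fx, fy, cx, cy = camera[0], camera[1], camera[2], camera[3]
    x, y, z = means[:, 0], means[:, 1], means[:, 2]
    return jnp.stack([fx * x / z + cx, fy * y / z + cy], axis=-1)


def loss_fn(means, camera, target_uv):
    """Mean squared reprojection error for a single camera."""
    return jnp.mean((project_points(means, camera) - target_uv) ** 2)


# vmap: evaluate the single-camera loss for a whole batch of cameras.
# The point cloud is shared (None); cameras and targets are batched (0).
batched_loss = jax.vmap(loss_fn, in_axes=(None, 0, 0))


@jax.jit  # jit: compile the whole step with XLA for the given static shapes
def mean_loss_and_grad(means, cameras, targets):
    # grad: differentiate the batch-averaged loss w.r.t. the point positions.
    def total(m):
        return jnp.mean(batched_loss(m, cameras, targets))

    return jax.value_and_grad(total)(means)


true_means = jnp.array([[0.0, 0.0, 2.0], [1.0, -1.0, 4.0]])
cameras = jnp.array([[100.0, 100.0, 32.0, 32.0],
                     [120.0, 120.0, 32.0, 32.0]])
# Targets are the projections of the true points in every camera.
targets = jax.vmap(project_points, in_axes=(None, 0))(true_means, cameras)

# Evaluate at a perturbed point cloud so the loss and gradient are non-zero.
noisy_means = true_means + 0.1
loss, grads = mean_loss_and_grad(noisy_means, cameras, targets)
```

The same pattern extends to multiple devices by sharding the camera batch (for example with pmap), which is omitted here because it needs more than one accelerator to be meaningful.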

About Tensor Processing Units (TPUs)

Before diving into the optimization strategy, let’s briefly discuss the hardware accelerator itself. A basic understanding of TPU architecture will help us plan the optimization more effectively.

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) designed specifically to accelerate machine learning workloads. Unlike general-purpose GPUs, TPUs are architected around the requirements of deep learning, prioritizing high-throughput matrix multiplications and low-latency interconnects.

TPU Evolution

Since their debut in 2015, TPUs have evolved from specialized inference engines into the backbone of global AI:

TPU v1: A pure inference chip that powered Google Search and AlphaGo.
TPU v2: The first version capable of larger-scale training, introducing the bfloat16 format.
TPU v3: Doubled performance and introduced liquid cooling to handle the heat of massive scale.
TPU v4: Introduced 3D Torus network topology and SparseCore for embedding acceleration.
TPU v5: Split into v5e (efficient/cost-optimized) and v5p (performance flagship for models like Gemini).
Trillium / TPU v6e: A recent generation, featuring a 256x256 MXU and 4.7x the peak compute of v5e.
Ironwood / TPU7x: Employs a dual-chiplet architecture and 192GB of HBM3e, optimized for massive-scale inference and frontier “reasoning” models.

Here are the key architectural innovations in TPUs.

Systolic Array & MXU: Unlike the many-core SIMT architecture of GPUs, TPUs use a systolic array design in which data flows through a grid of multiply-accumulate cells that together form the Matrix Multiply Unit (MXU). Because operands and partial sums pass directly from cell to cell, intermediate results do not have to be written back to memory during a matrix multiplication.
Optical Circuit Switching (OCS): Starting with TPU v4, Google replaced traditional electronic switches with OCS, allowing the interconnect topology (3D Torus) to be reconfigured dynamically. This improves the resiliency of the Inter-Chip Interconnect (ICI) by routing around optical faults.