Diffusion Models Encode the Intrinsic Dimension of Data Manifolds

Figure: Score field drawing

Overview

In this work, we show, both theoretically and experimentally, that diffusion models, widely used in generative modeling, capture the intrinsic dimension of the data manifold. We introduce a method for estimating this intrinsic dimension directly from a trained diffusion model.


Problem Setup

We consider the following setup:

  1. There is a data manifold \( \mathcal{M} \) of intrinsic dimension \( k \), embedded in an ambient Euclidean space \( \mathbb{R}^d \), where \( d \gg k \).
  2. There is a probability distribution \( p \) that is highly concentrated around \( \mathcal{M} \).
  3. We are given a finite sample of data \( \{ x_i \}_{i=1}^n \subseteq \mathbb{R}^d \) generated from \( p \).

Diffusion models are generative models designed to learn \( p \), but they do not explicitly recover \( k \). We show how \( k \) can be extracted from a diffusion model trained on the data.
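
For concreteness, here is a minimal sketch of such a dataset (an illustration of ours, not taken from the paper): a 1-dimensional circle isometrically embedded in a high-dimensional ambient space, with a small amount of Gaussian noise so that the distribution is concentrated around the manifold. The function name and parameters are purely illustrative.

```python
import numpy as np

def sample_near_circle(n=5000, d=50, noise=1e-3, seed=0):
    """Toy dataset for the setup above: a 1-D circle (k = 1) embedded in R^d,
    with small isotropic noise so the distribution is concentrated around the
    manifold rather than supported exactly on it. Illustrative only."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (n, 2), lies on S^1
    # Random isometric embedding R^2 -> R^d: orthonormal columns from a QR factorization.
    Q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
    return circle @ Q.T + noise * rng.standard_normal((n, d))   # (n, d)

data = sample_near_circle()
print(data.shape)  # (5000, 50): ambient dimension d = 50, intrinsic dimension k = 1
```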

Diffusion Models and the Score Function

We consider an Itô diffusion:

\[ dx = f(x, t)\, dt + g(t)\, dW \]

The time-reversed stochastic differential equation (SDE) is:

\[ dx = \left[ f(x, t) - g(t)^2 \nabla_x \ln p_t(x) \right] dt + g(t)\, d\bar{W}, \]

where \( \bar{W} \) is a reverse-time Wiener process.

Diffusion models \( s_\theta(x, t) \) are trained to approximate the score function \( \nabla_x \ln p_t(x) \), the gradient of the log-density of the noisy data distribution at time \( t \). Even when they are trained to predict the noise \( n \) or the unperturbed image \( x_0 \), their output can be rescaled and shifted to approximate the score function.
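
For example, for a DDPM/VP-style model that predicts the noise \( n \) added as \( x_t = \alpha_t x_0 + \sigma_t n \), the score of the perturbed marginal can be recovered as \( s_\theta(x, t) \approx -\varepsilon_\theta(x, t) / \sigma_t \). The sketch below illustrates this rescaling; `eps_model` and `sigma_t` are stand-ins for whatever trained network and noise schedule the surrounding code provides.

```python
import torch

def score_from_eps(eps_model, x_t, t, sigma_t):
    """Recover a score estimate from a noise-prediction network.

    Assumes the usual Gaussian perturbation x_t = alpha_t * x_0 + sigma_t * n,
    for which grad_x log p_t(x_t) = -E[n | x_t] / sigma_t. `eps_model` and
    `sigma_t` are placeholders for the trained network and its noise schedule."""
    eps = eps_model(x_t, t)   # predicted noise, same shape as x_t
    return -eps / sigma_t     # rescale to the score of p_t
```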

Figure: Forward and reverse SDEs


Key Observation: Score Field Perpendicular to the Manifold

For small diffusion times \( \varepsilon \), the score function \( \nabla_x \ln p_{\varepsilon}(x) \) becomes approximately perpendicular to the manifold \( \mathcal{M} \).

To formalize this observation, let \( \pi(x) \) denote the projection of the point \( x \) onto \( \mathcal{M} \), and let \( \mathcal{N}_{\pi(x)}\mathcal{M} \) and \( \mathcal{T}_{\pi(x)}\mathcal{M} \) denote the normal and tangent spaces of \( \mathcal{M} \) at \( \pi(x) \). The score vector \( \nabla_x \ln p_{\varepsilon}(x) \) predominantly lies within the normal space \( \mathcal{N}_{\pi(x)}\mathcal{M} \); that is, its projection onto the normal space is significantly larger than its projection onto the tangent space:

\[ \left\| \text{Proj}_{\mathcal{N}_{\pi(x)}\mathcal{M}} \left[ \nabla_x \ln p_{\varepsilon}(x) \right] \right\| \gg \left\| \text{Proj}_{\mathcal{T}_{\pi(x)}\mathcal{M}} \left[ \nabla_x \ln p_{\varepsilon}(x) \right] \right\|. \]

This phenomenon occurs because, at small \( \varepsilon \), the probability density \( p_{\varepsilon}(x) \) is sharply concentrated around \( \mathcal{M} \): it varies much more rapidly in the normal directions than in the tangent directions, so the gradient \( \nabla_x \ln p_{\varepsilon}(x) \) points primarily towards \( \mathcal{M} \) along the normal.
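
A linear special case (an illustrative example of ours, not taken from the paper) makes the mechanism concrete. Suppose \( \mathcal{M} \) is a \( k \)-dimensional linear subspace and the perturbed density factorizes into a smooth density \( q \) along the manifold and a Gaussian of width \( \sigma_\varepsilon \) across it. Writing \( x = (x_\parallel, x_\perp) \) for the tangential and normal coordinates,

\[ p_\varepsilon(x) \;\propto\; q(x_\parallel)\, \exp\!\left( -\frac{\|x_\perp\|^2}{2\sigma_\varepsilon^2} \right) \quad\Longrightarrow\quad \nabla_x \ln p_\varepsilon(x) \;=\; \nabla_{x_\parallel} \ln q(x_\parallel) \;-\; \frac{x_\perp}{\sigma_\varepsilon^2}. \]

The tangential term is \( O(1) \), while for typical samples \( \|x_\perp\| \) is of order \( \sigma_\varepsilon \), so the normal term has magnitude of order \( 1/\sigma_\varepsilon \) and dominates as \( \sigma_\varepsilon \to 0 \).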

Figure: Score field drawing

Intrinsic Dimension Estimation Method

This key observation suggests that, by sampling sufficiently many perturbed points around a data point \( x_0 \in \mathcal{M} \), the score vectors at these points will span the normal space \( \mathcal{N}_{x_0}\mathcal{M} \). Consequently, if we collect the score vectors into a matrix and perform a singular value decomposition (SVD), we should see a drop in the spectrum exactly at the dimension of the normal space \( \mathcal{N}_{x_0}\mathcal{M} \). This insight forms the basis of the following straightforward algorithm for extracting the intrinsic dimension from a trained diffusion model.
Algorithm: Intrinsic Dimension Estimation
Input:
  • \( s_\theta \): trained diffusion model (score function)
  • \( t_0 \): small sampling time
  • \( K \): number of perturbed points
Algorithm:
  1. Sample \( \mathbf{x}_0 \) from the dataset \( p_0(\mathbf{x}) \).
  2. Set \( d = \text{dim}(\mathbf{x}_0) \).
  3. Initialize an empty matrix \( S \).
  4. For \( i = 1, \dots, K \):
    • Sample \( \mathbf{x}_{t_0}^{(i)} \sim \mathcal{N}(\mathbf{x}_0, \sigma_{t_0}^2 \mathbf{I}) \).
    • Append \( s_\theta(\mathbf{x}_{t_0}^{(i)}, t_0) \) as a new column to \( S \).
  5. Perform Singular Value Decomposition (SVD) on \( S \) to obtain singular values \( (s_i)_{i=1}^d \).
  6. Estimate \( \hat{k}(\mathbf{x}_0) = d - \arg\max_{i=1,\dots,d-1} (s_i - s_{i+1}) \).
Output:
  • Estimated intrinsic dimension \( \hat{k}(\mathbf{x}_0) \).
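
Below is a minimal PyTorch sketch of this procedure, not the authors' reference implementation; `score_model`, `sigma_t0`, and the tensor conventions (flattened data points, batched score evaluation) are assumptions about the surrounding code base.

```python
import torch

@torch.no_grad()
def estimate_intrinsic_dimension(score_model, x0, t0, sigma_t0, K=1000):
    """Estimate the local intrinsic dimension at a data point x0.

    score_model(x, t) is assumed to approximate grad_x log p_t(x); x0 is a
    flattened data point of ambient dimension d; sigma_t0 is the perturbation
    scale of the forward SDE at the small time t0. K should be at least d so
    the full singular spectrum is observed."""
    d = x0.numel()
    # Sample K perturbed points around x0 and evaluate the score at each.
    x_t0 = x0.reshape(1, d) + sigma_t0 * torch.randn(K, d)
    t = torch.full((K,), t0)
    scores = score_model(x_t0, t)        # (K, d)
    S = scores.T                         # (d, K): one score vector per column
    # Singular values in descending order; the spectrum should drop at index d - k.
    s = torch.linalg.svdvals(S)
    gaps = s[:-1] - s[1:]                # consecutive drops in the spectrum
    i = int(torch.argmax(gaps)) + 1      # 1-based index of the largest drop
    return d - i

# Hypothetical usage: k_hat = estimate_intrinsic_dimension(s_theta, x0, t0=1e-3, sigma_t0=1e-2)
```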

Theoretical Results

Using the notion of a tubular neighbourhood and tools from Morse theory, we rigorously prove the following theorem, which confirms the intuition above. Full details of the proof are provided in Appendix D of the paper.

Theorem

The ratio of the projection of the score \( \nabla_{\mathbf{x}} \ln p_t(\mathbf{x}) \) onto the tangent space \( \mathcal{T}_{\pi(\mathbf{x})}\mathcal{M} \) of the data manifold to its projection onto the normal space \( \mathcal{N}_{\pi(\mathbf{x})}\mathcal{M} \) vanishes as \( t \) approaches zero:

\[ \frac{\|\mathbf{T} \nabla_{\mathbf{x}} \ln p_t(\mathbf{x})\|}{\|\mathbf{N} \nabla_{\mathbf{x}} \ln p_t(\mathbf{x})\|} \to 0, \quad \text{as } t \to 0. \]

Here, \( \mathbf{T} \) and \( \mathbf{N} \) are the projection matrices onto \( \mathcal{T}_{\pi(\mathbf{x})}\mathcal{M} \) and \( \mathcal{N}_{\pi(\mathbf{x})}\mathcal{M} \), respectively. Thus, for sufficiently small \( t \), the score \( \nabla_{\mathbf{x}} \ln p_t(\mathbf{x}) \) is effectively contained in the normal space \( \mathcal{N}_{\pi(\mathbf{x})}\mathcal{M} \).

Experiments

We validated our method on various datasets, both synthetic and real-world, comparing it with traditional intrinsic dimension estimation techniques.

Synthetic Manifolds

\( k \)-Spheres in 100 Dimensions

We generate \( k \)-dimensional spheres embedded in \( \mathbb{R}^{100} \) using random isometric embeddings.
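
One plausible construction of such an embedding, sketched below under our own conventions (not necessarily those used in the paper): sample uniformly on \( S^k \) by normalizing a \((k+1)\)-dimensional Gaussian, then map into \( \mathbb{R}^{100} \) with a random matrix having orthonormal columns.

```python
import numpy as np

def sample_k_sphere(n, k, d=100, seed=0):
    """Sample n points uniformly from a k-sphere isometrically embedded in R^d.

    A point on S^k is drawn by normalizing a (k+1)-dimensional Gaussian; a random
    matrix with orthonormal columns (an isometry R^{k+1} -> R^d) then embeds it
    into the ambient space."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, k + 1))
    z /= np.linalg.norm(z, axis=1, keepdims=True)          # uniform on S^k
    Q, _ = np.linalg.qr(rng.standard_normal((d, k + 1)))    # orthonormal columns
    return z @ Q.T                                           # (n, d)
```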

  • 10-Sphere (\( k = 10 \)): Estimated \( \hat{k} = 11 \)
  • 50-Sphere (\( k = 50 \)): Estimated \( \hat{k} = 51 \)
Figures: 10-Sphere, 50-Sphere

Mammoth Manifold

A 2D manifold embedded in 100 dimensions. Estimated: \( \hat{k} = 2 \)
Figures: Mammoth manifold, Mammoth spectrum

Spaghetti Line in 100 Dimensions

A 1D curve embedded in \( \mathbb{R}^{100} \). Estimated: \( \hat{k} = 1 \)
Figures: Spaghetti line, Line spectrum

Union of Manifolds: 10-Sphere and 30-Sphere

We analyze a dataset composed of the union of a 10-sphere and a 30-sphere.

Figure: Union spectrum

The estimated dimensions vary locally, reflecting the manifold’s structure.
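
Because the estimate \( \hat{k}(\mathbf{x}_0) \) is computed per base point, sweeping the estimator over samples from both components recovers both dimensions. A hypothetical usage sketch, reusing the `estimate_intrinsic_dimension` function sketched earlier; `score_model`, `dataset`, and the hyperparameters are stand-ins.

```python
import collections

# Run the local estimator at many base points drawn from the union of the two
# spheres and inspect the distribution of local estimates.
estimates = [
    estimate_intrinsic_dimension(score_model, x0, t0=1e-3, sigma_t0=1e-2, K=2000)
    for x0 in dataset[:200]
]
print(collections.Counter(estimates))  # expected to concentrate around 10 and 30
```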

Figure: Union dimensions

Image Manifolds

Square Manifold

Samples from the Square Manifold for different dimensions:

Figures: Square manifold samples for \( k = 10 \), \( k = 20 \), and \( k = 100 \)

Spectra of singular values:

Figures: Square spectrum, Square spectrum (adhoc)

Gaussian Blobs Manifold

Samples from the Gaussian Blobs Manifold:

Figures: Gaussian Blobs samples for \( k = 10 \), \( k = 20 \), and \( k = 100 \)

Spectra of singular values:

Figures: Gaussian spectrum, Gaussian spectrum (adhoc)

Real-World Data: MNIST

We applied our method to the MNIST handwritten digits dataset, analyzing each digit class separately.


Figure: MNIST Score Spectra

Autoencoder Validation

Figure: Autoencoder Validation

Estimated intrinsic dimensions for each digit:

| Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------|---|---|---|---|---|---|---|---|---|---|
| Estimated \( \hat{k} \) | 113 | 66 | 131 | 120 | 107 | 129 | 126 | 100 | 148 | 152 |

Summary of Experimental Results

We compared our method with other intrinsic dimension estimation techniques.

| Dataset | Ground Truth | Ours | MLE (m=5) | MLE (m=20) | Local PCA | PPCA |
|---------|--------------|------|-----------|------------|-----------|------|
| Euclidean Data Manifolds | | | | | | |
| 10-Sphere | 10 | 11 | 9.61 | 9.46 | 11 | 11 |
| 50-Sphere | 50 | 51 | 35.52 | 34.04 | 51 | 51 |
| Spaghetti Line | 1 | 1 | 1.01 | 1.00 | 32 | 98 |
| Image Manifolds | | | | | | |
| Squares (k=10) | 10 | 11 | 8.48 | 8.17 | 10 | 10 |
| Squares (k=20) | 20 | 22 | 14.96 | 14.36 | 20 | 20 |
| Squares (k=100) | 100 | 100 | 37.69 | 34.42 | 78 | 99 |
| Gaussian Blobs (k=10) | 10 | 12 | 8.88 | 8.67 | 10 | 136 |
| Gaussian Blobs (k=20) | 20 | 21 | 16.34 | 15.75 | 20 | 264 |
| Gaussian Blobs (k=100) | 100 | 98 | 39.66 | 35.31 | 18 | 985 |
| MNIST | | | | | | |
| All Digits | N/A | 152 | 14.12 | 13.27 | 38 | 706 |

Limitations

  • Approximation Error: Caused by imperfect score approximation \( s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \ln p_t(\mathbf{x}) \).
  • Geometric Error: Arises when \( t \) isn't sufficiently small, leading to:
    • Increased tangential components of the score vector.
    • Differences in normal spaces across sampled points due to manifold curvature.

Conclusions

  • Our estimator offers accurate intrinsic dimension estimates even for high-dimensional manifolds, indicating superior statistical efficiency compared to traditional methods.
  • We attribute this improvement to the inductive biases of the neural network that estimates the score function, which is the critical quantity for intrinsic dimension estimation.
  • Our theoretical results show that the diffusion model approximates the normal bundle of the manifold, providing more information than just the intrinsic dimension.
  • We can potentially use a trained diffusion model to extract other important properties of the data manifold, such as curvature.