The space of Dirac measures is dense in Wasserstein space
\[ \newcommand{\bbR}{\mathbb{R}} \newcommand{\calD}{\mathcal{D}} \newcommand{\calP}{\mathcal{P}} \DeclareMathOperator{\argmin}{argmin} \]
1 Introduction
In the study of probability distributions, the 2-Wasserstein space of measures, often denoted as \(\mathcal{P}_2(\mathbb{R}^d)\), has emerged as a foundational structure in optimal transport and statistical inference (Villani 2003). This space supports notions of distance, geometry, and convergence that are both theoretically robust and computationally relevant.
A particularly deep yet accessible fact about this space is the following:
The set of all finitely supported probability measures (i.e., finite mixtures of point masses) is dense in \(\mathcal{P}_2(\mathbb{R}^d)\) under the 2-Wasserstein metric.
This statement has substantial consequences for both theory and practice. It legitimizes the use of discrete approximations, such as empirical distributions generated by MCMC, in statistical tasks like Bayesian inference, model aggregation, and distributional optimization. This note offers a comprehensive explanation of what this density means, how to prove it, and why it is consequential.
2 Preliminaries: The Wasserstein Space
Let \(\calP_2 (\bbR^d)\) denote the set of all Borel probability measures \(\mu\) supported on \(\bbR^d\) with finite second moments \[ \int_{\bbR^d} \|x\|^2 d\mu(x) < \infty. \] The 2-Wasserstein distance between two measures \(\mu, \nu \in \calP_2(\bbR^d)\) is defined by \[\begin{equation} W_2^2 (\mu,\nu):= \underset{\gamma \in \Gamma(\mu, \nu)}{\inf}~ \int_{\bbR^d \times \bbR^d} \|x-y\|^2 d\gamma(x,y), \end{equation}\] where \(\Gamma(\mu, \nu)\) is the set of all joint distributions, also known as couplings, with marginals \(\mu\) and \(\nu\). Intuitively, \(W_2\) measures the optimal cost of transporting one distribution into the other when cost is quadratic in displacement.
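For readers who like to compute, here is a minimal numerical sketch (not part of the original note) of how this definition plays out when both measures are finitely supported: the Kantorovich problem becomes a small linear program, solved below with SciPy. The function name `w2_discrete` and the example atoms and weights are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def w2_discrete(x, a, y, b):
    """2-Wasserstein distance between two finitely supported measures.

    x: (n, d) support points with weights a;  y: (m, d) support points with weights b.
    Solves the Kantorovich linear program over couplings gamma with marginals a and b.
    """
    n, m = len(a), len(b)
    # Cost matrix C[i, j] = ||x_i - y_j||^2, flattened row-major to match variable gamma_{ij}.
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                      # row marginals: sum_j gamma_ij = a_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                      # column marginals: sum_i gamma_ij = b_j
        A_eq[n + j, j::m] = 1.0
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return np.sqrt(res.fun)

# Two small Dirac mixtures on the plane (illustrative data).
x = np.array([[0.0, 0.0], [1.0, 0.0]]); a = np.array([0.5, 0.5])
y = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]); b = np.array([0.25, 0.5, 0.25])
print(w2_discrete(x, a, y, b))
```

The linear program has one variable per pair of atoms, so this is only practical for small supports; it is meant to make the coupling formulation tangible, not to be an efficient solver.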
This space is not only a complete and separable metric space, but it also carries a Riemannian-like geometry that underpins interpolation, gradient flow, and barycenter constructions (Otto 2001).
3 What Does “Dense” Mean in Wasserstein Geometry?
In general topology, a subset \(A\) of a metric space \((X,d)\) is called dense if for all \(x \in X\) and any \(\epsilon > 0\), there exists \(a \in A\) such that \(d(x,a) < \epsilon\). In our context, we consider the set \[ \calD := \left\lbrace \sum_{i=1}^n \lambda_i \delta_{x_i}\mid \lambda_i > 0, ~\sum_{i=1}^n \lambda_i = 1,~x_i\in\bbR^d,~ n\in \mathbb{N} \right\rbrace, \] which consists of finite convex combinations of Dirac measures.
The notion of density here (not to be confused with the density of a measure) is understood in terms of the Wasserstein metric, not in terms of weak convergence or total variation. To understand the distinction, consider the following analogy. Think of the true distribution as a smooth landscape and a discrete measure as piles of sand. The Wasserstein distance measures the work needed to move the sand to match the landscape, capturing both where the mass lies (location) and how much lies there (mass). Weak convergence is like viewing the scene from far away: you can tell whether the piles are roughly in the right places, but not how tall they are, so moments may still diverge. Total variation demands that every grain be placed exactly right, which is far too strict for sample-based approximation.
Positioned between those extremes, convergence in \(W_2\) guarantees (1) weak convergence plus convergence of second moments, and (2) convergence of expectations of functions with at most quadratic growth. These properties make it an ideal metric for statistical applications where second-moment behavior is central. With this clarified, let us now prove that \(\calD\) is dense in \((\calP_2, W_2)\).
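For a concrete illustration of the gap between these notions (a standard textbook example, not from the original note), consider the measures \[ \mu_n := \left(1-\tfrac{1}{n}\right)\delta_0 + \tfrac{1}{n}\,\delta_n \quad \text{on } \bbR. \] Then \(\mu_n \to \delta_0\) weakly, since the escaping mass \(1/n\) vanishes, yet \(\int x^2 \, d\mu_n(x) = n \to \infty\) and \(W_2^2(\mu_n, \delta_0) = \tfrac{1}{n}\cdot n^2 = n \to \infty\). Weak convergence alone controls neither second moments nor \(W_2\); this is exactly the behavior the Wasserstein metric rules out.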
4 Proof
The proof proceeds in three steps: truncation to compact support, quantization to a finitely supported measure, and a triangle-inequality argument combining the two.
4.1 Truncation to Compact Support
We want to show that finitely supported measures can approximate any measure \(\mu \in \calP_2(\bbR^d)\). For each \(R > 0\), define the closed ball of radius \(R\) centered at the origin as \[ B(0,R):=\lbrace x\in\bbR^d \mid \|x\| \leq R \rbrace. \] We construct a truncated version of \(\mu\) supported on this ball. Define the measure \(\mu_R\) by \[\begin{equation} \mu_R(A) = \frac{\mu(A\cap B(0,R))}{\mu(B(0,R))}, \end{equation}\] for all Borel sets \(A \subset \bbR^d\). This is a probability measure supported on \(B(0,R)\), provided \(\mu(B(0,R))>0\). Since \(\mu\) is a probability measure, we have \(\mu(B(0,R)) \to 1\) as \(R \to \infty\).
We now verify that \(W_2(\mu_R, \mu) \to 0\) as \(R \to \infty\). To do this, define a coupling \(\gamma_R \in \Gamma(\mu,\mu_R)\) that leaves the mass inside \(B(0,R)\) in place and reallocates the mass outside the ball according to \(\mu_R\): \[ \gamma_R := (\mathrm{id},\mathrm{id})_{\#}\, \mu\vert_{B(0,R)} + \mu\vert_{\bbR^d \backslash B(0,R)} \otimes \mu_R \in \Gamma(\mu, \mu_R). \] Its first marginal is \(\mu\vert_{B(0,R)} + \mu\vert_{\bbR^d \backslash B(0,R)} = \mu\) and its second marginal is \(\mu\vert_{B(0,R)} + (1-\mu(B(0,R)))\,\mu_R = \mu_R\), so \(\gamma_R\) is a valid coupling. The diagonal part transports mass inside \(B(0,R)\) to itself at zero cost; only the mass outside the ball is moved, and we now account for that cost.
Since \(\|x-y\|^2 \leq 2\|x\|^2 + 2\|y\|^2 \leq 2\|x\|^2 + 2R^2\) whenever \(y \in B(0,R)\), this coupling yields the bound \[ W_2^2(\mu,\mu_R) \leq \int \|x-y\|^2 d\gamma_R(x,y) \leq 2\int_{\bbR^d \backslash B(0,R)} \|x\|^2 d\mu(x) + 2R^2\left(1-\mu(B(0,R))\right). \] The first term is the contribution of the tail mass and the second term is the cost of reallocating that mass inside the ball. As \(R\to\infty\), the first term vanishes by integrability of \(\|x\|^2\), and the second term vanishes as well, since \(R^2\left(1-\mu(B(0,R))\right) \leq \int_{\bbR^d \backslash B(0,R)} \|x\|^2 d\mu(x)\). Therefore, \[ \lim_{R\to \infty} W_2 (\mu, \mu_R) = 0. \]
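As a quick numerical illustration of this step (a sketch under assumptions, not part of the proof), the following one-dimensional experiment truncates a large sample from a Student-t distribution with five degrees of freedom, which has heavy tails but a finite second moment, and estimates \(W_2(\mu, \mu_R)\) through the quantile coupling, which is optimal in one dimension. The helper `w2_1d`, the grid size, and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def w2_1d(x, y, n_grid=10_000):
    # In 1D the optimal coupling matches quantiles, so
    # W2^2 = integral over u in (0,1) of |F^{-1}(u) - G^{-1}(u)|^2 du,
    # approximated here on a regular grid of u values.
    u = (np.arange(n_grid) + 0.5) / n_grid
    return np.sqrt(np.mean((np.quantile(x, u) - np.quantile(y, u)) ** 2))

x = rng.standard_t(df=5, size=100_000)        # heavy tails but finite second moment
for R in [1.0, 2.0, 4.0, 8.0]:
    x_R = x[np.abs(x) <= R]                   # sample-level analog of mu_R: truncate and renormalize
    print(f"R={R}: mass kept={x_R.size / x.size:.4f}, W2(mu, mu_R) ~ {w2_1d(x, x_R):.4f}")
```

As \(R\) grows, the kept mass approaches one and the estimated distance shrinks toward zero, mirroring the two vanishing terms in the bound above.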
4.2 Quantization to Finitely Supported Measures
Fix \(R>0\) and let \(\mu_R\) be the compactly supported measure we just constructed. Since \(\mu_R\) is supported on a compact set \(K = B(0,R)\), we can apply a canonical quantization argument. Given \(\epsilon > 0\), partition the set \(K\) into a finite number of disjoint Borel sets \(Q_1, \ldots, Q_k\), each of diameter at most \(\epsilon\); such a partition exists because \(K\) is compact (for instance, intersect \(K\) with a grid of cubes of side length \(\epsilon/\sqrt{d}\)). For each \(i=1,\ldots,k\), choose a representative point \(x_i \in Q_i\) and define the finitely supported measure \[ \mu_R^\epsilon := \sum_{i=1}^k \mu_R (Q_i) \delta_{x_i} \in \calD. \] We can construct a coupling \(\gamma \in \Gamma(\mu_R, \mu_R^\epsilon)\) by pushing the entire mass from \(Q_i\) to \(x_i\). Then, \[ W_2^2(\mu_R, \mu_R^\epsilon) \leq \sum_{i=1}^k \int_{Q_i} \|x - x_i\|^2 d\mu_R(x) \leq \epsilon^2. \] Hence, \(W_2 (\mu_R, \mu_R^\epsilon) \leq \epsilon\).
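The quantization step is easy to mimic numerically. The sketch below (an illustration, not the proof's construction) snaps one-dimensional samples standing in for \(\mu_R\) onto a grid whose cells have diameter \(\epsilon\); the cost of the resulting "snap" coupling upper-bounds \(W_2(\mu_R, \mu_R^\epsilon)\) and stays below \(\epsilon\). The uniform samples and the rounding-based quantizer are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
R, eps = 4.0, 0.25
x = rng.uniform(-R, R, size=20_000)       # stand-in for samples from the truncated measure mu_R

# Quantize: snap each point to the nearest multiple of eps.
# In 1D each cell has diameter eps, so the snap coupling moves a point by at most eps / 2.
x_q = np.round(x / eps) * eps
snap_cost = np.sqrt(np.mean((x - x_q) ** 2))   # upper bound on W2(mu_R, mu_R^eps)
print(f"coupling cost {snap_cost:.4f} <= eps = {eps}")
```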
4.3 Convergence via Triangle Inequality
Now fix an arbitrary \(\delta > 0\). From the first step, one can choose \(R\) large enough so that \[ W_2(\mu,\mu_R) < \frac{\delta}{2}. \] From the second step, \(\epsilon\) can be chosen small enough so that \[ W_2 (\mu_R, \mu_R^\epsilon) < \frac{\delta}{2}. \] By the triangle inequality, we arrive at the following result: \[ W_2(\mu,\mu_R^\epsilon) \leq W_2(\mu,\mu_R) + W_2 (\mu_R, \mu_R^\epsilon) < \frac{\delta}{2}+\frac{\delta}{2} = \delta. \] Since \(\mu_R^\epsilon \in \calD\), we have shown that for any \(\mu \in \calP_2(\bbR^d)\) and any \(\delta > 0\), there exists a finitely supported measure \(\nu \in \calD\) such that \(W_2(\mu, \nu) < \delta\).
Thus, the set \(\calD\) is dense in \((\calP_2(\bbR^d), W_2)\).
5 Why This Matters in Bayesian Inference
The density of Dirac mixtures in Wasserstein space is not just a geometric curiosity. Indeed, it plays a foundational role in the theory and computation of Bayesian inference.
5.1 Empirical Posteriors Are Dirac Mixtures
In Bayesian analysis, the posterior distribution \(\pi(\theta\mid x)\) encodes all inferential information about the parameter \(\theta\) given the data \(x\). In most real-world problems, \(\pi(\theta\mid x)\) is analytically intractable. As a result, we approximate it using samples from methods like Markov chain Monte Carlo (MCMC), producing a sequence \(\theta_1, \ldots, \theta_N \sim \pi(\theta\mid x)\).
Once a run is done, this gives rise to the empirical posterior distribution \[ \hat{\pi}_N = \frac{1}{N}\sum_{i=1}^N \delta_{\theta_i} \in \calD, \] which is a discrete measure (a Dirac mixture) and lies in the dense subset \(\calD \subset \calP_2 (\bbR^d)\). The density result guarantees that measures of this form can approximate the posterior arbitrarily well, and under mild conditions (the posterior has a finite second moment, i.e., \(\pi(\theta\mid x) \in \calP_2(\bbR^d)\), and the sampler is suitably ergodic), the empirical posterior actually achieves this: \[ W_2(\hat{\pi}_N, \pi) \longrightarrow 0\quad\text{as }N\to \infty. \]
This implies that the convergence of the empirical posterior \(\hat{\pi}_N\) to the true posterior \(\pi(\theta\mid x)\) is not just in distribution but also in terms of second moments, which is crucial for many statistical applications1.
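To see this convergence in action, here is a small simulation (an idealized sketch: the "posterior" is a Gaussian and the "MCMC draws" are i.i.d. samples, both illustrative assumptions) that approximates \(W_2(\hat{\pi}_N, \pi)\) in one dimension via the quantile coupling as \(N\) grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
posterior = stats.norm(loc=1.0, scale=0.5)     # stand-in for an intractable posterior

u = (np.arange(20_000) + 0.5) / 20_000         # grid for the 1D quantile-coupling integral
for N in [10, 100, 1_000, 10_000]:
    theta = posterior.rvs(size=N, random_state=rng)   # idealized posterior draws
    w2 = np.sqrt(np.mean((np.quantile(theta, u) - posterior.ppf(u)) ** 2))
    print(f"N={N}: W2(empirical, true) ~ {w2:.4f}")
```

The reported distance decreases steadily with \(N\), which is exactly the behavior the density result licenses us to rely on.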
5.2 Why This Matters Practically
This convergence has powerful consequences.
- Consistency of posterior expectations: For functions of Lipschitz or quadratic growth, expectations under \(\hat{\pi}_N\) converge to expectations under \(\pi\). This ensures reliable estimation of posterior summaries including means, variances, and credible intervals (see the short numerical sketch after this list).
- Stability of statistical functionals: Quantities like credible regions, predictive distributions, Bayes factors, and risk functionals are well-approximated by their empirical analogs.
- Foundation for resampling and ensembling: The theoretical legitimacy of sampling-based posterior approximations ensures that tools like posterior bootstrap, weighted resampling, and kernel density estimation on \(\hat{\pi}_N\) can be interpreted rigorously.
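As a quick check of the first point, here is a sketch with an assumed Gaussian posterior and the illustrative test function \(f(\theta) = \theta^2 + \theta\), which has quadratic growth; it tracks how the empirical expectation approaches the exact one as \(N\) grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
posterior = stats.norm(loc=1.0, scale=0.5)      # assumed posterior, as in the sketch above

def f(t):
    return t**2 + t                              # quadratic-growth test function

# Exact value: E[theta^2] + E[theta] = (var + mean^2) + mean.
true_value = posterior.var() + posterior.mean()**2 + posterior.mean()

for N in [10, 100, 1_000, 10_000, 100_000]:
    theta = posterior.rvs(size=N, random_state=rng)
    print(f"N={N}: |E_hat[f] - E[f]| = {abs(np.mean(f(theta)) - true_value):.4f}")
```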
6 Broader Implications
The density of finitely supported measures in Wasserstein space has far-reaching implications that extend well beyond Bayesian computation. At its core, this result gives us permission to treat discrete, empirical approximations as valid proxies for continuous distributions not only heuristically, but with full mathematical justification. This has become increasingly important as modern statistics moves into spaces where distributions are themselves the objects of inference, optimization, or learning.
For instance, a wide range of statistical and machine learning tasks now involve minimizing functionals over probability distributions. Examples include optimal transport-based clustering, distributional regression, and geometric learning in Wasserstein space. These problems are naturally infinite-dimensional, yet thanks to the density of point masses, we can safely approximate them using discrete measures. Algorithms that work with samples or particles, such as those in variational inference, particle filtering, or Wasserstein generative modeling, operate entirely within this discrete subset, and the density result assures us that they remain close, in a precise metric sense, to the true solution.
Moreover, this result forms the basis for understanding the behavior of empirical distributions in a quantitative way. In high-dimensional settings, we care not just about whether empirical averages converge, but about how the distributions themselves behave under finite sampling. Rate-of-convergence results, such as those of Fournier and Guillin (2015), which provide non-asymptotic convergence rates for empirical measures in Wasserstein distance, rely on the fact that empirical distributions belong to a dense subset of the space. Without such a density property, these convergence rates would lack grounding.
It also enables the practical use of geometric concepts such as Wasserstein barycenters, Fréchet means, and transport-based medians. These objects, while defined over general distributions, are often computed from discrete empirical measures. That this is not just a computational shortcut but a theoretically valid procedure is precisely because of this density: the geometric structure of Wasserstein space remains stable under approximation by point masses.
Perhaps most importantly, the result builds a conceptual bridge between infinite-dimensional theory and finite-sample practice. In probability, statistics, and machine learning, we often formulate problems in terms of idealized objects - true distributions, functional risk, geometric flows - but ultimately implement solutions with finite data. The density of Dirac mixtures in Wasserstein space assures us that this transition from the infinite to the finite is not a compromise, but a convergence.
7 Conclusion
The fact that Dirac mixtures are dense in the Wasserstein space \(\calP_2(\bbR^d)\) is more than a technical result. It is a cornerstone of modern computational probability. It tells us that empirical approximations, such as those arising from MCMC, live in the same geometric space as the distributions they seek to estimate. It validates our use of discrete measures in place of continuous ones, not only for computing expectations or variances, but for solving full distributional problems.
This result is what allows us to define geometric estimators on sampled data, aggregate distributions across models or machines, and design optimization algorithms that operate directly on measures. It is what ensures that Wasserstein gradient flows, barycenters, and interpolation paths remain meaningful when applied to empirical distributions. It is what gives us the confidence to work with particles, while knowing we are still faithfully navigating the geometry of probability.
References
Fournier, Nicolas, and Arnaud Guillin. 2015. "On the Rate of Convergence in Wasserstein Distance of the Empirical Measure." Probability Theory and Related Fields 162 (3-4): 707-738.
Otto, Felix. 2001. "The Geometry of Dissipative Evolution Equations: The Porous Medium Equation." Communications in Partial Differential Equations 26 (1-2): 101-174.
Villani, Cédric. 2003. Topics in Optimal Transportation. Graduate Studies in Mathematics 58. Providence, RI: American Mathematical Society.
Footnotes
Let’s be honest, skewness and kurtosis, really?
Citation
@online{you2025,
author = {You, Kisung},
title = {The Space of {Dirac} Measures Is Dense in {Wasserstein}
Space},
date = {2025-05-17},
url = {https://kisungyou.com/Blog/blog_003_DiracDense.html},
langid = {en}
}