Introduction
Probability distributions with explicit density forms are core elements of statistical inference. In this post, we review the angular central Gaussian (ACG) distribution on the unit hypersphere \(\mathbb{S}^{p-1} \subset \mathbb{R}^p\) and its matrix extension, the matrix angular central Gaussian (MACG) distribution, defined on the Stiefel \(St(p,r)\) and Grassmann \(Gr(p,r)\) manifolds.
Angular Central Gaussian Distribution
On \(\mathbb{S}^{p-1}\), the ACG distribution \(ACG_p (A)\) has a density \[ f_{ACG} (x\vert A) = |A|^{-1/2} (x^\top A^{-1} x)^{-p/2} \] for \(x \in \mathbb{S}^{p-1}\) and \(A\) a symmetric positive-definite matrix, i.e., \(A=A^\top \in \mathbb{R}^{p\times p}\) with \(\lambda_{\min}(A)>0\). Let’s recap some properties of the ACG distribution.
Property 1. \(f_{ACG}(x|A) = f_{ACG}(-x|A)\). This antipodal symmetry allows the ACG to be regarded as a distribution on the real projective space \(\mathbb{R}P^{p-1} = \mathbb{S}^{p-1}/\lbrace +1, -1 \rbrace\).
Property 2. \(f_{ACG}(x|A) = f_{ACG}(x|cA),~c>0\). A common convention is to normalize the matrix \(A\) by the constraint \(\textrm{tr}(A) = p\), which is useful (or even essential) in maximum likelihood estimation of the parameter to ensure algorithmic stability. To verify this property, simply use the fact that \(|cA| = c^p|A|\), so the factor \(c^{-p/2}\) from the determinant cancels the factor \(c^{p/2}\) from the quadratic form.
Property 3. If \(x\sim \mathcal{N}_p (0,A)\), then \(x/\|x\| \sim ACG_p (A)\). This property is indeed the intuition behind the distribution’s origin (Tyler 1987) and gives a direct recipe for sampling.
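As a quick illustration, here is a minimal base-R sketch that samples from \(ACG_p(A)\) via Property 3 and evaluates the density; the helper names `racg` and `dacg` are ours for illustration only and do not refer to any particular package's API.

```r
# Minimal sketch: sampling from ACG_p(A) via Property 3 and evaluating its density.
racg <- function(n, A) {
  p <- nrow(A)
  L <- chol(A)                                # A = t(L) %*% L
  Z <- matrix(rnorm(n * p), n, p) %*% L       # rows ~ N_p(0, A)
  Z / sqrt(rowSums(Z^2))                      # project each row onto the sphere
}

dacg <- function(x, A) {
  p <- length(x)
  det(A)^(-1/2) * drop(t(x) %*% solve(A, x))^(-p/2)
}

set.seed(1)
A <- diag(c(3, 1, 1)); A <- nrow(A) * A / sum(diag(A))   # normalize so tr(A) = p
X <- racg(500, A)
all.equal(dacg(X[1, ], A), dacg(-X[1, ], A))     # Property 1: antipodal symmetry
all.equal(dacg(X[1, ], A), dacg(X[1, ], 2 * A))  # Property 2: scale invariance
```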
Maximum Likelihood Estimation
Given a random sample \(x_1, \ldots, x_n \sim ACG_p (A)\), Tyler (1987) proposed an iterative updating scheme to estimate the parameter \(A\) by
\[\hat{A}_{k+1} = p \left\lbrace \sum_{i=1}^n \frac{1}{x_i^\top \hat{A}_k^{-1} x_i} \right\rbrace^{-1} \sum_{i=1}^n \frac{x_i x_i^\top}{x_i^\top \hat{A}_k^{-1} x_i}, \tag{1}\]
where \(\hat{A}_k\) is the \(k\)-th iterate of the estimator with the identity matrix \(\hat{A}_0 = I_p\) as the initial point. While Equation 1 guarantees convergence under mild conditions and respects the constraint \(\textrm{tr}(\hat{A}_k) = p\), it originates from the author’s earlier work on \(M\)-estimation of the scatter matrix. Here, we provide a naive derivation of a two-step fixed-point iteration algorithm for pedagogical purposes:
\[\hat{A}_{k'} = \frac{p}{n}\sum_{i=1}^n \frac{x_i x_i^\top}{x_i^\top \hat{A}_k^{-1} x_i}\,\,\textrm{and}\,\, \hat{A}_{k+1} = \frac{p}{\textrm{tr}(\hat{A}_{k'})} \hat{A}_{k'}. \tag{2}\]
First, let’s write the log-likelihood \[ \log L = -\frac{n}{2}\log\det(A) - \frac{p}{2} \sum_{i=1}^n \log (x_i^\top A^{-1} x_i), \] and recall two facts from matrix calculus (Petersen and Pedersen 2012) that \[ \frac{\partial \log\det(A)}{\partial A} = A^{-1}\,\,\textrm{and}\,\, \frac{\partial x^\top A^{-1} x}{\partial A} = -A^{-1}xx^\top A^{-1}. \] Then, the first-order condition for the log-likelihood can be written as
\[ \begin{gather*} \frac{\partial \log L}{\partial A} = -\frac{n}{2} A^{-1} + \frac{p}{2} \sum_{i=1}^n \frac{A^{-1} x_i x_i^\top A^{-1}}{x_i^\top A^{-1} x_i} = 0 \\ A^{-1} = \frac{p}{n} \sum_{i=1}^n \frac{A^{-1} x_i x_i^\top A^{-1}}{x_i^\top A^{-1} x_i} \\ A = \frac{p}{n} \sum_{i=1}^n \frac{ x_i x_i^\top }{x_i^\top A^{-1} x_i} \end{gather*} \] where the last equality comes from multiplying by \(A\) on the left and right. Therefore, \(\hat{A}\) solves a matrix equation of the form \(X = f(X)\), where \(f\) is a contraction mapping under some conditions (Tyler 1987). This leads to Equation 2, where the projection step is added to keep \(\text{tr}(\hat{A}_k) = p\) for all \(k=1,2,\cdots\).
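For concreteness, the following base-R sketch implements the two-step iteration of Equation 2; it assumes the `racg` helper from the earlier sketch for generating test data, and `acg_mle` is an illustrative name rather than a reference implementation.

```r
# Sketch of the two-step fixed-point iteration in Equation 2.
# X is an (n x p) matrix whose rows are unit vectors drawn from ACG_p(A).
acg_mle <- function(X, maxiter = 100, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  A <- diag(p)                                  # initial value A_0 = I_p
  for (k in seq_len(maxiter)) {
    w    <- 1 / rowSums((X %*% solve(A)) * X)   # w_i = 1 / (x_i' A^{-1} x_i)
    Anew <- (p / n) * crossprod(X * w, X)       # (p/n) * sum_i w_i x_i x_i'
    Anew <- p * Anew / sum(diag(Anew))          # projection step: tr(A) = p
    if (norm(Anew - A, type = "F") < tol) { A <- Anew; break }
    A <- Anew
  }
  A
}

A_hat <- acg_mle(racg(2000, A))   # should be close to the true A for large n
```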
Matrix Angular Central Gaussian Distribution
Chikuse (1990) extended the distribution to the matrix case, namely to the Stiefel and Grassmann manifolds \[ \begin{gather*} St(p,r) = \{X\in \mathbb{R}^{p\times r} ~\vert~ X^\top X = I_r\}\\ Gr(p,r) = \{\text{Span}(X) ~\vert~ X \in \mathbb{R}^{p\times r},~\text{rank}(X)=r\} \end{gather*} \] which are the sets of orthonormal \(r\)-frames in \(\mathbb{R}^p\) and \(r\)-dimensional subspaces of \(\mathbb{R}^p\), respectively. The Matrix Angular Central Gaussian (MACG) distribution \(MACG_{p,r}(\Sigma)\) has a density \[ f_{MACG}(X\vert \Sigma) = |\Sigma|^{-r/2} |X^\top \Sigma^{-1} X|^{-p/2} \] where \(\Sigma \in \mathbb{R}^{p\times p}\) is a symmetric positive-definite matrix. Note that the density closely resembles the vector-valued case, and it shares analogous properties.
- Property 1. \(f_{MACG}(X|\Sigma) = f_{MACG}(-X|\Sigma)\).
- Property 2. \(f_{MACG}(X|\Sigma) = f_{MACG}(X|c\Sigma),~c>0\).
- Property 3. \(f_{MACG}(X|\Sigma) = f_{MACG}(XR|\Sigma)\) for \(R\in O(r)\). This right-invariance enables us to consider the MACG as a distribution on the Grassmann manifold, which is a quotient of the Stiefel manifold modulo orthogonal transformations.
Sampling from MACG
In order to draw random samples from \(MACG_{p,r}(\Sigma)\), we follow a construction that is common in directional statistics on Stiefel/Grassmann manifolds. First, draw \(r\) independent random vectors \(x_1,\ldots,x_r \sim \mathcal{N}_p (0,\Sigma)\) and stack them as columns \(X=[x_1|\cdots|x_r] \in \mathbb{R}^{p\times r}\). Then, \[ Y = X (X^\top X)^{-1/2} \sim MACG_{p,r}(\Sigma) \] where the inverse square root of a symmetric positive-definite matrix can be obtained from its eigendecomposition, \[ \begin{gather*} \Omega = UDU^\top \rightarrow \Omega^{-1/2} = UD^{-1/2} U^\top \\ \left[D^{-1/2}\right]_{ij} = \frac{1}{\sqrt{d_{ii}}} \textrm{ when } i = j \textrm{ and }0\textrm{ otherwise.} \end{gather*} \]
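Below is a base-R sketch of this construction for a single draw; `rmacg1` is an illustrative name, and the inverse square root is computed through the eigendecomposition just described.

```r
# Sketch: one draw from MACG_{p,r}(Sigma) by orthogonalizing r Gaussian columns.
rmacg1 <- function(Sigma, r) {
  p <- nrow(Sigma)
  L <- chol(Sigma)                               # Sigma = t(L) %*% L
  X <- t(L) %*% matrix(rnorm(p * r), p, r)       # columns ~ N_p(0, Sigma)
  ed <- eigen(crossprod(X), symmetric = TRUE)    # X'X = U D U'
  X %*% ed$vectors %*% diag(1 / sqrt(ed$values), r, r) %*% t(ed$vectors)  # X (X'X)^{-1/2}
}

set.seed(2)
Sigma <- diag(c(4, 2, 1, 1)); Sigma <- nrow(Sigma) * Sigma / sum(diag(Sigma))
Y <- rmacg1(Sigma, r = 2)
crossprod(Y)                                     # approximately I_r, so Y lies on St(p, r)
```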
Maximum Likelihood Estimation
Similar to the ACG case, given a random sample \(X_1,X_2,\ldots,X_n \sim MACG_{p,r}(\Sigma)\), we can obtain a two-step iterative scheme to estimate the parameter \(\Sigma\),
\[\begin{gather*} \hat{\Sigma}_{k'} = \frac{p}{nr} \sum_{i=1}^n X_i (X_i^\top \hat{\Sigma}_k^{-1} X_i)^{-1} X_i^\top \\ \hat{\Sigma}_{k+1} = \frac{p}{\text{tr}(\hat{\Sigma}_{k'})} \hat{\Sigma}_{k'}.\end{gather*} \tag{3}\]
The derivation of Equation 3 follows similar lines as before. We need another fact from matrix calculus, namely that \[ \frac{\partial }{\partial \Sigma} \log\det(X^\top \Sigma^{-1} X) = - \Sigma^{-1} X (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1}. \] First, the log-likelihood is written as \[ \log L = -\frac{nr}{2}\log\det(\Sigma) - \frac{p}{2}\sum_{i=1}^n \log\det (X_i^\top \Sigma^{-1} X_i), \] and the first-order condition gives \[ \begin{gather*} \frac{\partial \log L}{\partial \Sigma} = -\frac{nr}{2}\Sigma^{-1} + \frac{p}{2}\sum_{i=1}^n \left( \Sigma^{-1} X_i (X_i^\top \Sigma^{-1} X_i)^{-1} X_i^\top \Sigma^{-1} \right) = 0\\ \frac{nr}{2} \Sigma^{-1} = \frac{p}{2}\sum_{i=1}^n \left( \Sigma^{-1} X_i (X_i^\top \Sigma^{-1} X_i)^{-1} X_i^\top \Sigma^{-1} \right) \\ \Sigma = \frac{p}{nr} \sum_{i=1}^n X_i (X_i^\top \Sigma^{-1} X_i)^{-1} X_i^\top \end{gather*} \] where the last equality comes from multiplying by \(\Sigma\) on the left and right. Therefore, \(\hat{\Sigma}\) is a solution of this matrix equation, leading to Equation 3 with an additional projection step to keep \(\text{tr}(\hat{\Sigma}_k) = p\) for all \(k=1,2,\cdots\). Note that, to the best of my knowledge, it is not known whether this mapping is a contraction.
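As before, here is a minimal base-R sketch of the iteration in Equation 3; it reuses the `rmacg1` helper from the sampling sketch, and `macg_mle` is an illustrative name rather than a reference implementation.

```r
# Sketch of the two-step iteration in Equation 3.
# Xlist is a list of (p x r) matrices with orthonormal columns from MACG_{p,r}(Sigma).
macg_mle <- function(Xlist, maxiter = 100, tol = 1e-8) {
  p <- nrow(Xlist[[1]]); r <- ncol(Xlist[[1]]); n <- length(Xlist)
  S <- diag(p)                                    # initial value Sigma_0 = I_p
  for (k in seq_len(maxiter)) {
    Sinv <- solve(S)
    acc  <- matrix(0, p, p)
    for (X in Xlist) {                            # sum_i X_i (X_i' S^{-1} X_i)^{-1} X_i'
      acc <- acc + X %*% solve(crossprod(X, Sinv %*% X), t(X))
    }
    Snew <- (p / (n * r)) * acc
    Snew <- p * Snew / sum(diag(Snew))            # projection step: tr(Sigma) = p
    if (norm(Snew - S, type = "F") < tol) { S <- Snew; break }
    S <- Snew
  }
  S
}

Xlist     <- replicate(500, rmacg1(Sigma, r = 2), simplify = FALSE)
Sigma_hat <- macg_mle(Xlist)   # should be close to the true Sigma for large n
```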
Conclusion
The ACG and MACG distributions are simple yet rather underused in directional statistics. We hope this brief note makes probabilistic inference on the corresponding manifolds more accessible. The R package Riemann, available on CRAN, implements density evaluation, random sample generation, and maximum likelihood estimation of the scatter parameters \(A\) and \(\Sigma\), so that the distributions introduced here can be used with little effort.
References
Citation
@online{you2022,
author = {You, Kisung},
title = {A {Note} on {Angular} {Central} {Gaussian} {Distribution} and
Its {Matrix} {Variant}},
date = {2022-08-12},
url = {https://kisungyou.com/posts/note003-angular-gaussian},
langid = {en}
}