Sunday, 17 January 2021

Probabilistic Ball

Uniform on a Ball

When we work in statistics, higher dimensions can be quite problematic! Here is one simple example, pertaining to uniform distributions, that illustrates some of the issues one may encounter. As we will see later in more detail, when we take a ball and increase its dimension, most of its volume tends towards the boundary; in particular, most of the volume is concentrated in a skin of thickness \(O(1/n)\) when we are in \(n\) dimensions.

Take \(B^n = \{\textbf{t} \in \mathbb{R}^n \mid |\textbf{t}| \leq 1\}\). Say we have \(\textbf{X} \sim unif(B^n)\), and we are interested in knowing the distribution of \(|\textbf{X}|\). This will allow us to understand how far from the center our points lie. Note that if \(\boldsymbol{X}\) is uniform on some compact object \(\Gamma\) in Euclidean space, then:

$$
P(\boldsymbol{X} \in A) = \frac{vol(A \cap \Gamma)}{vol(\Gamma)}
$$
In our case we have for \(0< r < 1\):
$$
P(|\textbf{X}| < r) = \frac{vol(rB^n)}{vol(B^n)} = \frac{r^n vol(B^n)}{vol(B^n)} = r^n
$$
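As a quick sanity check of the \(r^n\) law, here is a minimal Monte Carlo sketch (assuming numpy; the sampler, function names and constants are my own). It samples uniformly from \(B^n\) by rejection from the enclosing cube, which is only practical in low dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def unif_ball_rejection(num, n):
    """Sample `num` points uniformly in B^n by rejecting draws from
    the cube [-1, 1]^n that fall outside the ball (fine for small n)."""
    samples = []
    while len(samples) < num:
        x = rng.uniform(-1.0, 1.0, size=n)
        if x @ x <= 1.0:
            samples.append(x)
    return np.array(samples)

n, r = 3, 0.8
pts = unif_ball_rejection(100_000, n)
print(np.mean(np.linalg.norm(pts, axis=1) < r))  # empirical P(|X| < r)
print(r ** n)                                    # theoretical value 0.512
```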
Note that if any compact object \(\Gamma\) in Euclidean space \(\mathbb{R}^n\) with \(vol(\Gamma) = \gamma\) is shrunk by a coefficient \(r \leq 1\), then \(vol(r\Gamma) = r^n\gamma\). We can show this result is true by verifying it for an n-box and then partitioning our object \(\Gamma\) into infinitesimally small boxes. From here we can quickly find an interesting property of compact objects in Euclidean space. Take our object \(\Gamma\) and some \(\epsilon > 0\) with \(\epsilon = O(\frac{1}{n})\); then
$$
\frac{vol((1-\epsilon)\Gamma)}{vol(\Gamma)} = (1-\epsilon)^n \leq e^{-\epsilon n}
$$
recalling that \((1-\frac{1}{n})^n \rightarrow e^{-1}\). With \(\epsilon = c/n\) this ratio is at most \(e^{-c}\), which becomes negligible as soon as \(c\) is large: all but a small fraction of the volume sits in the skin of thickness \(O(\frac{1}{n})\).

The above shows that \(|\textbf{X}|\) has density \(n r^{n-1}\) on \((0,1)\), i.e. \(|\textbf{X}| \sim D_1(n ; 1)\), a Beta\((n , 1)\) (quick brush-up on the Dirichlet distribution perhaps?). From this we see that as the dimension of the ball increases, the points generated by the uniform tend to approach the skin of the ball. Moreover,
$$
E[|\textbf{X}|] = \frac{n}{n+1} \rightarrow 1 \text{ as } n \rightarrow \infty
$$
and
$$
\sigma(|\textbf{X}|) = \frac{1}{n+1}\sqrt{\frac{n}{n+2}} \rightarrow 0 \text{ as } n \rightarrow \infty.
$$
This is (at least to me) quite astonishing!
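To see this concentration numerically, here is a small sketch (again assuming numpy; the names are mine). It samples uniformly from \(B^n\) using the standard trick suggested by the CDF above: a normalized Gaussian gives the direction and \(U^{1/n}\) gives the radius, since that is precisely the CDF \(r^n\) inverted:

```python
import numpy as np

rng = np.random.default_rng(0)

def unif_ball(num, n):
    """Uniform samples in B^n: Gaussian direction, radius U^(1/n)."""
    g = rng.standard_normal((num, n))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.uniform(size=(num, 1)) ** (1.0 / n)
    return directions * radii

for n in (2, 10, 100, 1000):
    r = np.linalg.norm(unif_ball(50_000, n), axis=1)
    print(n,
          r.mean(), n / (n + 1),                    # empirical vs exact mean
          r.std(), np.sqrt(n / (n + 2)) / (n + 1))  # empirical vs exact sd
```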

From here one might wonder whether the volume of the n-ball converges to 0 as n increases: a random vector in the ball is highly likely to be in the skin, and the skin, while holding most of the volume, becomes ever narrower as the dimension grows. We will see in the next adventure that the volume does converge to 0 and that \(vol(B^n) = \frac{2 \pi^{n/2}}{n\Gamma(n/2)}\). I would now like to give some insight into this behavior! The unit ball is inscribed in the cube \([-1,1]^n\) (sides of length 2). The volume of this n-cube is \(2^n\), which behaves in quite the opposite fashion to the volume of the n-ball. This means that most of the volume of the n-cube is concentrated near its vertices. Indeed, the distance from the center of the n-cube to one of its faces is always 1, whereas the distance to one of its vertices is \(\sqrt{n}\). So the cube actually becomes ("from a certain point of view": Obi-Wan Kenobi) spiky! This maybe helps visualize how these objects behave!
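To put numbers on this "spikiness", here is a tiny illustration (assuming scipy for the log-Gamma function) of the fraction of the cube \([-1,1]^n\) occupied by its inscribed unit ball, computed in log space to avoid overflow:

```python
import numpy as np
from scipy.special import gammaln

# vol(B^n) / 2^n: the share of the cube [-1, 1]^n taken by the unit ball.
for n in (1, 2, 3, 5, 10, 20, 50):
    log_vol_ball = (n / 2) * np.log(np.pi) - gammaln(n / 2 + 1)
    print(n, np.exp(log_vol_ball - n * np.log(2)))
```

The ratio collapses extremely fast: already at \(n = 20\) the ball occupies roughly \(2 \times 10^{-8}\) of the cube.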

Before moving on, I would like to point out a curious and (maybe) intuitive fact. Take \(B^n = \{x \in \mathbb{R}^n \mid |x| \leq 1\}\) and a hyperplane of dimension \(n-1\) orthogonal to one of the diameters, at distance \(0 < a < 1\) from the center of the ball (defining the distance between two objects as the shortest possible distance). The intersection of these two objects is a translate of \(\xi B^{n-1}\) with \(\xi = \sqrt{1-a^2}\): by Pythagoras, a point of the hyperplane at distance \(\rho\) from the foot of the perpendicular has norm \(\sqrt{a^2 + \rho^2}\), which is at most 1 exactly when \(\rho \leq \sqrt{1-a^2}\). I find this to be a very curious fact; you can visualize the simple case of the sphere in three dimensions, where you obtain a circle.

From the exploration on uniforms in a ball, it could be interesting to derive a general formula for the volume of the n-ball. There are plenty of methods readily available online; one of great curiosity, however, is to derive this volume from the properties of the Dirichlet distribution, establishing an elegant link between geometry and probability. The Dirichlet distribution is a multivariate extension of the Beta distribution. In Bayesian statistics, the Beta is a reasonable prior when we want to study a proportion; if there are multiple proportions to study, the Dirichlet comes in handy.

Dirichlet and the Volume of an N-Ball

NOTE: This section is dedicated to Professor David Brenner, who taught the mathematical statistics class at the University of Toronto.

Let's quickly go over the Gamma distribution and how it relates to the Beta distribution. Say:
$$
X \sim Gamma(\alpha , \beta) \text{ then } f_{X}(x) = \frac{\beta ^ \alpha}{\Gamma (\alpha)}x^{\alpha -1} e^{-\beta x}
$$
with:
$$
\Gamma(\alpha) = \int_{0}^{\infty}t^{\alpha -1} e^{-t}dt
$$
and say:
$$
Z \sim Beta(\alpha , \beta) \text{ then } f_Z(z) = \frac{\Gamma (\alpha + \beta)}{\Gamma (\alpha) \Gamma (\beta)} z^{\alpha -1} (1-z)^{\beta -1}.
$$
Is there any way to construct a Beta distribution out of Gamma distributions? We will find that there is indeed a way, and from there we can generalize the Beta to the Dirichlet. Say we have \(X \sim Gamma(\alpha , 1)\) and \(Y \sim Gamma (\beta , 1)\) where \(X\) and \(Y\) are independent. By independence, the joint density of \(X\) and \(Y\) is:
$$
f_{X,Y}(x,y) = \frac{1}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha -1} y^{\beta -1} e^{-(x+y)}.
$$
There is a resemblance between this expression and the Beta distribution. In fact, defining \(Z = \frac{X}{X+Y}\) and \(V = X + Y\) (a change of variables whose Jacobian is \(v\)), we obtain the joint density
$$f_{Z , V}(z , v) = \frac{\Gamma (\alpha + \beta)}{\Gamma (\alpha) \Gamma (\beta)} z^{\alpha -1} (1-z)^{\beta -1} \frac{1}{\Gamma(\alpha + \beta)} v^{\alpha + \beta -1} e^{-v}.
$$
Here we notice that \(Z\) is independent of \(V\), since their joint density factors, and we find that:
$$
Z \sim Beta(\alpha , \beta).
$$
This allows us to conclude that given independent \(X \sim Gamma(\alpha , 1)\) and \(Y \sim Gamma(\beta , 1)\), we have \(\frac{X}{X + Y} \sim Beta(\alpha , \beta)\).
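This construction is easy to check by simulation; a minimal sketch (assuming numpy and scipy, with parameter values chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 2.0, 5.0

x = rng.gamma(alpha, 1.0, size=100_000)  # X ~ Gamma(alpha, 1)
y = rng.gamma(beta, 1.0, size=100_000)   # Y ~ Gamma(beta, 1), independent
z = x / (x + y)

# Compare the ratio with the Beta(alpha, beta) law.
print(z.mean(), alpha / (alpha + beta))              # means should agree
print(stats.kstest(z, stats.beta(alpha, beta).cdf))  # KS test: large p-value
```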

From here we can make the extension to the Dirichlet. Take independent \(X_i \sim Gamma(p_i , 1)\) for all \(i \in \{1 , \cdots , n+1\}\), denote \(\boldsymbol{X} = (X_1 ,\cdots , X_n)'\) and \(T = \sum_{i = 1}^{n+1}X_i\), and define \(\boldsymbol{U} = \boldsymbol{X}/ T\). Then \(\boldsymbol{U} \sim D_n(p_1 , \cdots , p_n ; p_{n+1})\), and by an argument similar to the Gamma and Beta case, \(\boldsymbol{U}\) and \(T\) are independent. I would like to stress one point worth pointing out: the ratio \(\boldsymbol{X}/T\) is independent of its own denominator, which is certainly a very unique property!
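Again, a quick empirical check of both claims (numpy only; the parameters are arbitrary). We build the Dirichlet from Gamma draws, compare with numpy's own sampler, and look at the correlation between \(\boldsymbol{U}\) and its denominator \(T\):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 1.5, 2.0])          # p_1, p_2; p_3 plays p_{n+1}

x = rng.gamma(p, 1.0, size=(100_000, 3))
t = x.sum(axis=1)                      # T = sum of the Gammas
u = x[:, :2] / t[:, None]              # U = first n coordinates of X / T

u_ref = rng.dirichlet(p, size=100_000)[:, :2]
print(u.mean(axis=0), u_ref.mean(axis=0))  # should match

print(np.corrcoef(u[:, 0], t)[0, 1])       # ~ 0: U uncorrelated with T
```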

The support of the Beta distribution is \([0,1]\); similarly for the Dirichlet, each coordinate is positive and the coordinates sum to less than 1, hence the support is the n-dimensional simplex \(T^n\), that is:
$$
T^n = \{ \boldsymbol{u} \in \mathbb{R}^n \mid \boldsymbol{u} > \boldsymbol{0} , \boldsymbol{1'} \boldsymbol{u} <1\}
$$
The Dirichlet distribution can be expressed on its support as:
$$
f_U(u) = \frac{\Gamma (\sum_{i=1}^{n+1} p_i)}{\prod_{i=1}^{n+1} \Gamma (p_i)} \prod_{i=1}^{n}u_i^{p_i -1}(1-\sum_1^n u_i)^{p_{n+1} -1}
$$
The method to obtain this is similar to the one-variable case.

Now back to things that relate to uniforms! The Dirichlet allows us to define distributions on the simplex, but how does this relate to the ball?
$$
\boldsymbol{X} \sim unif(B^n) \Leftrightarrow \text{ for $A \subset B^n$ }P(\boldsymbol{X} \in A) = \frac{vol(A)}{vol(B^n)}
$$
It is therefore important for us to be able to evaluate \(vol(B^n)\). We will notice something interesting: for any point of \(B^n\), the vector of its squared coordinates lies in (the closure of) \(T^n\); perhaps we can therefore find a link between the uniform on a ball and the Dirichlet. Take \(\boldsymbol{X} \sim unif(B^n)\) and \(Y_i = X_i^2\); we can then obtain, componentwise:
$$
P(\boldsymbol{Y} \leq \boldsymbol{t}) = P(\boldsymbol{X} \in [- \sqrt{\boldsymbol{t}} , \sqrt{\boldsymbol{t}}]).
$$
The subset we are evaluating is a rectangle; for \(\boldsymbol{t} \in T^n\) it lies inside the ball and has volume \(\prod^n_1 2\sqrt{t_i}\), hence we have:
$$
F_{\boldsymbol{Y}}(\boldsymbol{t}) = \frac{2^n \prod^n_1 t_i^{1/2}}{vol(B^n)}.
$$
Differentiating with respect to each \(t_i\), we get the density:
$$
f_{\boldsymbol{Y}}(\boldsymbol{t}) = \frac{\prod_1^n t_i^{1/2 - 1}}{vol(B^n)},
$$
which is Dirichlet with \(p_i = 1/2\) for \(i \in \{1 , \cdots , n\}\) and \(p_{n+1} = 1\), noting that \(vol(B^n)\) is simply the normalizing constant for the distribution. We immediately get:
$$
vol(B^n) = \frac{\Gamma (1/2)^n}{\Gamma (n/2 +1)} = \frac{2 \pi^{n/2}}{n\Gamma(n/2)}.
$$
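The formula is easy to check against a brute-force Monte Carlo estimate (a sketch assuming numpy and scipy; for \(n = 4\) the exact value is \(\pi^2/2\)):

```python
import numpy as np
from scipy.special import gamma

def vol_ball(n):
    """Volume of the unit n-ball from the formula above."""
    return 2 * np.pi ** (n / 2) / (n * gamma(n / 2))

rng = np.random.default_rng(0)
n = 4
# Fraction of uniform cube points landing in the ball, times vol of cube.
x = rng.uniform(-1.0, 1.0, size=(1_000_000, n))
mc = 2 ** n * np.mean((x ** 2).sum(axis=1) <= 1.0)
print(vol_ball(n), mc, np.pi ** 2 / 2)
```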

The Curse of Dimensionality: A Simple Case

One interesting way to continue from uniforms generated on unit balls and their relation to the Dirichlet is to try to understand how objects behave in high dimensions. We had previously noticed that the volume of the unit ball approaches 0 as we increase the dimension. We had also seen that there is some concentration of measure on the boundary of the ball and at the corners of the cube. These behaviors of objects in higher dimensions are of great interest, in particular for Data Science and Machine Learning, where high-dimensional data is common. These concepts are sometimes summarized as the curse of dimensionality.

We will now look at the following curiosity: randomly generated vectors in high-dimensional Euclidean space are likely to be almost orthogonal! This question is, however, one that needs to be approached carefully! One reason is that orthogonality is measured with the dot product, which is also influenced by the lengths of the vectors. In order to have a true view of orthogonality, we normalize all our random vectors to unit length, which allows us to simply study the cosine.

Say we have three vectors \(u , v , w\) in the plane, uniformly distributed on the unit circle. We are interested in understanding how pairwise proximity within the couples \((u,v)\) and \((v, w)\) affects the proximity of \((w , u)\). We will say that two vectors \(a\) and \(b\) are close if \(\langle a,b\rangle \geq 0\). We can formulate the question explicitly as follows: if \(\langle u,v \rangle \geq 0\) and \(\langle v,w \rangle \geq 0\), what is the probability that \(\langle w,u \rangle \geq 0\)? Notice that since the vectors are unit length, we have \(\langle a,b \rangle = \cos(\theta)\) for \(\theta\) the angle between the vectors \(a\) and \(b\).

To solve this problem, we might first notice that if we take the vector \(u\) to be the first vector generated, its position is irrelevant. What we do care about is the position of \(v\) relative to \(u\), and of \(w\) relative to \(v\). Hence we can simplify the problem by taking \(u\) to be fixed, generating \(\theta \sim unif(\frac{-\pi}{2} , \frac{\pi}{2})\) and constructing \(v\) such that \(ang(u,v) = \theta\) (so that \(\langle u,v \rangle \geq 0\)), and similarly generating \(\xi \sim unif(\frac{-\pi}{2} , \frac{\pi}{2})\) and constructing \(w\) such that \(ang(v,w) = \xi\). We now notice that \(ang(w,u) = \xi + \theta = \phi\). We therefore want to know \(P(\frac{- \pi}{2} \leq \phi \leq \frac{\pi}{2})\).

To find this we need to spend some time on the Irwin-Hall distribution (the distribution of a sum of uniforms) and realize that our case is a linear transformation of it with \(n=2\): the sum of two independent \(unif(\frac{-\pi}{2} , \frac{\pi}{2})\) variables has the triangular density

\begin{equation}
f(\phi) =
\begin{cases}
\frac{1}{\pi} + \frac{\phi}{\pi^2}, & \text{if } -\pi \leq \phi \leq 0 \\
\frac{1}{\pi} - \frac{\phi}{\pi^2}, & \text{if } 0 \leq \phi \leq \pi
\end{cases}
\end{equation}

Integrating this density over \([\frac{-\pi}{2} , \frac{\pi}{2}]\), we find that \(P(\frac{-\pi}{2} \leq \phi \leq \frac{\pi}{2}) = \frac{3}{4}\).
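A two-line simulation confirms this (numpy, with everything named as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi / 2, np.pi / 2, size=1_000_000)
xi = rng.uniform(-np.pi / 2, np.pi / 2, size=1_000_000)
phi = theta + xi  # angle between w and u

print(np.mean(np.abs(phi) <= np.pi / 2))  # ~ 0.75
```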

The objective behind this problem was to get ourselves used to working with orthogonality in just two dimensions, in later expeditions we will approach this problem of orthogonality in high dimensions and see why it naturally arises without artificially injecting it as in this problem.

You might look at this example and find it of little interest; it has, however, a very strong basis in machine-learning applications. The problem is particularly relevant to web-page searching or to classifying email as spam or not spam. One can count the words in an email and present the counts as a vector \(\boldsymbol{x_i}\) (keeping the order of the words the same each time). The problem is then to find some vector \(\boldsymbol{v}\) such that \(\boldsymbol{v}\cdot \boldsymbol{x_i} > 0\) if the email is not spam and \(\boldsymbol{v}\cdot \boldsymbol{x_i} < 0\) if it is spam. In such a problem one can normalize the vectors so as to have them on the unit sphere, and the problem becomes quite similar to the one presented above. This problem gives rise to the Perceptron algorithm.
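For concreteness, here is a minimal sketch of the Perceptron on synthetic, linearly separable "emails" (numpy; the data and all names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic word-count-like vectors, normalized to the unit sphere,
# with labels given by a hidden separator v_true.
d, m = 20, 500
v_true = rng.standard_normal(d)
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X @ v_true)                 # +1 = not spam, -1 = spam

# Perceptron: on each mistake, nudge v toward (or away from) the example.
v = np.zeros(d)
for _ in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (v @ xi) <= 0:
            v += yi * xi
            mistakes += 1
    if mistakes == 0:
        break

print(np.mean(np.sign(X @ v) == y))     # 1.0 once the loop converges
```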
