Sunday, 17 January 2021

Sufficiency, Completeness and Unbiasedness (UMVUE)

When you create an estimator for a parameter, one aspect of interest is its precision. That is, you want your estimator to get the right answer in expectation, but you also want it not to wiggle a lot, hence you want a small variance. The objective of this section is to illustrate the process of obtaining such an estimator.

The objective of a statistic is to summarize some of the information contained in the data, which means a statistic is a function of the data. Now let's assume the data is generated from some distribution indexed by a parameter (this can be a vector). Hence we write: \(X \sim dP_{\theta_0}\) for some unique \(\theta_0 \in \Theta\), with \(X\) taking values in its sample space \(\mathscr{X}\). From this we know:


$$
P_{\theta_1} = P_{\theta_2} \text{ iff } \theta_1 = \theta_2.
$$
Extending this, we know that
$$
E_{\theta_1}[h(X)] = E_{\theta_2}[h(X)] \text{ for every (bounded, measurable) function } h(x) \text{ iff } \theta_1 = \theta_2.
$$

A possible objective would be to estimate \(\theta_0\) assuming some base distribution, such as a normal, where \(\theta_0 = (\mu_0 , \sigma_0)\). In order to build such an estimator we define a function \(f : \mathscr{X} \rightarrow \mathscr{Y}\) and set \(Y = f(X)\), where \(Y\) is our statistic. Naturally, \(Y\) has its own distribution: \(Y \sim dQ_{\theta_0}\). Note, however, that the parameter indexing \(Q\) might no longer be identifiable, since a statistic tries to capture as much information as possible but cannot always capture all of it. Take as an example \(X_1 , X_2\) \(IID\) \(N(\mu_0 , 1)\): the statistic \(X_1 - X_2 \sim N(0 , 2)\) regardless of what \(\mu_0\) is!
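
As a quick sanity check (a small simulation sketch of my own, not part of the original argument), we can verify numerically that the distribution of \(X_1 - X_2\) carries no information about \(\mu_0\):

```python
import numpy as np

# A minimal sketch: the statistic X1 - X2 throws away all information about
# mu_0, so its distribution is the same whatever mu_0 is.
rng = np.random.default_rng(0)
n_sim = 100_000

for mu in (0.0, 5.0):
    x1 = rng.normal(mu, 1.0, n_sim)
    x2 = rng.normal(mu, 1.0, n_sim)
    d = x1 - x2
    # In both cases the empirical mean is ~0 and the variance is ~2,
    # matching N(0, 2) irrespective of mu.
    print(f"mu = {mu}: mean(X1 - X2) = {d.mean():.3f}, var(X1 - X2) = {d.var():.3f}")
```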

Unbiasedness (or aiming for the Bullseye)

We now present the notion of bias. Here we deal with getting the right answer in expectation, regardless of how volatile that answer is. A statistic \(Y\) is an unbiased estimator of \(g(\theta)\) if \(E_\theta[Y] = g(\theta)\) for every \(\theta \in \Theta\). This is indeed an enjoyable property to have: it means that when you throw darts at the board, you are aiming for the bullseye, regardless of how good a thrower you are. On the flip side (the biased case), it could be that you always aim for the third ring and always manage to hit it, which would mean you have a biased estimator but low variance.
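
To make the dart-board picture concrete, here is a small sketch of my own: for data from \(N(\mu, 1)\), the sample mean is unbiased for \(\mu\), while the shrunken estimator \(0.5\bar{X}\) (a hypothetical competitor chosen just for illustration) is biased but less variable:

```python
import numpy as np

# The sample mean aims at the bullseye (unbiased), while the shrunken
# estimator 0.5 * mean is biased but has a smaller variance.
rng = np.random.default_rng(1)
mu, n, n_sim = 3.0, 10, 50_000

samples = rng.normal(mu, 1.0, size=(n_sim, n))
xbar = samples.mean(axis=1)        # unbiased estimator of mu
shrunk = 0.5 * xbar                # biased, but less variable

print(f"sample mean : bias = {xbar.mean() - mu:+.3f}, variance = {xbar.var():.3f}")
print(f"0.5 * mean  : bias = {shrunk.mean() - mu:+.3f}, variance = {shrunk.var():.3f}")
```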

Sufficiency

A sufficient statistic \(Y\) on some subset \(H\) of \(\Theta\) is such that the conditional distribution of \(X\) given \(Y\) does not depend on \(\theta\) for any \(\theta \in H\). What is the point of this? The point of sufficiency is that one can do an equally good job estimating the real \(\theta\) with either the sufficient statistic or the full data. This indeed sounds like a pretty good bang for your buck! We can also say that, with respect to the parameter \(\theta\), anything in the data beyond our sufficient statistic is redundant. We can restate the definition as follows: \(Y\) is sufficient over \(H \subset \Theta\) iff, for any function \(h\),
$$
E_{\theta_1}[h(X)|Y] \stackrel{wP_{\theta}1}{=} E_{\theta_2}[h(X)|Y]
$$

for any \(\theta_1 , \theta_2 , \theta \in H\).

One appeal of sufficiency can be expressed as follows: suppose \(Y \sim dQ_{\theta}\) is sufficient over \(H\) and \(Q_{\theta_1} = Q_{\theta_2}\). Then for any function \(g\),
$$E_{\theta_1}[g(X)] = E_{\theta_1}E_{\theta_1}[g(X)|Y] \qquad \text{ (double expectation)}$$
$$= E_{\theta_1}E_{\theta_2}[g(X)|Y] \qquad \text{ (sufficiency)}$$
$$= E_{\theta_2}E_{\theta_2}[g(X)|Y] \qquad \text{ (the inner term is a function of } Y \text{ and } Q_{\theta_1} = Q_{\theta_2}\text{)}$$
$$= E_{\theta_2}[g(X)]\Longrightarrow \theta_1 = \theta_2.$$
This implies that when our statistic is sufficient, the parameter \(\theta\) indexing its distribution is still identifiable! Note that the converse is not always true: think of two observations from a normal distribution with our statistic being the first observation alone.
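
Here is a small simulation sketch of my own for that converse example: with \(X_1, X_2\) IID \(N(\mu, 1)\), the statistic \(Y = X_1\) identifies \(\mu\), yet it is not sufficient, because the behaviour of \(X_2\) given \(X_1\) still depends on \(\mu\):

```python
import numpy as np

# With X1, X2 IID N(mu, 1), the statistic Y = X1 identifies mu (Y ~ N(mu, 1)),
# yet it is not sufficient: the conditional behaviour of the remaining data
# given Y still depends on the parameter.
rng = np.random.default_rng(2)
n_sim = 100_000

for mu in (0.0, 5.0):
    x1 = rng.normal(mu, 1.0, n_sim)
    x2 = rng.normal(mu, 1.0, n_sim)
    below = x1 < np.median(x1)   # condition on an event defined through Y = X1
    # Since X2 is independent of X1, E[X2 | X1] = mu: the conditional mean of
    # X2 changes with the parameter, so Y = X1 cannot be sufficient.
    print(f"mu = {mu}: E[X2 | X1 below its median] ≈ {x2[below].mean():.3f}")
```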

With this definition alone, it can be a bit of extra work to find out whether a statistic is sufficient. Fortunately, there is Neyman's factorization theorem!

Rao-Blackwell

The Rao-Blackwell theorem allows us to combine the notions of bias and sufficiency. The idea is that given a (perhaps crude) unbiased estimator and a sufficient statistic, we can combine them to obtain an improved unbiased estimator whose variance is no larger.

Here is the statement of the theorem: if \(U = u(X)\) is an unbiased estimator of \(g(\theta)\) and \(Y = f(X)\) is sufficient for \(\theta\), then \(\hat{U} = \hat{u}(Y) = E[u(X) | Y]\) (which, by sufficiency, does not depend on \(\theta\) and is therefore a genuine statistic) is an unbiased estimator of \(g(\theta)\) such that \(var_{\theta}(\textbf{1}'\hat{U}) \leq var_{\theta}(\textbf{1}'U)\) for \(\textbf{1}\) any fixed vector of the same dimension as \(g(\theta)\).

Let's prove this! Given the sufficiency of \(Y\) we know that:
$$
\hat{u}(Y) \stackrel{\text{$wP_{\theta}1$}}{=} E_{\theta}[u(X)|Y] \quad \forall \theta \in \Theta,
$$
which allows us to obtain unbiasedness:
$$E_\theta[\hat{u}(Y)] = E_\theta E_\theta [u(X)|Y] = E_\theta [u(X)] = g(\theta).$$
Now let's see about the variance:
$$var_{\theta}(\textbf{1}'u(X)) = E_{\theta}[\textbf{1}'u(X) - \textbf{1}'g(\theta)]^2$$
$$=E_{\theta}[\textbf{1}'u(X) - \textbf{1}' E_{\theta}[u(X)|Y]]^2 + E_{\theta}[\textbf{1}'E_{\theta}[u(X)|Y] - \textbf{1}'g(\theta)]^2$$
(the cross term vanishes once we condition on \(Y\))
$$= E_{\theta}[\textbf{1}'u(X) - \textbf{1}' E_{\theta}[u(X)|Y]]^2 + var_{\theta}(\textbf{1}'\hat{u}(Y)) \geq var_{\theta}(\textbf{1}'\hat{u}(Y)).$$
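
To see the theorem in action numerically, here is a sketch on a classic example of my own choosing (Poisson data, not from the text above): the crude unbiased estimator \(U = I_{[X_1 = 0]}\) of \(g(\lambda) = e^{-\lambda}\) is Rao-Blackwellized with the sufficient statistic \(T = \sum_i X_i\), giving \(E[U|T] = (1 - 1/n)^T\):

```python
import numpy as np

# Rao-Blackwellization on a classic Poisson example:
#   target:              g(lam) = P(X_1 = 0) = exp(-lam)
#   crude unbiased:      U = 1{X_1 = 0}
#   sufficient statistic T = sum of the X_i
#   Rao-Blackwellized:   E[U | T] = (1 - 1/n)^T
rng = np.random.default_rng(3)
lam, n, n_sim = 2.0, 10, 200_000

x = rng.poisson(lam, size=(n_sim, n))
crude = (x[:, 0] == 0).astype(float)   # U = 1{X_1 = 0}
t = x.sum(axis=1)                      # sufficient statistic
rb = (1 - 1 / n) ** t                  # E[U | T]

print(f"target exp(-lam) = {np.exp(-lam):.4f}")
print(f"crude : mean = {crude.mean():.4f}, variance = {crude.var():.5f}")
print(f"Rao-B.: mean = {rb.mean():.4f}, variance = {rb.var():.5f}")
```

Both estimators have the right mean, but the conditioned one has a much smaller variance, exactly as the inequality above promises.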

Completeness

What we have developed so far is indeed quite good already! However, take the following example in order to understand a weakness that we have not dealt with yet. Say we have a sufficient statistic \(Y\) from which we derive two different unbiased estimators of \(\theta\), \(g_1(Y)\) and \(g_2(Y)\). By unbiasedness we know that \(Eg_1(Y) = \theta\) and \(Eg_2(Y) = \theta\). From here we can ask: how different are these two unbiased estimators built from the same sufficient statistic? So far we cannot conclude that \(P[g_1(Y) - g_2(Y) = 0] = 1\), hence our estimator is not unique. Here is where completeness comes in! Completeness tells us that there is essentially only one unbiased estimator of \(\theta\) based on \(Y\). Let's now set forth the definition of completeness: our statistic \(Y = \phi(X)\) is complete iff, whenever \(E_{\theta}f(Y) = 0\) for every \(\theta \in \Theta\), it follows that \(f(Y) \stackrel{wP_{\theta}1}{=} 0\) for every \(\theta \in \Theta\). Adding the concept of completeness allows us to create a statistic which is, in a sense, optimal for estimating \(\theta\).
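
To spell out the uniqueness argument hinted at above (a short step of my own, which follows directly from the definition), apply completeness to \(f(Y) = g_1(Y) - g_2(Y)\):
$$
E_{\theta}[g_1(Y) - g_2(Y)] = \theta - \theta = 0 \quad \forall \theta \in \Theta \Longrightarrow g_1(Y) \stackrel{wP_{\theta}1}{=} g_2(Y) \quad \forall \theta \in \Theta.
$$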

Lehmann-Scheffé

Now we are at our last step! Take two unbiased estimators of \(\theta\): \(f(X)\) and \(\phi(Y)\), with \(X\) our data and \(Y\) a complete and sufficient statistic for \(\theta\). Then \(var_{\theta} \textbf{1}' \phi(Y) \leq var_{\theta} \textbf{1}' f(X)\).

We now finally have our Uniformly Minimum Variance Unbiased Estimator in \(\phi(Y)\)!

I will leave the proof of the Lehmann-Scheffé theorem as a short problem, with the following hint: you can use the Rao-Blackwell theorem, or think about \(f(Y)\), the expectation of \(f(X)\) conditioned on \(Y\), and their variances.

Finding the UMVUE for the Discrete Uniform

In this section I wish to illustrate a method for obtaining the Uniformly Minimum Variance Unbiased Estimator (UMVUE) of the parameter of the Discrete Uniform distribution. One of my motivations for writing this method up is that it is perhaps closer to a trick than a fundamental principle, hence worthwhile jotting down for future reference.

Assume \(X_1, \dots, X_n\) are IID Discrete Uniform on \(\{1, \dots, \theta\}\), so that \(P_{\theta}(X_i = k) = \frac{1}{\theta}\) for \(k = 1, \dots, \theta\). Let's take a first guess at a good estimator: one could propose that we estimate \(\theta\) using \(X_{(n)} = \max_{i}(X_1 , \dots , X_n)\). We now need to evaluate whether this is a good enough estimator. The first step will be to see if it is unbiased.

Let's obtain the distribution of \(X_{(n)}\):

$$
F_{\theta}(t) = P_{\theta}(X_{(n)} \leq t) = P_{\theta}(X_1 \leq t , \dots , X_n \leq t) = \prod_{i=1}^n P_{\theta}(X_i \leq t)
$$

The factorization is thanks to the IID assumption. Continuing,

$$
= \prod_{i=1}^n \sum_{k=1}^\theta \frac{1}{\theta} I_{[k \leq t]} = (\frac{t}{\theta})^n \quad \text{ for } t \in \{1, \dots, \theta\}
$$

Given this is the cumulative distribution function of a discrete distribution, we find the probability mass function by taking the difference of consecutive terms.

$$
p_{\theta}(t) = F_{\theta}(t) - F_{\theta}(t-1) = (\frac{t}{\theta})^n - (\frac{t-1}{\theta})^n
$$

We can now compute the expectation of \(X_{(n)}\):

$$
E_{\theta}[X_{(n)}] = \sum_{t=1}^\theta t (F(t) - F(t-1))
$$
$$
= \theta F(\theta) - F(\theta -1) - \dots -F(1) - F(0)
$$
$$
= \theta - \sum_{i=1}^{\theta -1 }(\frac{i}{\theta})^n
$$

We have shown that \(X_{(n)}\) is a biased estimator of \(\theta\): the sum above is strictly positive whenever \(\theta > 1\), so \(E_{\theta}[X_{(n)}] < \theta\). Perhaps we can find a transformation of \(X_{(n)}\) that gives an unbiased estimator.
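
Before moving on, here is a quick simulation check of my own of the expectation formula (the parameter values \(\theta = 20\), \(n = 5\) are arbitrary choices for illustration):

```python
import numpy as np

# Compare the simulated mean of the maximum with the formula
# E[X_(n)] = theta - sum_{i=1}^{theta-1} (i/theta)^n.
rng = np.random.default_rng(4)
theta, n, n_sim = 20, 5, 200_000

x = rng.integers(1, theta + 1, size=(n_sim, n))   # IID Discrete Uniform {1..theta}
x_max = x.max(axis=1)

formula = theta - sum((i / theta) ** n for i in range(1, theta))
print(f"simulated E[X_(n)] = {x_max.mean():.4f}")
print(f"formula            = {formula:.4f}")
print(f"bias vs theta      = {x_max.mean() - theta:.4f}")   # clearly negative
```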

We recall having found that \(p_{\theta}(t) = \frac{1}{\theta^n}(t^n - (t-1)^n)\), from which we know that \(\sum_{t=1}^{\theta}p_{\theta}(t) = 1 \Leftrightarrow \sum_{t=1}^{\theta}(t^n - (t-1)^n) = \theta^n\). The same telescoping identity holds with \(n+1\) in place of \(n\):

$$
\sum_{t=1}^{\theta} (t^{n+1} - (t-1)^{n+1}) = \theta^{n+1}
$$
$$
\Longrightarrow \sum_{t=1}^{\theta} \frac{t^{n+1} - (t-1)^{n+1}}{t^n - (t-1)^n} \cdot \frac{t^n - (t-1)^n}{\theta^n} = \theta
$$

We recognize the probability mass function of \(X_{(n)}\) in the second fraction inside the sum, hence we can re-express this as an expectation:

$$
E_{\theta}\left[\frac{X_{(n)}^{n+1} - (X_{(n)}-1)^{n+1}}{X_{(n)}^n - (X_{(n)}-1)^n}\right] = \theta
$$

Our unbiased estimator is therefore: \(\hat{\theta} = \frac{X_{(n)}^{n+1} - (X_{(n)}-1)^{n+1}}{X_{(n)}^n - (X_{(n)}-1)^n}\)
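
As a sanity check of my own (again with arbitrary illustrative values \(\theta = 20\), \(n = 5\)), a quick simulation confirms that the raw maximum underestimates \(\theta\) while \(\hat{\theta}\) is approximately unbiased:

```python
import numpy as np

# Verify by simulation that theta_hat = (M^(n+1) - (M-1)^(n+1)) / (M^n - (M-1)^n),
# with M = X_(n), is unbiased for theta under the Discrete Uniform {1, ..., theta}.
rng = np.random.default_rng(5)
theta, n, n_sim = 20, 5, 200_000

x = rng.integers(1, theta + 1, size=(n_sim, n))
m = x.max(axis=1).astype(float)
theta_hat = (m ** (n + 1) - (m - 1) ** (n + 1)) / (m ** n - (m - 1) ** n)

print(f"mean of raw maximum = {m.mean():.4f}   (biased below theta = {theta})")
print(f"mean of theta_hat   = {theta_hat.mean():.4f}   (approximately unbiased)")
```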

The question we now need to ask ourselves is whether this estimator is the UMVUE. By the Lehmann-Scheffé theorem, \(\hat{\theta}\) is the UMVUE if \(X_{(n)}\) is a complete and sufficient statistic (since \(\hat{\theta}\) is an unbiased function of \(X_{(n)}\)).

Sufficiency:

$$
p_\theta(X=x \mid X_{(n)} = t) = \frac{p_\theta(X=x , X_{(n)} = t)}{p_\theta(X_{(n)} = t)} = \frac{p_\theta(X=x)}{p_\theta(X_{(n)} = t)} = \frac{1/\theta^n}{(t^n - (t-1)^n)/\theta^n} = \frac{1}{t^n - (t-1)^n}
$$

for any sample value \(x\) whose maximum is \(t\). This does not depend on \(\theta\), so \(X_{(n)}\) is sufficient.

Completeness:

$$
E_{\theta}f(X_{(n)}) = \sum_{t=1}^{\theta}f(t)\left[(\frac{t}{\theta})^n - (\frac{t-1}{\theta})^n\right] = 0 \quad \forall \theta \in \{1, 2, \dots\}
$$
Taking \(\theta = 1\) gives \(f(1) = 0\); plugging this into the equation for \(\theta = 2\) gives \(f(2) = 0\), and continuing by induction over \(\theta\),
$$
\Longrightarrow f(t) = 0 \space \text{ with probability 1 for every } \theta.
$$

Now we know our estimator of \(\theta\) is indeed the UMVUE.
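
To close, here is one more simulation sketch of my own comparing \(\hat{\theta}\) with another unbiased estimator, the method-of-moments estimator \(2\bar{X} - 1\) (unbiased since \(E[X_i] = \frac{\theta + 1}{2}\)); the UMVUE should come out with the smaller variance:

```python
import numpy as np

# Compare the UMVUE with the method-of-moments estimator 2 * mean(X) - 1,
# which is also unbiased but should be more variable.
rng = np.random.default_rng(6)
theta, n, n_sim = 20, 5, 200_000

x = rng.integers(1, theta + 1, size=(n_sim, n))
m = x.max(axis=1).astype(float)
umvue = (m ** (n + 1) - (m - 1) ** (n + 1)) / (m ** n - (m - 1) ** n)
moments = 2 * x.mean(axis=1) - 1

print(f"UMVUE     : mean = {umvue.mean():.3f}, variance = {umvue.var():.3f}")
print(f"2*Xbar - 1: mean = {moments.mean():.3f}, variance = {moments.var():.3f}")
```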
