I recently came up with an idea: instead of summarizing lectures by hand, why not write the notes in markdown and post them on my blog?
Recap
By conditioning on the known value of the data y and using Bayes' theorem, we yield the posterior density:

$$p(\theta \mid y) = \frac{p(\theta)\, p(y \mid \theta)}{p(y)}$$

where:

- $p(\theta \mid y)$ is the posterior density for θ and represents the uncertainty about θ after conditioning on the data y.
- $p(\theta)$ is the prior density for θ that expresses our uncertainty about the values of θ before taking into account sample information (i.e. observed data).
- $p(y \mid \theta)$, when regarded as a function of θ for fixed y, is the likelihood function.
- $p(y)$ is the marginal density of the data y, normally written as:

$$p(y) = \int_{\Theta} p(\theta)\, p(y \mid \theta)\, d\theta$$

Or, by omitting the normalizing constant $p(y)$, we have the unnormalized posterior density:

$$p(\theta \mid y) \propto p(\theta)\, p(y \mid \theta)$$
The parameters that control the prior distribution are called hyperparameters.
In a Bayesian analysis, we first need to represent our prior beliefs about θ, by constructing a probability distribution p(θ) which encapsulates our beliefs.
p(θ) will not be the same for different people as they may have different knowledge about what proportion of coins are biased.
In some cases, p(θ) may be based on subjective judgement, while in others it may be based on objective evidence. This is the essence of Bayesian statistics - probabilities express degrees of beliefs.
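As a concrete sketch of the recap above, the unnormalized posterior can be evaluated on a grid and normalized numerically. The coin setup here is my own illustration: a Beta(2, 2) prior (hyperparameters a = b = 2) and data of 7 heads in 10 tosses are arbitrary choices, not from the lecture.

```python
import numpy as np

# Grid approximation of the posterior for a coin's bias theta.
# Prior: Beta(2, 2), i.e. hyperparameters a = b = 2 (illustrative choice).
# Data: y = 7 heads out of n = 10 tosses.
theta = np.linspace(0.001, 0.999, 999)
a, b = 2.0, 2.0
prior = theta**(a - 1) * (1 - theta)**(b - 1)   # p(theta), up to a constant
likelihood = theta**7 * (1 - theta)**3          # p(y | theta), up to a constant
unnormalized = prior * likelihood               # p(theta) * p(y | theta)

# Normalizing over the grid plays the role of dividing by p(y).
posterior = unnormalized / unnormalized.sum()
posterior_mean = (theta * posterior).sum()
```

By conjugacy the exact posterior is Beta(9, 5), so `posterior_mean` should sit close to 9/14 ≈ 0.643, which is a quick sanity check on the grid approximation.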
Decision Theory
Bayesian decision theory
- concerned with making decisions that perform best, based on the information we have about the unknowns.
Basic Elements of a Decision Problem
$\Theta$ is the parameter space, which consists of all possible "states of nature" or "states of the world", only one of which will occur. The "true" state of nature is unknown.
$\mathcal{A}$ is the action space, which is the set of all possible actions available, $a \in \mathcal{A}$.
$\mathcal{X}$ is the sample space, which contains all possible realisations $x$ of a random variable $X$ whose distribution belongs to the family $\{f(x \mid \theta) : \theta \in \Theta\}$.
$L(\theta, a)$ is a loss function that has the domain $\Theta \times \mathcal{A}$ and codomain $\mathbb{R}$. That is, a loss function maps each combination of state of the world θ and action a onto a numerical loss. For technical convenience, the loss is usually assumed to be bounded below.
Loss Function
The loss function $L(\theta, a)$ is a core element of decision making; it represents the loss incurred if we choose action a when the true state of the world is θ (which is usually unknown).
The losses corresponding to each action $a$ and state of the world θ can be represented by a loss matrix:

|            | $a_1$ | $a_2$ |
| ---------- | ----- | ----- |
| $\theta_1$ | 0     | 10    |
| $\theta_2$ | 1     | 0     |

which fully specifies the loss function for all values of θ and a.
If $\pi(\theta)$ is the believed probability distribution of θ at the time of decision making, the Bayesian expected loss of an action $a$ is:

$$\rho(\pi, a) = E^{\pi}[L(\theta, a)] = \int_{\Theta} L(\theta, a)\, dF^{\pi}(\theta)$$

where $F^{\pi}$ is the c.d.f. of θ under $\pi$.
The Conditional Bayes Principle
Choose an action $a \in \mathcal{A}$ which minimizes the Bayesian expected loss $\rho(\pi, a)$ (assuming the minimum is attained). Such an action will be called a Bayes action and will be denoted $a^{\pi}$.
Note: In a multiclass classification problem, we shall use the One vs Rest method to keep things simple.
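Using the loss matrix from the loss-function section, the conditional Bayes principle reduces to a weighted average and an argmin. This is a minimal sketch; the prior π = (0.7, 0.3) over the two states is an arbitrary illustration, not from the notes.

```python
import numpy as np

# Loss matrix from the example: rows are states theta_1, theta_2;
# columns are actions a_1, a_2.
L = np.array([[0.0, 10.0],
              [1.0, 0.0]])

# Believed distribution pi over the states (illustrative values).
pi = np.array([0.7, 0.3])

# Bayesian expected loss rho(pi, a) for each action:
# sum over theta of pi(theta) * L(theta, a).
expected_loss = pi @ L                         # -> [0.3, 7.0]

# The Bayes action minimizes the expected loss.
bayes_action = int(np.argmin(expected_loss))   # -> 0, i.e. choose a_1
```

Even though $a_1$ can lose 1 under $\theta_2$, it is the Bayes action here because the large loss of 10 for $a_2$ under the more probable state $\theta_1$ dominates the average.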
Frequentist Risk
A decision rule $\delta(x)$ is a function from $\mathcal{X}$ into $\mathcal{A}$. Given a particular realization $X = x$, $\delta(x)$ is the action that will be taken.
Two decision rules, $\delta_1$ and $\delta_2$, are said to be equivalent if $P_{\theta}(\delta_1(X) = \delta_2(X)) = 1$ for all θ.
The risk function of a decision rule $\delta(x)$ is defined by:

$$R(\theta, \delta) = E^{X}_{\theta}[L(\theta, \delta(X))] = \int_{\mathcal{X}} L(\theta, \delta(x))\, dF^{X}(x \mid \theta)$$
It is natural to use a decision rule which has smallest risk $R(\theta, \delta)$. However, in contrast to the Bayesian expected loss, the risk is a function of θ, and hence it is not a single number.
Since θ is unknown, the meaning of "smallest" is not clearly defined, so we need another way to choose a decision rule.
A decision rule $\delta_1$ is R-better than a decision rule $\delta_2$ if $R(\theta, \delta_1) \leq R(\theta, \delta_2)$ for all $\theta \in \Theta$, with strict inequality for some θ. A decision rule $\delta_1$ is R-equivalent to a decision rule $\delta_2$ if $R(\theta, \delta_1) = R(\theta, \delta_2)$ for all θ.
A decision rule is said to be admissible if there does not exist an R-better decision rule. A decision rule is inadmissible if there does exist an R-better decision rule.
It’s clear that we shall never use an inadmissible decision rule, but the class of admissible decision rules for a given decision problem can be large. This means that there will be admissible rules with risk functions that are “better” in some regions of the parameter space $\Theta$, and “worse” in others, i.e. the risk functions cross.
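To see risk functions crossing concretely, here is a sketch under my own illustrative setup (not from the notes): squared-error loss, $X \sim \text{Binomial}(10, \theta)$, comparing the sample proportion $\delta_1(x) = x/10$ with the constant rule $\delta_2(x) = 0.5$ that ignores the data.

```python
import numpy as np

# Squared-error loss L(theta, a) = (theta - a)^2, X ~ Binomial(n, theta).
# delta1(x) = x/n (sample proportion); delta2(x) = 0.5 (ignores the data).
n = 10
thetas = np.linspace(0.0, 1.0, 101)

# R(theta, delta1) = Var(X/n) = theta(1 - theta)/n, since X/n is unbiased.
risk1 = thetas * (1 - thetas) / n

# R(theta, delta2) = (theta - 0.5)^2: no variance, pure squared bias.
risk2 = (thetas - 0.5) ** 2

# The curves cross: delta2 is better near theta = 0.5, delta1 wins in the
# tails, so neither rule is R-better than the other.
```

Plotting `risk1` and `risk2` against `thetas` shows the crossing: each rule is admissible within this pair, which is exactly why the risk function alone cannot pick a winner.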
Randomized decision rule
So far, we have considered deterministic decision rules.
That is, given a particular realization $x \in \mathcal{X}$, a deterministic decision rule is a function from $\mathcal{X}$ into $\mathcal{A}$. However, imagine that we are competing with an intelligent competitor; then decisions may have to be taken in a randomised manner.
A randomized decision rule $\delta^*(x, \cdot)$ is a probability distribution on $\mathcal{A}$. That is, given that $X = x$ is observed, $\delta^*(x, A)$ is the probability that an action in $A \subseteq \mathcal{A}$ will be chosen.
Note: deterministic decision rules can be considered as a special case of randomized rules.
In the absence of data, a randomized decision rule is also called a randomized action, which is denoted as $\delta^*(\cdot)$. It is also a probability distribution on $\mathcal{A}$.
Similar to before, the loss function of the randomized rule $\delta^*$ is:

$$L(\theta, \delta^*(x, \cdot)) = E^{\delta^*(x, \cdot)}[L(\theta, a)]$$

And the risk function of $\delta^*$ with the loss function L is:

$$R(\theta, \delta^*) = E^{X}_{\theta}[L(\theta, \delta^*(X, \cdot))]$$

For a no-data decision problem, we have $R(\theta, \delta^*) = L(\theta, \delta^*)$.
Usefulness of randomized decision
- How often do decision problems involve an intelligent opponent?
- Whenever possible, each possible action has to be evaluated in order to find the optimal action:
- If there is only one optimal action, then randomizing is of limited use.
- If there are 2 or more optimal actions, one could potentially choose at random, although the usefulness of doing so is questionable.
Frequentist Decision Principles
We have seen that using risk functions to select a decision rule does not always produce a clear final choice. To overcome this limitation, we must introduce additional principles in order to select a specific decision rule.
The Bayes Risk Principle
The Bayes risk of a decision rule $\delta$, with respect to a prior distribution $\pi$ on $\Theta$, is defined as:

$$r(\pi, \delta) = E^{\pi}[R(\theta, \delta)]$$
A decision rule $\delta_1$ is preferred to a rule $\delta_2$ if:

$$r(\pi, \delta_1) < r(\pi, \delta_2)$$
A decision rule is said to be optimal if it minimizes $r(\pi, \delta)$. This decision rule is called a Bayes rule, and will be denoted $\delta^{\pi}$.
The quantity $r(\pi) = r(\pi, \delta^{\pi})$ is then called the Bayes risk for $\pi$.
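As a sketch of the Bayes risk principle (the Binomial setting, squared-error loss, and the uniform prior are my own illustrative choices): for $X \sim \text{Binomial}(10, \theta)$, compare $\delta_1(x) = x/10$ with the constant rule $\delta_2(x) = 0.5$, whose risk functions cross, by averaging each risk over a uniform prior on θ.

```python
import numpy as np

# Bayes risk r(pi, delta) = E^pi[R(theta, delta)] under a uniform prior
# on [0, 1], approximated by a Riemann sum over a fine grid.
thetas = np.linspace(0.0, 1.0, 100001)
dtheta = thetas[1] - thetas[0]

risk1 = thetas * (1 - thetas) / 10    # R(theta, delta1), delta1(x) = x/10
risk2 = (thetas - 0.5) ** 2           # R(theta, delta2), delta2(x) = 0.5

r1 = (risk1 * dtheta).sum()   # ~ 1/60: integral of theta(1 - theta)/10
r2 = (risk2 * dtheta).sum()   # ~ 1/12: integral of (theta - 0.5)^2
```

Although neither rule is R-better than the other pointwise, $r(\pi, \delta_1) \approx 1/60 < 1/12 \approx r(\pi, \delta_2)$, so the Bayes risk principle prefers $\delta_1$ under this prior: averaging over θ collapses the risk function to a single number that can be compared.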
The Minimax Principle
Let $\delta^* \in \mathcal{D}^*$ be a randomized decision rule; then the worst case possible using this decision rule is:

$$\sup_{\theta \in \Theta} R(\theta, \delta^*)$$
In order to protect against the worst-case scenario, one should use the minimax principle.
The Minimax Principle:
A decision rule $\delta^*_1$ is preferred to a rule $\delta^*_2$ if:

$$\sup_{\theta} R(\theta, \delta^*_1) < \sup_{\theta} R(\theta, \delta^*_2)$$
A decision rule $\delta^{*M}$ is a minimax decision rule if it minimizes $\sup_{\theta} R(\theta, \delta^*)$ among all randomized rules in $\mathcal{D}^*$, that is, if:

$$\sup_{\theta} R(\theta, \delta^{*M}) = \inf_{\delta^* \in \mathcal{D}^*} \sup_{\theta} R(\theta, \delta^*)$$
For a no-data decision problem, the minimax decision rule is simply called the minimax action.
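To see why randomization can help under the minimax principle, consider the no-data problem with the loss matrix from the loss-function section. The grid search below is my own illustration: a randomized action puts probability p on $a_1$ and $1 - p$ on $a_2$, and we minimize the worst-case expected loss over p.

```python
import numpy as np

# No-data problem with the 2x2 loss matrix from earlier:
# rows are states theta_1, theta_2; columns are actions a_1, a_2.
L = np.array([[0.0, 10.0],
              [1.0, 0.0]])

# A randomized action puts probability p on a_1 and 1 - p on a_2.
ps = np.linspace(0.0, 1.0, 100001)

# Expected loss under each state for every mixture p:
# under theta_1: 10(1 - p); under theta_2: p.
exp_loss = L[:, [0]] * ps + L[:, [1]] * (1 - ps)   # shape (2, len(ps))

worst_case = exp_loss.max(axis=0)    # sup over theta for each mixture
p_star = ps[worst_case.argmin()]     # ~ 10/11: equalizes the two losses
minimax_value = worst_case.min()     # ~ 10/11
```

The best deterministic action has a worst-case loss of 1 (choose $a_1$), while the mixture with $p \approx 10/11$ achieves a worst-case expected loss of $10/11 < 1$, so the minimax randomized action strictly beats every deterministic one in this problem.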