I recently came up with an idea: instead of summarizing lectures by hand, why not write the notes in markdown and post them on my blog?
Recap
By conditioning on the known value of the data y and using Bayes' theorem, we yield the posterior density:

$$p(\theta \mid y) = \frac{p(\theta)\, p(y \mid \theta)}{p(y)}$$

where:

- $p(\theta \mid y)$ is the posterior density for θ and represents the uncertainty about θ after conditioning on the data y.
- $p(\theta)$ is the prior density for θ that expresses our uncertainty about the values of θ before taking into account sample information (i.e. observed data).
- $p(y \mid \theta)$, when regarded as a function of θ for fixed y, is the likelihood function.
- $p(y)$ is the marginal density of the data y, normally written as:

$$p(y) = \int_{\Theta} p(\theta)\, p(y \mid \theta)\, d\theta$$

Or, by omitting the normalizing constant $p(y)$, we have the unnormalized posterior density:

$$p(\theta \mid y) \propto p(\theta)\, p(y \mid \theta)$$
The parameters that control the prior distribution are called hyperparameters.
In a Bayesian analysis, we first need to represent our prior beliefs about θ, by constructing a probability distribution p(θ) which encapsulates our beliefs.
p(θ) will not be the same for different people as they may have different knowledge about what proportion of coins are biased.
In some cases, p(θ) may be based on subjective judgement, while in others it may be based on objective evidence. This is the essence of Bayesian statistics - probabilities express degrees of beliefs.
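As a concrete sketch of the recap above, the unnormalized posterior can be evaluated on a grid and normalized numerically. The coin setup here is my own illustration: a Beta(2, 2) prior (hyperparameters a = b = 2) and data of 7 heads in 10 tosses are arbitrary choices, not from the lecture.

```python
import numpy as np

# Grid approximation of the posterior for a coin's bias theta.
# Prior: Beta(2, 2), i.e. hyperparameters a = b = 2 (illustrative choice).
# Data: y = 7 heads out of n = 10 tosses.
theta = np.linspace(0.001, 0.999, 999)
a, b = 2.0, 2.0
prior = theta**(a - 1) * (1 - theta)**(b - 1)   # p(theta), up to a constant
likelihood = theta**7 * (1 - theta)**3          # p(y | theta), up to a constant
unnormalized = prior * likelihood               # p(theta) * p(y | theta)

# Normalizing over the grid plays the role of dividing by p(y).
posterior = unnormalized / unnormalized.sum()
posterior_mean = (theta * posterior).sum()
```

By conjugacy the exact posterior is Beta(9, 5), so `posterior_mean` should sit close to 9/14 ≈ 0.643, which is a quick sanity check on the grid approximation.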
Decision Theory
Bayesian decision theory
- concerned with making decisions that perform best, based on the information we have about the unknowns.
Basic Elements of a Decision Problem
$\Theta$ is the parameter space, which consists of all possible "states of nature" or "states of the world", only one of which will occur. The "true" state of nature is unknown.
$\mathcal{A}$ is the action space, which is the set of all possible actions available, $a \in \mathcal{A}$.
$\mathcal{X}$ is the sample space, which contains all possible realisations $x$ of a random variable $X$ whose distribution belongs to the family $\{f(x \mid \theta) : \theta \in \Theta\}$.
$L(\theta, a)$ is a loss function that has the domain $\Theta \times \mathcal{A}$ and codomain $\mathbb{R}$. That is, a loss function maps each combination of state of the world θ and action a onto a numerical loss. For technical convenience, the loss is usually assumed to be bounded below.
Loss Function
The loss function $L(\theta, a)$ is a core element of decision making; it represents the loss incurred if we choose action a when the true state of the world is θ (which is usually unknown).
The losses corresponding to each action $a$ and state of the world θ can be represented by a loss matrix:

|            | $a_1$ | $a_2$ |
| ---------- | ----- | ----- |
| $\theta_1$ | 0     | 10    |
| $\theta_2$ | 1     | 0     |

which fully specifies the loss function for all values of θ and a.
If $\pi(\theta)$ is the believed probability distribution of θ at the time of decision making, the Bayesian expected loss of an action $a$ is:

$$\rho(\pi, a) = E^{\pi}[L(\theta, a)] = \int_{\Theta} L(\theta, a)\, dF^{\pi}(\theta)$$

where $F^{\pi}$ is the c.d.f. of θ under $\pi$.
The Conditional Bayes Principle
Choose an action $a \in \mathcal{A}$ which minimizes the Bayesian expected loss $\rho(\pi, a)$ (assuming the minimum is attained). Such an action will be called a Bayes action and will be denoted $a^{\pi}$.
Note: In a multiclass classification problem, we shall use the One vs Rest method to keep things simple.
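Using the loss matrix from the loss-function section, the conditional Bayes principle reduces to a weighted average and an argmin. This is a minimal sketch; the prior π = (0.7, 0.3) over the two states is an arbitrary illustration, not from the notes.

```python
import numpy as np

# Loss matrix from the example: rows are states theta_1, theta_2;
# columns are actions a_1, a_2.
L = np.array([[0.0, 10.0],
              [1.0, 0.0]])

# Believed distribution pi over the states (illustrative values).
pi = np.array([0.7, 0.3])

# Bayesian expected loss rho(pi, a) for each action:
# sum over theta of pi(theta) * L(theta, a).
expected_loss = pi @ L                         # -> [0.3, 7.0]

# The Bayes action minimizes the expected loss.
bayes_action = int(np.argmin(expected_loss))   # -> 0, i.e. choose a_1
```

Even though $a_1$ can lose 1 under $\theta_2$, it is the Bayes action here because the large loss of 10 for $a_2$ under the more probable state $\theta_1$ dominates the average.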
Frequentist Risk
A decision rule $\delta(x)$ is a function from $\mathcal{X}$ into $\mathcal{A}$. Given a particular realization $X = x$, $\delta(x)$ is the action that will be taken.
Two decision rules, $\delta_1$ and $\delta_2$, are said to be equivalent if $P_{\theta}(\delta_1(X) = \delta_2(X)) = 1$ for all θ.
The risk function of a decision rule $\delta(x)$ is defined by:

$$R(\theta, \delta) = E^{X}_{\theta}[L(\theta, \delta(X))] = \int_{\mathcal{X}} L(\theta, \delta(x))\, dF^{X}(x \mid \theta)$$
It is natural to use a decision rule which has smallest risk $R(\theta, \delta)$. However, in contrast to the Bayesian expected loss, the risk is a function of θ, and hence it is not a single number.
Since θ is unknown, the meaning of "smallest" is not clearly defined, so we need another way to choose a decision rule.
A decision rule $\delta_1$ is R-better than a decision rule $\delta_2$ if $R(\theta, \delta_1) \leq R(\theta, \delta_2)$ for all $\theta \in \Theta$, with strict inequality for some θ. A decision rule $\delta_1$ is R-equivalent to a decision rule $\delta_2$ if $R(\theta, \delta_1) = R(\theta, \delta_2)$ for all θ.
A decision rule is said to be admissible if there does not exist an R-better decision rule. A decision rule is inadmissible if there does exist an R-better decision rule.
It’s clear that we shall never use an inadmissible decision rule, but the class of admissible decision rules for a given decision problem can be large. This means that there will be admissible rules with risk functions that are “better” in some regions of the parameter space $\Theta$, and “worse” in others, i.e. the risk functions cross.
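To see risk functions crossing concretely, here is a sketch under my own illustrative setup (not from the notes): squared-error loss, $X \sim \text{Binomial}(10, \theta)$, comparing the sample proportion $\delta_1(x) = x/10$ with the constant rule $\delta_2(x) = 0.5$ that ignores the data.

```python
import numpy as np

# Squared-error loss L(theta, a) = (theta - a)^2, X ~ Binomial(n, theta).
# delta1(x) = x/n (sample proportion); delta2(x) = 0.5 (ignores the data).
n = 10
thetas = np.linspace(0.0, 1.0, 101)

# R(theta, delta1) = Var(X/n) = theta(1 - theta)/n, since X/n is unbiased.
risk1 = thetas * (1 - thetas) / n

# R(theta, delta2) = (theta - 0.5)^2: no variance, pure squared bias.
risk2 = (thetas - 0.5) ** 2

# The curves cross: delta2 is better near theta = 0.5, delta1 wins in the
# tails, so neither rule is R-better than the other.
```

Plotting `risk1` and `risk2` against `thetas` shows the crossing: each rule is admissible within this pair, which is exactly why the risk function alone cannot pick a winner.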
Randomized decision rule
So far, we have considered deterministic decision rules.
That is, given a particular realization $x \in \mathcal{X}$, a deterministic decision rule is a function from $\mathcal{X}$ into $\mathcal{A}$. However, imagine that we are competing with an intelligent competitor; then decisions may have to be taken in a randomised manner.
A randomized decision rule $\delta^*(x, \cdot)$ is a probability distribution on $\mathcal{A}$. That is, given that $X = x$ is observed, $\delta^*(x, A)$ is the probability that an action in $A \subseteq \mathcal{A}$ will be chosen.
Note: deterministic decision rules can be considered as a special case of randomized rules.
In the absence of data, a randomized decision rule is also called a randomized action, which is denoted as $\delta^*(\cdot)$. It is also a probability distribution on $\mathcal{A}$.
Similar to before, the loss function of the randomized rule $\delta^*$ is:

$$L(\theta, \delta^*(x, \cdot)) = E^{\delta^*(x, \cdot)}[L(\theta, a)]$$

And the risk function of $\delta^*$ with the loss function L is:

$$R(\theta, \delta^*) = E^{X}_{\theta}[L(\theta, \delta^*(X, \cdot))]$$

For a no-data decision problem, we have $R(\theta, \delta^*) = L(\theta, \delta^*)$.
Usefulness of randomized decision
- How often do decision problems involve an intelligent opponent?
- Whenever possible, each possible action has to be evaluated in order to find the optimal action:
- If there is only one optimal action, then randomizing is of limited use.
- If there are 2 or more optimal actions, one could potentially choose at random, although the usefulness of doing so is questionable.
Frequentist Decision Principles
We have seen that using risk functions to select a decision rule does not always produce a clear final choice. To overcome this limitation, we must introduce additional principles in order to select a specific decision rule.
The Bayes Risk Principle
The Bayes risk of a decision rule $\delta$, with respect to a prior distribution $\pi$ on $\Theta$, is defined as:

$$r(\pi, \delta) = E^{\pi}[R(\theta, \delta)]$$
A decision rule $\delta_1$ is preferred to a rule $\delta_2$ if:

$$r(\pi, \delta_1) < r(\pi, \delta_2)$$
A decision rule is said to be optimal if it minimizes $r(\pi, \delta)$. This decision rule is called a Bayes rule, and will be denoted $\delta^{\pi}$.
The quantity $r(\pi) = r(\pi, \delta^{\pi})$ is then called the Bayes risk for $\pi$.
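As a sketch of the Bayes risk principle (the Binomial setting, squared-error loss, and the uniform prior are my own illustrative choices): for $X \sim \text{Binomial}(10, \theta)$, compare $\delta_1(x) = x/10$ with the constant rule $\delta_2(x) = 0.5$, whose risk functions cross, by averaging each risk over a uniform prior on θ.

```python
import numpy as np

# Bayes risk r(pi, delta) = E^pi[R(theta, delta)] under a uniform prior
# on [0, 1], approximated by a Riemann sum over a fine grid.
thetas = np.linspace(0.0, 1.0, 100001)
dtheta = thetas[1] - thetas[0]

risk1 = thetas * (1 - thetas) / 10    # R(theta, delta1), delta1(x) = x/10
risk2 = (thetas - 0.5) ** 2           # R(theta, delta2), delta2(x) = 0.5

r1 = (risk1 * dtheta).sum()   # ~ 1/60: integral of theta(1 - theta)/10
r2 = (risk2 * dtheta).sum()   # ~ 1/12: integral of (theta - 0.5)^2
```

Although neither rule is R-better than the other pointwise, $r(\pi, \delta_1) \approx 1/60 < 1/12 \approx r(\pi, \delta_2)$, so the Bayes risk principle prefers $\delta_1$ under this prior: averaging over θ collapses the risk function to a single number that can be compared.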
The Minimax Principle
Let $\delta^* \in \mathcal{D}^*$ be a randomized decision rule; then the worst case possible using this decision rule is:

$$\sup_{\theta \in \Theta} R(\theta, \delta^*)$$
In order to protect against the worst-case scenario, one should use the minimax principle.
The Minimax Principle:
A decision rule $\delta^*_1$ is preferred to a rule $\delta^*_2$ if:

$$\sup_{\theta} R(\theta, \delta^*_1) < \sup_{\theta} R(\theta, \delta^*_2)$$
A decision rule $\delta^{*M}$ is a minimax decision rule if it minimizes $\sup_{\theta} R(\theta, \delta^*)$ among all randomized rules in $\mathcal{D}^*$, that is, if:

$$\sup_{\theta} R(\theta, \delta^{*M}) = \inf_{\delta^* \in \mathcal{D}^*} \sup_{\theta} R(\theta, \delta^*)$$
For a no-data decision problem, the minimax decision rule is simply called the minimax action.
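To see why randomization can help under the minimax principle, consider the no-data problem with the loss matrix from the loss-function section. The grid search below is my own illustration: a randomized action puts probability p on $a_1$ and $1 - p$ on $a_2$, and we minimize the worst-case expected loss over p.

```python
import numpy as np

# No-data problem with the 2x2 loss matrix from earlier:
# rows are states theta_1, theta_2; columns are actions a_1, a_2.
L = np.array([[0.0, 10.0],
              [1.0, 0.0]])

# A randomized action puts probability p on a_1 and 1 - p on a_2.
ps = np.linspace(0.0, 1.0, 100001)

# Expected loss under each state for every mixture p:
# under theta_1: 10(1 - p); under theta_2: p.
exp_loss = L[:, [0]] * ps + L[:, [1]] * (1 - ps)   # shape (2, len(ps))

worst_case = exp_loss.max(axis=0)    # sup over theta for each mixture
p_star = ps[worst_case.argmin()]     # ~ 10/11: equalizes the two losses
minimax_value = worst_case.min()     # ~ 10/11
```

The best deterministic action has a worst-case loss of 1 (choose $a_1$), while the mixture with $p \approx 10/11$ achieves a worst-case expected loss of $10/11 < 1$, so the minimax randomized action strictly beats every deterministic one in this problem.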