Decision Theory - High Level Overview
In data science problems, we usually need to:
- Make a decision
- Take an action
- Produce some output
We also have some evaluation criterion.
Actions
An action is the generic term for what is produced by our system.
Examples of Actions
- Produce a 0/1 classification
- Reject hypothesis
Taking machine translation as an example, the input could be an entire paragraph of French text, and the action we take is producing the corresponding English translation.
Evaluation Criterion
We need to evaluate the action in some way; decision theory is about finding "optimal" actions under various definitions of optimality.
Examples of Evaluation criteria
- Is classification correct?
- Does text transcription exactly match the spoken words?
Formalizing a Business Problem
The first two steps to formalizing a problem:
- Define the action space
- Specify the evaluation criterion
Formalization may evolve gradually, as you understand the problem better.
Input
Most problems have an extra piece, which goes by various names:
- Inputs in Machine Learning
- Covariates in Statistics
Examples of inputs:
- A picture, for which we must determine whether or not it contains a cat.
- A user's past history and profile, used to predict whether or not they will churn.
- A search query
Output
Also known as the outcome or label. It is the right answer, or what actually happens. Inputs are often paired with outputs, outcomes, or labels.
Examples of outputs:
- Whether or not the picture actually contains an animal.
- Whether or not the user churned.
In the case of a self-driving car, the outcome is not literally a right or wrong answer. It is just the result of the action you take: for example, you press the accelerator and turn the steering wheel, and the outcome is how the car moved, i.e., its new position in the world.
Typical sequence of events
Many problem domains can be formalized as follows:
- Observe an input x
- Take an action a
- Observe the outcome y
- Evaluate the action in relation to the outcome: \(l(a, y)\)
The outcome \(y\) is often independent of the action \(a\), but this is not always the case. In self-driving, the action you take affects the outcome; in churn prediction, the outcome does not change based on your action, since you just make a prediction and do not act on it.
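As a toy sketch of this loop (with a hypothetical decision function and the square loss; the names and data-generating process here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def decision_function(x):
    # Hypothetical decision function: predict twice the input.
    return 2.0 * x

def loss(a, y):
    # Square loss l(a, y) = (a - y)^2.
    return (a - y) ** 2

x = rng.normal()                     # observe an input x
a = decision_function(x)             # take an action a
y = 2.0 * x + rng.normal(scale=0.1)  # observe the outcome y (here, unaffected by a)
print(loss(a, y))                    # evaluate the action in relation to the outcome
```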
Formalization:
The Spaces
The three spaces :
- Input space \(X\)
- Action space \(A\)
- Output space \(Y\)
What are the spaces for linear regression?
Input space: \(\mathbb{R}^d\), Action space: \(\mathbb{R}\), Output space: \(\mathbb{R}\)
What are the spaces for logistic regression?
Input space: \(\mathbb{R}^d\), Action space: \([0, 1]\) (a predicted probability), Output space: \(\{0, 1\}\)
Decision Function
A decision function (or prediction function) takes an input \(x \in X\) and produces an action \(a \in A\):
\[ f: X \rightarrow A \]
Loss Function
A loss function evaluates an action in the context of the outcome \(y\).
\[ l: A \times Y \rightarrow \mathbb{R}, \quad (a, y) \mapsto l(a, y) \]
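For instance, the square loss used in regression and the 0-1 loss used in classification (both appear in the examples below) can be written as small functions. This is a minimal sketch, not tied to any particular library:

```python
def square_loss(a, y):
    # Square loss l(a, y) = (a - y)^2, for A = Y = R.
    return (a - y) ** 2

def zero_one_loss(a, y):
    # 0-1 loss l(a, y) = 1(a != y), for A = Y = {0, 1, ..., K-1}.
    return 1 if a != y else 0
```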
Real Life: Formalizing a "Data Science" Problem
First two steps to formalizing the problem:
- Defining the action space (i.e., the set of all possible actions)
- Specifying the evaluation criterion
For example, when a stakeholder asks the data scientist to solve a problem, they:
- may have an opinion on what the action space should be, and
- hopefully have an opinion on the evaluation criterion, but
- really care about you producing a good decision function.
Evaluating a Decision Function
- Loss function \(l\) evaluates a single action
- How do we evaluate the decision function as a whole?
Before deploying a decision function, the stakeholder needs an assessment of how well it is going to perform. The standard framework called Statistical Learning Theory gives a recipe for that.
Statistical Learning Theory
SLT characterizes the average performance that your decision function will achieve.
Setup for Statistical Learning Theory
Assumption in SLT: Action has no effect on the output.
Assume there is a data generating distribution \(P_{X \times Y}\).
All input/output pairs \((x,y)\) are generated i.i.d from \(P_{X \times Y}\).
i.i.d. means "independent and identically distributed"; practically, it means:
- no covariate shift
- no concept drift
Want decision function \(f(x)\) that generally “does well on average”:
- \(l(f(x),y)\) is usually small
How are we going to formalize this?
The Risk Functional
The risk is just a fancy name for the expected loss on a new sample \((x, y)\) drawn randomly from \(P_{X \times Y}\).
Definition:
The risk of a decision function \(f : X \rightarrow A\) is
\[ R(f) = \mathbb{E}[l(f(x),y)] \]
The risk cannot be computed: since we don't know \(P_{X \times Y}\), we cannot compute the expectation. But we can estimate it…
The Bayes Decision Function
The Bayes decision function is the decision function that minimizes the risk. It is the best possible decision function when evaluated in terms of expected loss on a randomly chosen data point.
Definition:
A Bayes decision function \(f^* : X \rightarrow A\) is a function that achieves the minimal risk among all possible functions:
\[ f^* = \operatorname*{argmin}_{f} R(f), \]
where the minimum is taken over all functions from \(X\) to \(A\).
The risk of a Bayes decision function is called the Bayes Risk.
A Bayes decision function is often called the target function, since it is the best decision function we can possibly produce.
Examples
Least Squares Regression
- Spaces: \(A = Y = \mathbb{R}\)
- Square loss:
\[ l(a,y) = (a-y)^2 \]
- Risk (the expected loss on a new sample):
\[ R(f) = \mathbb{E}[(f(x) - y)^2] \]
- Target function:
\[ f^*(x) = \mathbb{E}[y|x] \]
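To see why the conditional mean is the Bayes decision function for square loss, minimize the risk pointwise: for each fixed \(x\), choose the action \(a\) that minimizes the conditional expected loss
\[ \mathbb{E}[(a - y)^2 \mid x] = a^2 - 2a\,\mathbb{E}[y \mid x] + \mathbb{E}[y^2 \mid x]. \]
Setting the derivative with respect to \(a\) to zero gives \(a = \mathbb{E}[y \mid x]\), hence \(f^*(x) = \mathbb{E}[y \mid x]\).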
Multiclass Classification
- Spaces: \(A = Y = \{0, 1, \ldots, K-1\}\)
- 0-1 loss:
\[ l(a,y) = 1(a \neq y) := \begin{cases} 1 & \text{if } a \neq y \\ 0 & \text{otherwise} \end{cases} \]
- Risk is the misclassification error rate:
\[ R(f) = \mathbb{E}[1(f(x) \neq y)] = 0 \cdot P(f(x) = y) + 1 \cdot P(f(x) \neq y) = P(f(x) \neq y) \]
Here \(1(\cdot)\) is the indicator function, which evaluates to 1 if the expression inside the parentheses is true and 0 otherwise.
The risk of the decision function is the probability of error.
- Target function is the assignment to the most likely class:
\[ f^*(x) = \operatorname*{argmax}_{k} P(y = k \mid x) \]
Here \(k\) ranges over the classes and \(y\) is the output: we predict the class with the maximum conditional probability.
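To see why this is the Bayes decision function for the 0-1 loss, again minimize pointwise: for a fixed \(x\),
\[ \mathbb{E}[1(a \neq y) \mid x] = P(y \neq a \mid x) = 1 - P(y = a \mid x), \]
which is minimized by choosing the action \(a\) with the largest conditional probability \(P(y = a \mid x)\).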
Can’t compute \(R(f) = \mathbb{E}l(f(x),y)\) because we don’t know \(P_{x \times y}\) because no one tells us the probability of distribution describing the real world.
One thing we can do in data science is assume we have some sample data.
Let \(D_n = ((x_1, y_1), \ldots, (x_n, y_n))\) be drawn i.i.d. from \(P_{X \times Y}\).
Let’s draw some inspiration from the Strong law of large numbers:
If \(z, z_1, \ldots, z_n\) are i.i.d. with expected value \(\mathbb{E}z\), then
\[ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} z_{i} = \mathbb{E}z \]
with probability 1.
Basically, with enough data the sample average converges to a property of the distribution as a whole.
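A quick numerical illustration of this convergence (a toy sketch using standard normal draws, so the true expected value is 0):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)  # i.i.d. samples with E[z] = 0

# The sample average approaches the true expectation as n grows.
for n in (10, 1_000, 1_000_000):
    print(n, z[:n].mean())
```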
The Empirical Risk Functional
The risk is an expected loss; the empirical risk is an approximation of the risk based on a sample.
Let \(D_n = ((x_1, y_1), \ldots, (x_n, y_n))\) be drawn i.i.d. from \(P_{X \times Y}\).
Definition: The empirical risk of \(f: X \rightarrow A\) with respect to \(D_n\) is
\[ \hat{R}_{n}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i) \]
The losses are random because the data points on which they are calculated are random.
By the strong law of large numbers, the empirical risk converges to the actual risk of the function:
\[ \lim_{n\to\infty} \hat{R}_{n}(f) = R(f) \] almost surely.
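A minimal sketch of computing the empirical risk of a fixed decision function on a sample, assuming a toy data-generating distribution where \(y = 2x + \varepsilon\) with \(\varepsilon \sim N(0, 0.5^2)\) (so the true risk of \(f(x) = 2x\) under square loss is 0.25):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A fixed decision function whose risk we want to estimate.
    return 2.0 * x

def square_loss(a, y):
    return (a - y) ** 2

# Draw D_n i.i.d. from the toy P_{X x Y}.
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

empirical_risk = np.mean(square_loss(f(x), y))
print(empirical_risk)  # should be close to the true risk, 0.25
```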
Empirical Risk Minimizer
We want a risk minimizer; is an empirical risk minimizer close enough?
Definition:
A function \(\hat{f}\) is an empirical risk minimizer if
\[ \hat{f} = \operatorname*{argmin}_{f} \hat{R}_{n}(f), \]
where the minimum is taken over all functions.
Constrained ERM
ERM over all functions can lead to a function \(f\) that just memorizes the data. How, then, can we spread information, i.e., generalize, from training inputs to new inputs? We need to smooth things out somehow. One approach is constrained ERM: instead of minimizing the empirical risk over all decision functions, we constrain the minimization to a particular subset of functions called a hypothesis space.
Hypothesis Spaces
A hypothesis space \(F\) is a set of functions mapping \(X \rightarrow A\). It is the collection of decision functions we are considering.
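For example, a common hypothesis space for regression is the set of linear (affine) functions of the input:
\[ F = \{ x \mapsto w^{T} x + b : w \in \mathbb{R}^{d}, \ b \in \mathbb{R} \} \]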
Constrained Empirical Risk Minimization
Empirical Risk Minimization (ERM) in \(F\) is
\[ \hat{f}_{n} = \operatorname*{argmin}_{f \in F} \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i) \]
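A minimal sketch of constrained ERM with the linear hypothesis space above and the square loss, reusing the toy data-generating distribution from the empirical risk example; with this choice of \(F\) and loss, ERM reduces to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data D_n drawn from the toy P_{X x Y}: y = 2x + noise.
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# Hypothesis space F = {x -> w*x + b}. Minimizing the empirical risk
# (1/n) * sum (w*x_i + b - y_i)^2 over F is a least-squares problem.
X = np.column_stack([x, np.ones(n)])         # design matrix with an intercept column
w, b = np.linalg.lstsq(X, y, rcond=None)[0]  # empirical risk minimizer in F

f_hat = lambda x_new: w * x_new + b          # the fitted decision function
print(w, b)
```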