PAC results with Hoeffding-Bernstein’s inequality are the bread and butter of machine learning theory. In this post, we’ll see how to use Hoeffding’s inequality to derive agnostic PAC bounds with an error rate that decreases as in number of training samples . Then, we’ll see that in the realizable setting we can use Bernstein’s inequality to improve this to .

### Supervised Learning

Consider the task of regression with input space and output space . We have access to a model class (also known as hypothesis family) . We will assume to be finite for simplicity. There exists some unknown data distribution that is generating examples. Of course, we don’t have access to this data distribution but we have access to some samples from it. Let be independent identically distributed (i.i.d) samples from the data distribution i.e., and probability of observing a sample is given by .

In order, to evaluate the performance of a model we will need a loss function. Let be a loss function. Given a datapoint and a model , the loss associated with our prediction is given by . *For now, we will not make any assumption about .*

Our aim is to find a function that is as good as possible. What is a good function? Well one that has the least loss function across the input space i.e., its generalization error is as low as possible: . The only signal we have for finding such a is the training data . We will define the training error . The aim of machine learning theory is to (i) allow us to find a “good” , and (ii) bound the performance of the predicted model in terms of and other relevant parameters. The first result we will look at is the Occam’s Razor Bound and then we will see how to improve this result.

**Occam’s Razor Bound**

One of the first result that is taught in a machine learning theory class is the Occam’s Razor bound. The formal result is given in the theorem below. Occam’s Razor bound suggests using the model that minimizes the error on the observed sample i.e., the solution of the following optimization: . This solution is called the *empirical risk minimizer* (ERM) or the ERM solution. We will drop the from the notation when it is clear from the context.

**Theorem:** Let be a supervised learning problem with input space , output space , data distribution , and be the loss function. We are given i.i.d samples from the data distribution . Let be a finite collection of models from to . Fix any then with probability at least over samples drawn i.i.d. from , we have:

, and

.

**Proof: **Fix any model in the model family . Let . Then observe and are i.i.d random variables. They are random variables with respect to the sample which is randomly chosen. Further, and

,

where the first equality follows from linearity of expectation and the second inequality follows from iid assumption. Then using Hoeffding’s inequality:

,

Setting , we get: with probability of at least . Let be result derived in the above equation then we have or in other words ( represents the complement of measurable set ). Even though we derived this result for a fixed , we never used any property of and therefore, it holds for all . Therefore, we can bound the probability of using union bound:

.

This means with probability of at least we have:

.

Replacing by proves the first result. Let be the ERM solution and fix in . Then:

(from Hoeffidng’s inequality)

(from the definition of ERM)

(from Hoeffidng’s inequality)

This implies, for any , with probability at least : . As this holds for any , therefore, we can pick the with smallest value of . This proves the second result.

There are couple of things that stand out in this result:

**More training data helps:**Formally, the bound improves as in sample size . Informally, this means as we have more training data we will learn a model whose generalized error has a better upper bound.**Complex model class can be undesirable:**The bound becomes weaker as the model family becomes larger. This is an important lesson in model regularization: if we have two model class that have the same training error on a sample , then if is more complex class than i.e., , then the bound on the generalization error for the later will be weaker. Note that the result does not say that the larger model class is always bad. We can get better bounds with a more complex model class provided we can appropriately improve the training error.

### Improving Occam’s Razor Bound with Bernstein’s Inequality

Bernstein’s inequality allows us to incorporate the variance of the random variables in the bound thereby allowing us to get tighter bound for low variance settings.

**Theorem (Bernstein’s Inequality):** Let be i.i.d random variables with mean and variance . Further, let almost surely, for all . Then:

If the variance was negligible then we will get . This bound, if we repeat what we did for Occam’s Razor Bound, will give us bound.

Before we proceed to derive a in realizable setting, we will first state the Bernstein’s inequality in the more familiar form. Setting the right hand side to and solving the quadratic equation gives us:

With probability at least we have:

We now apply the inequality to get:

.

**Theorem:** Let be a supervised learning problem with input space , output space , data distribution , and loss function . Let there be such that . We are given i.i.d samples from the data distribution . Let be a finite collection of models from to . Fix any then with probability at least we have:

.

**Proof: **Fix a and define as before. This time we will apply Bernstein’s inequality. We have as the loss function is bounded in . Bernstein’s inequality gives us:

,

with probability of at least . All we need to do is bound the variance . As the loss function is bounded in therefore, . This gives us . Plugging this in we get:

.

Applying AM-GM inequality, we have giving us,

,

with probability . Applying union bound on all hypothesis we get:

,

with probability at least . Let be the ERM solution then:

(Bernstein’s inequality)

(ERM solution)

(Bernstein’s inequality)

(realizability condition)

Combining these we get with probability at least . Hence, proved.

**So what is happening?** We are making a realizability assumption i.e., there is a function such that . As the loss is lower bounded by 0, therefore, the variance of the samples of its loss are 0. Now as our ERM solver is becoming better with more data, it is converging towards this whose variance is 0. Therefore, the variance of the ERM (which is bounded by its mean ) is becoming smaller. This helps in bounding the rate of convergence.

**Summary: **We saw how Hoeffding’s inequality gives us a bound in number of samples . When we are in the realizable case, we can get using Bernstein’s inequality.