The ability of a machine learning model to generalize to unseen data is a critical performance measure that ensures the reliability of the model. Even though assessing the generalization ability of a model exactly is an intractable problem, it is possible to derive bounds on the error the model produces on new data. In this work, we study the generalization properties of loss functions. We define the theoretical tools needed to construct bounds on the generalization performance of loss functions for classification tasks.
1.1. The supervised learning problem
Given a labeled dataset \((x^{(i)}, y^{(i)}) \in X \times Y\) of input-output pairs, the statistical learning task aims to find a mapping between the input space \(X\) and the output space \(Y\) that performs well according to a metric specific to the problem. This mapping should generalize to new, unseen instances drawn from the same data distribution as the training dataset.
To this end, we assume access to a set of i.i.d. samples drawn from an unknown joint probability distribution \(D_{\rx\ry}\), and we seek a scoring function \(f \in F\) that minimizes a loss \(\ell : Y \times W \rightarrow \mathbb{R}_{+}\) representing the prediction error. Here \(F\) is called the hypothesis space, and \(f\) maps the input space \(X\) to a scoring space \(W\) that varies depending on the problem. The scores of \(f\) are transformed into actual predictions using a read-out or prediction function such as a sigmoid or softmax, among others. Since \(F\), the set of parameterized functions, is typically very large, we use optimization algorithms such as gradient descent with respect to the function's continuous parameters to find a suitable scoring function that minimizes the average loss over a training dataset. This approach is known as empirical risk minimization.
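As an illustration, here is a minimal sketch of empirical risk minimization by gradient descent, assuming NumPy, a linear hypothesis space, and a differentiable loss supplied by the caller (all hypothetical choices for this example):

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """Average loss of the linear scorer f(x) = <theta, x> over the dataset."""
    return np.mean([loss(yi, xi @ theta) for xi, yi in zip(X, y)])

def erm_gradient_descent(X, y, loss_grad, lr=0.1, steps=500):
    """Minimize the empirical risk over theta by gradient descent.

    loss_grad(y, w) must return the derivative of the loss w.r.t. the score w.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Chain rule: d loss(y_i, <theta, x_i>) / d theta = loss_grad(...) * x_i
        g = np.mean([loss_grad(yi, xi @ theta) * xi for xi, yi in zip(X, y)],
                    axis=0)
        theta -= lr * g
    return theta
```

The resulting scores \(X\theta\) would then be passed through a read-out function such as the sigmoid to obtain predictions.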
In this work, we will focus on classification problems only. We will start by studying binary classification and then generalize to the multiclass classification case.
In the case of binary classification, the label space is reduced to \(Y = \{0, 1\}\), and the scoring function takes values in \(\mathbb{R}\). This score represents the positive class score and is mapped to the \([0, 1]\) interval using the sigmoid function.
A typical loss function for binary classification is the 0-1 loss \(\ell_{0-1} (y, w) = \mathbb{1}_{w(2y - 1) < 0}\) which assigns 1 to misclassifications and 0 otherwise.
Considering now the optimization problem, minimizing the empirical risk associated with the 0-1 loss function is difficult: its gradient is null almost everywhere, and it is neither continuous nor convex. Therefore, for optimization purposes, it is usual to rely on convex surrogate loss functions, i.e., convex upper bounds of the 0-1 loss. Some examples of convex surrogates are listed below, together with their plots in Figure 1:
Exponential loss: \(\ell_{\exp} (y, w) = e^{-w(2y - 1)}\), used by the AdaBoost algorithm.
Hinge loss: \(\ell_{hinge} (y, w) = \max(0, 1 - w(2y - 1))\), used in SVMs.
Figure 1: Common loss functions
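Below is a minimal sketch of these losses in code (labels in \(\{0, 1\}\), scores in \(\mathbb{R}\)), with a numeric check that both surrogates indeed upper-bound the 0-1 loss:

```python
import numpy as np

def margin(y, w):
    """Margin w * (2y - 1): positive iff the score agrees with the label."""
    return w * (2 * y - 1)

def loss_01(y, w):
    return (margin(y, w) < 0).astype(float)

def loss_exp(y, w):
    return np.exp(-margin(y, w))

def loss_hinge(y, w):
    return np.maximum(0.0, 1.0 - margin(y, w))

w = np.linspace(-3, 3, 601)  # grid of scores
for y in (0, 1):
    assert np.all(loss_exp(y, w) >= loss_01(y, w))
    assert np.all(loss_hinge(y, w) >= loss_01(y, w))
```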
2. Risk
The risk, or true risk, of a scoring function is the expected value of the loss function over the data distribution \(D\) and is written as follows:
$$ \risk_{\ell} (f) = \E_{(\rx, \ry) \sim D} [ \ell (\ry, f(\rx)) ] $$
This formulation depends on the unknown data distribution and therefore cannot be computed directly. Hence, the optimization task uses the empirical risk as a proxy for the generalization risk, defined as follows:
$$ \erisk_{\mathcal{S}, \ell} (f) = \frac{1}{N} \sum_{i=1}^{N} \ell (y^{(i)}, f (x^{(i)})) $$
where \(\mathcal{S}\) is a dataset \(\{( x^{(i)}, y^{(i)} )\}_{1 \leq i \leq N}\) drawn from the distribution \(D\).
In both the true and empirical risk, if the loss function is omitted, the 0-1 loss is assumed by default.
2.1. Bayes risk
The Bayes risk \(\risk^{*}\) is the minimum value of the true risk \(\risk (f)\) associated with the 0-1 loss function; it is the risk of an optimal Bayes predictor \(f^{*}\), such that
$$ \risk^{*} = \inf_{f} \risk (f) = \inf_{f} \E_{(\rx, \ry) \sim D} [ \ell_{0-1} (\ry, f(\rx)) ] $$
where \(f\) is a measurable function. Using the law of total expectation, and by partitioning the previous expectation with respect to \(\rx\), we can rewrite the risk as an expression of the conditional expectation of \(\ry\) given \(\rx\):
$$ \risk (f) = \E_{\rx \sim D_{\rx}} \E_{\ry \sim D_{\ry}} [ \ell_{0-1} (\ry, f(x)) | \rx = x ] = \int_{x} \E_{\ry \sim D_{\ry}} [ \ell_{0-1} (\ry, f(x)) | \rx = x ] p(x) dx $$
Now, since each \(x\) can be treated independently, we can rewrite the previous Bayes risk formula as [2]:
$$ \risk^{*} = \E_{\rx \sim D_{\rx}} \inf_{f} \E [ \ell_{0-1} (\ry, f(x)) | \rx = x ] $$
There may be more than one Bayes estimator for a prediction problem, and the associated Bayes risk is nonzero unless there exists a deterministic dependency between \(x\) and \(y\).
Bayes estimators are more often than not arbitrary functions, and the hypothesis space fails to capture them (they cannot be easily modeled with typical parameterized models). Thus, the Bayes risk is the theoretical minimum of the true risk and is usually not reached by the optimization solution: \(\risk^{*} \leq \risk (f) \quad \forall f \in F\).
In the binary case, if we denote \(\eta (x) = \mathbb{P} (\ry = 1 | \rx = x)\), then we can write the Bayes risk as \(\risk^{*} = \E [\min(\eta(x), 1 - \eta(x))]\). For the multiclass case, if we write \(\eta (x) = [\eta_{1} (x), ..., \eta_{k} (x)]^{T}\) where \(\eta_{i} (x) = \mathbb{P} (\ry = i | \rx = x)\), then \(\risk^{*} = \E [1 - \max_{1 \leq i \leq k} \eta_{i} (x)]\).
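As a quick illustration, both formulas can be evaluated on a toy discrete distribution (the numbers below are hypothetical):

```python
import numpy as np

# Toy input marginal p(x) over three points, and eta(x) = P(y = 1 | x).
p_x = np.array([0.5, 0.3, 0.2])
eta = np.array([0.9, 0.5, 0.2])

# Binary case: R* = E[min(eta(x), 1 - eta(x))]
print(np.sum(p_x * np.minimum(eta, 1 - eta)))  # 0.05 + 0.15 + 0.04 = 0.24

# Multiclass case (k = 3): R* = E[1 - max_i eta_i(x)]
eta_mc = np.array([[0.7, 0.2, 0.1],
                   [0.4, 0.4, 0.2],
                   [0.3, 0.3, 0.4]])
print(np.sum(p_x * (1 - eta_mc.max(axis=1))))  # 0.15 + 0.18 + 0.12 = 0.45
```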
2.2. Excess risk
The excess risk of a predictor function \(f\) is equal to \(\risk (f) - \risk^{*}\), which is non-negative. Here \(\risk (f)\) denotes the true risk relative to the 0-1 loss, unless another loss is indicated by a subscript.
It measures the risk of a predictor over the whole data distribution relative to the baseline given by the theoretical minimum risk, the Bayes risk.
We can define the Bayes risk for a convex surrogate, \(\risk^{*}_{\ell}\), by replacing the 0-1 loss function with its surrogate. By doing so, we can try to upper bound the excess risk relative to the 0-1 loss with the excess risk of its surrogate:
$$ \risk (f) - \risk^{*} \leq H (\risk_{\ell} (f) - \risk^{*}_{\ell}) $$
where \(H: \mathbb{R}_{+} \rightarrow \mathbb{R}_{+}\), when it exists, is usually called a calibration function. \(H\) depends on the choice of the convex surrogate, and the tighter the calibration function, the more precise the excess risk estimate [2]. Such a bound ensures that a minimizer of the convex surrogate also minimizes the 0-1 loss, up to the rate expressed by the calibration function.
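For example, for the hinge and exponential losses, the \(\psi\)-transform analysis of Bartlett et al. [3] gives \(H(t) = t\) and (up to the loosening \(\psi(\theta) = 1 - \sqrt{1 - \theta^{2}} \geq \theta^{2}/2\)) \(H(t) = \sqrt{2t}\), respectively:
$$ \risk (f) - \risk^{*} \leq \risk_{\ell_{hinge}} (f) - \risk^{*}_{\ell_{hinge}}, \qquad \risk (f) - \risk^{*} \leq \sqrt{2 \left( \risk_{\ell_{\exp}} (f) - \risk^{*}_{\ell_{\exp}} \right)} $$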
2.3. Excess risk decomposition
Let us define \(\hat{f} \in \argmin_{f \in F} \erisk (f)\), the minimizer of the empirical risk over \(F\). We can then decompose the excess risk of this target function \(\hat{f}\) as follows:
$$
\risk (\hat{f}) - \risk^{*} =
\underbrace{\big(\quad \risk (\hat{f}) - \inf_{f' \in F} \risk (f') \quad\big)}_{\text{estimation error}}
+ \underbrace{\big(\quad \inf_{f' \in F} \risk (f') - \risk^{*} \quad\big)}_{\text{approximation error}}
$$
Here the approximation error reflects how far the chosen hypothesis space is from the Bayes estimator, whereas the estimation error represents the generalization error of the target function obtained from the training dataset.
The error usually extends beyond the estimation and approximation errors alone. Optimization algorithms, especially over large hypothesis spaces, do not find \(\hat{f}\) but some \(\tilde{f}\) corresponding to a local minimum. This is especially true for deep neural networks, and although the optimum can be found exactly for certain simple models, it is safer to assume that the optimization result does not correspond to it. That said, this difference is purely experimental (it depends on the initialization, the optimization algorithm, and the type of model), so it is ignored in what follows. It is nevertheless instructive to rewrite the excess risk decomposition of \(\tilde{f}\) as:
$$
\risk (\tilde{f}) - \risk^{*} =
\underbrace{\big(\quad \risk (\tilde{f}) - \risk (\hat{f}) \quad\big)}_{\text{optimization error}}
+ \underbrace{\big(\quad \risk (\hat{f}) - \inf_{f' \in F} \risk (f') \quad\big)}_{\text{estimation error}}
+ \underbrace{\big(\quad \inf_{f' \in F} \risk (f') - \risk^{*} \quad\big)}_{\text{approximation error}}
$$
Figure 2 illustrates how these three errors are related.
Figure 2: Different errors of target function
We can draw a parallel between the estimation/approximation error decomposition and the variance/bias error decomposition: reducing the bias by using a bigger hypothesis space makes the approximation error smaller, while increasing the variance, i.e., the estimation error.
So, given an optimization problem and a family of target functions (hypothesis space), and ignoring the marginal optimization error, we are trying to minimize the estimation error of our target function, which represents the additional error accumulated over the whole data distribution.
3. Classification-calibration
An important property resulting from the previous definition of the Bayes risk is the classification-calibration\(^{1}\) of a convex surrogate \(\ell\). We need to ensure that the solution of the optimization problem based on the convex surrogate leads, or converges given enough data, to the Bayes classifier; that is, the optimal Bayes decision rule can be recovered by minimizing the surrogate \(\ell\) as a convex upper-bound proxy for the 0-1 loss.
Let \(f^{*}_{\ell}\) be the minimizer of the true risk relative to \(\ell\). \(\ell\) is said to be classification-calibrated if \(f^{*} = \mathrm{sign}(f^{*}_{\ell})\). The following Theorem 1 provides a simple characterization of consistency for the binary case [4].
Theorem 1
Let \(\ell\) (in margin-based form\(^{2}\)) be convex. Then \(\ell\) is classification-calibrated if and only if it is differentiable at \(0\) and \(\ell ' (0) < 0\).
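For example, in margin-based form the hinge loss \(\ell(m) = \max(0, 1 - m)\) is convex, differentiable at \(0\) (only the linear branch is active there), and satisfies \(\ell'(0) = -1 < 0\); the exponential loss \(\ell(m) = e^{-m}\) likewise has \(\ell'(0) = -1 < 0\). Both are therefore classification-calibrated by Theorem 1.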
This condition excludes the perceptron loss \(\ell_{perceptron}(y, w) = \max(0, -w (2y - 1))\) since it is not differentiable at \(0\).
It is trivial to see that the minimizer of the conditional risk relative to the perceptron loss
\begin{align}
& \E_{\ry \sim D_{\ry}} [ \ell_{perceptron} (\ry, f(x)) | \rx = x ] \nonumber \\
= \quad & p(\ry = 1 | \rx = x) \max(0, -f(x)) \quad + \quad p(\ry = 0 | \rx = x) \max(0, f(x)) \nonumber
\end{align}
is the function \(f(x) = 0\), which leads to a different decision rule than the Bayes estimator.
3.1. Example of a non-classification-calibrated multiclass loss function
Let \(\ell (y, w) = \left [ 1 - (w_{y} - \min_{i} w_{i}) + \sum_{j \neq y} (w_{j} - \min_{i} w_{i}) \right ]_{+}\), where \(\left [ x \right ]_{+}\) denotes the \(\max (0, x)\) operation. The loss is zero only when the gold score exceeds the sum of the other (shifted) scores by at least \(1\). Intuitively, this pushes the evaluation towards a one-vs-rest binary evaluation.
Lemma 1
The minimizer \(f^{*}\) of \(\E \left [ \ell(\ry, f(x)) | \rx = x \right ]\) has the following properties:
If \(\max_{i} P(\ry = i | \rx = x) > 1/2\), then \(\argmax_{i} P(\ry = i | \rx = x) = \argmax_{i} f_{i} (x)\).
If \(\max_{i} P(\ry = i | \rx = x) < 1/2\), then \(\forall i \; f_{i} (x) = \min_{j} f_{j} (x)\).
Proof in the appendix.
We can see from Lemma 1 that the minimizer of the risk associated with this loss does not always correspond to the Bayes predictor (namely, when \(\max_{i} P(\ry = i | \rx = x) < 1 / 2\)).
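A quick numeric sanity check of Lemma 1 (a sketch assuming SciPy; Nelder-Mead is used because the objective is piecewise linear):

```python
import numpy as np
from scipy.optimize import minimize

def surrogate(y, w):
    """[1 - (w_y - min_i w_i) + sum_{j != y} (w_j - min_i w_i)]_+"""
    g = w - w.min()
    return max(0.0, 1.0 - g[y] + (g.sum() - g[y]))

def conditional_risk(w, P):
    return sum(P[y] * surrogate(y, w) for y in range(len(P)))

for P in (np.array([0.6, 0.3, 0.1]),     # max_i P_i > 1/2: argmax should win
          np.array([0.4, 0.35, 0.25])):  # max_i P_i < 1/2: scores collapse
    res = minimize(conditional_risk, np.zeros(3), args=(P,),
                   method="Nelder-Mead")
    # Expected: g ~ [1, 0, 0] with risk 0.8, then g ~ [0, 0, 0] with risk 1.0
    print(np.round(res.x - res.x.min(), 2), round(res.fun, 2))
```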
4. Generalization error
Definition
Given a dataset \(\mathcal{S} = \{(x^{(i)}, y^{(i)})\}_{1 \leq i \leq N}\), a target function \(f\) and a loss function \(\ell\), the generalization error is the difference between the true risk and empirical risk \(\E (f, \mathcal{S}, \ell) = \risk (f, \ell) - \erisk (f, \mathcal{S}, \ell)\).
4.1. Generalization error bound
A generalization bound gives an estimate of the maximum generalization error for a given hypothesis space and a given loss function. As we saw previously, studying the generalization error bound gives us an upper bound on the estimation error.
There are commonly two ways to derive a generalization upper bound:
4.1.1. Expected generalization error bound
In this approach, the goal is to find an upper bound \(\epsilon\) on the expected maximum generalization error:
\begin{equation}
\label{eq:expected_gen_error}
\E [ \sup_{f \in F} | \risk (f, \ell) - \erisk (f, \mathcal{S}, \ell) | ] \leq \epsilon
\end{equation}
4.1.2. High-probability generalization error bound
In this approach, instead of studying the expectation of the generalization error, we look for a \(1 - \delta\) confidence interval:
\begin{equation}
\label{eq:high_prob_gen_error_delta}
\mathbb{P} (\sup_{f \in F} | \risk (f, \ell) - \erisk (f, \mathcal{S}, \ell) | > \epsilon) \leq \delta
\end{equation}
\begin{equation}
\label{eq:high_prob_gen_error_inv_delta}
\mathbb{P} (\sup_{f \in F} | \risk (f, \ell) - \erisk (f, \mathcal{S}, \ell) | \leq \epsilon) \geq 1 - \delta
\end{equation}
Unlike the bound in eq. \eqref{eq:expected_gen_error}, which only describes the average case, high-probability bounds are stronger: they indicate that it is very unlikely for the quantity of interest to exceed the stated threshold.
For our first attempt to derive an error bound, we will focus on simple convex surrogates such as the exponential and perceptron loss functions.
$$ \ell_{exp} (y, w) = e^{-w(2y - 1)} $$
$$ \ell_{perceptron} (y, w) = \max(0, -w(2y - 1)) $$
4.3. Generalization bound in the case of a supervised binary classification problem
Bounded differences property
We say that a function \(f : X_{1} \times X_{2} \times ... \times X_{n} \rightarrow \mathbb{R}\) satisfies the bounded differences property if there exist constants \(c_{1}, c_{2}, ..., c_{n}\) such that for all \(i \leq n\) and all \(x_{1} \in X_{1}, x_{2} \in X_{2}, ..., x_{n} \in X_{n}\),
$$ \sup_{x'_{i} \in X_{i}} | f(x_{1}, ..., x_{i-1}, x_{i}, x_{i+1}, ..., x_{n}) - f(x_{1}, ..., x_{i-1}, x'_{i}, x_{i+1}, ..., x_{n})| \leq c_{i} $$
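For instance, the empirical mean \(f(x_{1}, ..., x_{n}) = \frac{1}{n} \sum_{i=1}^{n} x_{i}\) of values lying in \([a, b]\) satisfies the bounded differences property with \(c_{i} = (b - a)/n\), since changing a single coordinate moves the mean by at most \((b - a)/n\).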
McDiarmid's Inequality
Let \(f : X_{1} \times X_{2} \times ... \times X_{n} \rightarrow \mathbb{R}\) satisfy the bounded differences property with bounds \(c_{1}, c_{2}, ..., c_{n}\).
Consider independent random variables \(\rx_{1}, \rx_{2}, ..., \rx_{n}\) where \(\rx_{i} \in X_{i}\) for all \(i\). Then, for any \(\epsilon > 0\),
$$ P(|\E[f(\rx_{1}, \rx_{2}, ..., \rx_{n})] - f(\rx_{1}, \rx_{2}, ..., \rx_{n})| \geq \epsilon) \leq 2 \exp \left( - \frac{2 \epsilon^{2}}{\sum_{i=1}^{n} c_{i}^{2}} \right) $$
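Applied to the empirical mean example above, where \(\sum_{i=1}^{n} c_{i}^{2} = (b - a)^{2}/n\), McDiarmid's inequality recovers Hoeffding's inequality: \(P(|\E[f] - f| \geq \epsilon) \leq 2 \exp(-2 n \epsilon^{2}/(b - a)^{2})\).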
Verifying the bounded differences property
We can rewrite the previous empirical risk for a specific loss function as a function of the predictions, as follows:
$$ \erisk_{\ell} (\tilde{y}) = \frac{1}{N} \sum_{i=1}^{N} \ell (y_{i}, \tilde{y}_{i}) $$
where \(\tilde{y}\) is the vector of predictions.
We can prove that for a target function satisfying \(f (x) \in \{-1, 1\}\), we have:
$$ \sup_{\tilde{y}'_{i}} | \erisk_{\ell_{perceptron}, \mathcal{S}} (\tilde{y}_{1}, ..., \tilde{y}_{i}, ...) - \erisk_{\ell_{perceptron}, \mathcal{S}} (\tilde{y}_{1}, ..., \tilde{y}'_{i}, ...) | \leq \frac{1}{N} \; \forall i$$
and
$$ \sup_{\tilde{y}'_{i}} | \erisk_{\ell_{exp}, \mathcal{S}} (\tilde{y}_{1}, ..., \tilde{y}_{i}, ...) - \erisk_{\ell_{exp}, \mathcal{S}} (\tilde{y}_{1}, ..., \tilde{y}'_{i}, ...) | \leq \frac{e - e^{-1}}{N} \; \forall i$$
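These constants can be verified by brute force over all (label, score) combinations; a small sketch (the \(1/N\) factor comes from the averaging in the empirical risk):

```python
import numpy as np
from itertools import product

def loss_perceptron(y, w):
    return max(0.0, -w * (2 * y - 1))

def loss_exp(y, w):
    return np.exp(-w * (2 * y - 1))

# Largest change of a single loss term when one prediction w in {-1, 1}
# is swapped, over every label y in {0, 1}.
for name, loss in (("perceptron", loss_perceptron), ("exponential", loss_exp)):
    c = max(abs(loss(y, w1) - loss(y, w2))
            for y, w1, w2 in product((0, 1), (-1, 1), (-1, 1)))
    print(name, c)  # 1.0 and e - 1/e ~ 2.35; divide by N inside the risk
```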
By applying McDiarmid's inequality, we find the following:
$$ P ( | \risk (f, \ell_{perceptron}) - \erisk (f, \mathcal{S}, \ell_{perceptron}) | \ge \epsilon ) \le 2 \exp (- 2 N \epsilon^{2}) $$
and
$$ P ( | \risk (f, \ell_{exp}) - \erisk(f, \mathcal{S}, \ell_{exp}) | \ge \epsilon ) \le 2 \exp \left( \frac{- 2 N \epsilon^{2}}{(e - e^{-1})^{2}} \right) $$
Assuming for now that the hypothesis space \(\mathcal{F}\) is finite, we can bound the supremum condition by a union over the whole hypothesis space:
\begin{align}
P ( \sup_{f \in \mathcal{F}} | \risk (f, \ell) - \erisk (f, \mathcal{S}, \ell) | > \epsilon)
& \le P ( \bigcup_{f \in \mathcal{F}} | \risk (f, \ell) - \erisk (f, \mathcal{S}, \ell) | > \epsilon) \nonumber \\
& \le \sum_{f \in \mathcal{F}} P ( | \risk (f, \ell) - \erisk (f, \mathcal{S}, \ell) | > \epsilon) \nonumber
\end{align}
Using now the previous inequalities:
$$ P ( \sup_{f \in \mathcal{F}} | \risk (f, \ell_{perceptron}) - \erisk (f, \mathcal{S}, \ell_{perceptron}) | > \epsilon) \le 2 |\mathcal{F}| \exp (- 2 N \epsilon^{2}) $$
and
$$ P ( \sup_{f \in \mathcal{F}} | \risk (f, \ell_{exp}) - \erisk (f, \mathcal{S}, \ell_{exp}) | > \epsilon) \le 2 |\mathcal{F}| \exp \left( \frac{- 2 N \epsilon^{2}}{(e - e^{-1})^{2}} \right) $$
Then we can derive the high-probability bounds with confidence \(1 - \delta\):
$$ \risk (f, \ell_{perceptron}) \leq \erisk (f, \mathcal{S}, \ell_{perceptron}) + \sqrt{\frac{\ln(|\mathcal{F}|) + \ln(\frac{2}{\delta})}{2N}} $$
and
$$ \risk (f, \ell_{exp}) \leq \erisk (f, \mathcal{S}, \ell_{exp}) + (e - e^{-1}) \sqrt{\frac{\ln(|\mathcal{F}|) + \ln (\frac{2}{\delta})}{2N}} $$
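To get a sense of the magnitudes, the two bounds can be evaluated for some hypothetical values of \(N\), \(|\mathcal{F}|\), and \(\delta\):

```python
import numpy as np

N, card_F, delta = 10_000, 2**20, 0.05  # hypothetical values
eps = np.sqrt((np.log(card_F) + np.log(2 / delta)) / (2 * N))
print(f"perceptron bound:  {eps:.4f}")                        # ~0.0296
print(f"exponential bound: {(np.e - np.exp(-1)) * eps:.4f}")  # ~0.0696
```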
The bounds still depend on the complexity of the hypothesis space for both loss functions, although we can see that the generalization bound of the perceptron loss is tighter than that of the exponential loss. This type of bound is tighter for loss functions with smaller asymptotic growth rates (\(\mathcal{O} (x)\) versus \(\mathcal{O} (e^{x})\)); moreover, we would have obtained the same result using Hoeffding's inequality.
This reflects the fact that convex surrogates are useful for solving the optimization problem but are not a good tool for assessing the error of a target function; the 0-1 loss is better in this regard.
5. Conclusion
In this work, we explored the theoretical framework behind risk minimization techniques for binary classification tasks. We studied the decomposition of the excess risk of a scoring function and how it relates to the empirical risk computed from the data. We derived important properties of convex surrogate loss functions, as well as conditions for the consistency of those surrogates that ensure finding the optimal scoring function. Additionally, we studied the different types of probability bounds on the generalization risk and derived simple bounds for common loss functions.
6. Appendix
6.1. Proof of inconsistency
Let us define \(g_{j} = w_{j} - \min_{i} w_{i}\), so that the loss becomes \(\ell_{i} = \ell (i, g) = \left [ 1 - g_{i} + \sum_{j \neq i} g_{j} \right ]_{+}\) for a given input \(x\), with \(\forall i \; g_{i} \geq 0\).
Property 1
If there exists \(i\) such that \(\ell_{i} = 0\), then \(\forall j \neq i \; \ell_{j} > 0\).
Proof
Suppose that \(\ell_{i_{1}} = 0\) and \(\ell_{i_{2}} = 0\) with \(i_{1} \neq i_{2}\). Then
\begin{align}
& \begin{cases}
1 - g_{i_{1}} + \sum_{j \neq i_{1}} g_{j} \leq 0\\
1 - g_{i_{2}} + \sum_{j \neq i_{2}} g_{j} \leq 0
\end{cases}
\nonumber \\
\implies
& \begin{cases}
1 - g_{i_{1}} + g_{i_{2}} + \sum_{\substack{j \neq i_{1} \\ j \neq i_{2}}} g_{j} \leq 0\\
1 - g_{i_{2}} + g_{i_{1}} + \sum_{\substack{j \neq i_{1} \\ j \neq i_{2}}} g_{j} \leq 0
\end{cases}
\nonumber \\
\implies
& 2 + 2 \sum_{\substack{j \neq i_{1} \\ j \neq i_{2}}} g_{j} \leq 0 \quad \text{(summing the two inequalities)}
\nonumber \\
\implies
& \sum_{\substack{j \neq i_{1} \\ j \neq i_{2}}} g_{j} \leq -1 \quad \text{(impossible since } \forall j \; g_{j} \geq 0 \text{)} \nonumber
\end{align}
This means that if \(\ell_{i} = 0\), then \(\forall j \neq i \; \ell_{j} > 0\).
6.1.1. Proof of lemma 1
Proof
Let \(\ell^{*}\) be the loss vector at the minimizer of \(C = \E \left [ \ell_{i} | \rx = x \right ]\). From Property 1, \(\ell^{*}\) satisfies one of the following cases: (1) \(\forall i \; \ell^{*}_{i} > 0\); or (2) \(\exists m \; \ell^{*}_{m} = 0\) and \(\forall i \neq m \; \ell^{*}_{i} > 0\).
case (1)
\(\forall i \; \ell^{*}_{i} > 0\); the problem becomes
\begin{align}
\min_{g} & \sum_{i} P_{i} (1 - g_{i} + \sum_{j \neq i} g_{j}) \nonumber \\
\text{s.t.\ } & \forall i \; g_{i} \geq 0 \nonumber
\end{align}
The solution of this problem satisfies the following K.K.T. conditions:
\begin{align}
\forall i : \quad & \frac{\partial}{\partial g_{i}} \left ( \sum_{i} P_{i} (1 - g_{i} + \sum_{j \neq i} g_{j}) - \sum_{i} \lambda_{i} g_{i} \right ) = 0 \nonumber \\
\forall i : \quad & g_{i} \geq 0 \nonumber \\
\forall i : \quad & \lambda_{i} \geq 0 \nonumber \\
& \sum_{i} \lambda_{i} g_{i} = 0 \nonumber
\end{align}
From the stationarity condition we have
$$ \forall i : \quad - P_{i} + \sum_{j \neq i} P_{j} - \lambda_{i} = 0 $$
thus
$$ \forall i : \quad - P_{i} + \sum_{j \neq i} P_{j} = \lambda_{i} $$
with \(\lambda_{i} \geq 0\).
If \(\max_{i} P_{i} > 1/2\), then for the maximizing index \(\lambda_{i} = 1 - 2 P_{i} < 0\), which is impossible.
Otherwise, \(\forall i \; P_{i} < 1/2\), so \(\lambda_{i} > 0\), which means that \(g_{i} = 0\) (from the complementary slackness condition).
It follows that \(\forall i \; \ell^{*}_{i} = 1\) and \(\min_{g} C = 1\).
case (2)
\(\exists m \; \ell^{*}_{m} = 0\) and \(\forall i \neq m \; \ell^{*}_{i} > 0\); the problem now becomes
\begin{align}
\min_{g} & \sum_{i \neq m} P_{i} (1 - g_{i} + \sum_{j \neq i} g_{j}) \nonumber \\
\text{s.t.\ } & \forall i \; g_{i} \geq 0 \nonumber \\
& 1 - g_{m} + \sum_{j \neq m} g_{j} \leq 0 \nonumber
\end{align}
From the stationarity condition we have
\begin{align}
\forall i \neq m \quad & - P_{i} + \sum_{\substack{j \neq i \\ j \neq m}} P_{j} - \lambda_{i} + \mu = 0 \nonumber \\
i = m \quad & \sum_{j \neq m} P_{j} - \lambda_{m} - \mu = 0 \nonumber
\end{align}
We know that \(g_{m} \geq 1 > 0\) from the added constraint; then, from the complementary slackness condition, \(\lambda_{m} = 0\). Thus
\begin{align}
\mu = \sum_{j \neq m} P_{j} \nonumber
\end{align}
If \(\sum_{j \neq m} P_{j} = 0\), then \(\min C = 0\), attained with \(\ell_{m} = 0\).
Otherwise \(\mu \neq 0\) which means from complementary slackness condition
\begin{equation}
\label{eq:star}
1 + \sum_{j \neq m} g_{j} = g_{m}
\end{equation}
We have also
\begin{align}
\forall i \neq m \quad & - P_{i} + \sum_{\substack{j \neq i \\ j \neq m}} P_{j} - \lambda_{i} + \sum_{j \neq m} P_{j} = 0 \nonumber \\
\implies \forall i \neq m \quad & - P_{i} + \sum_{\substack{j \neq i \\ j \neq m}} P_{j} - \lambda_{i} + \sum_{\substack{j \neq m \\ j \neq i}} P_{j} + P_{i} = 0 \nonumber \\
\implies \forall i \neq m \quad & \lambda_{i} = 2 \sum_{\substack{j \neq i \\ j \neq m}} P_{j} \nonumber
\end{align}
Again, here we have two cases:
\(\forall i \neq m \quad \sum_{\substack{j \neq m \\ j \neq i}} P_{j} > 0\) then from complementary slackness \(\forall i \neq m \quad g_{i} = 0\) and from eq. \eqref{eq:star}, \(g_{m} = 1\).
It follows that \(\min C = 2 \sum_{i \neq m} P_{i}\).
There exists a unique \(i\) such that \(\sum_{\substack{j \neq m \\ j \neq i}} P_{j} = 0\), which means that \(P_{i} + P_{m} = 1\). Consequently, \(\min C = P_{i} (1 - g_{i} + g_{m})\), and using eq. \eqref{eq:star}, we find that \(\min C = 2 P_{i} = 2 \sum_{j \neq m} P_{j}\), the same as in the first case.
Finally, comparing with case (1), where \(\min C = 1\), we see that \(2 \sum_{i \neq m} P_{i} < 1\) only if \(P_{m} > 1/2\).
This means that if \(\max_{i} P_{i} > 1/2\), then there exists \(m\) (the index of the maximum) such that \(\ell^{*}_{m} = 0\).
Otherwise, \(\forall i \; \ell^{*}_{i} = 1\) and \(g^{*}_{i} = 0\).
References
[1] Ali Akbari, Muhammad Awais, Manijeh Bashar, and Josef Kittler. How does loss function affect generalization performance of deep learning? application to human age estimation. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 141–151. PMLR, 18–24 Jul 2021.
[2] Francis Bach. Learning theory from first principles. lecture 1-3. Class notes for Mathematics, Machine Learning, Sciences, and Humanities: Master’s Year 2, October 2020.
[3] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[4] Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
[5] Michael M. Wolf. Mathematical foundations of supervised learning. Class notes, July 2020.
1 Classification-calibration is also referred to in the literature as Bayes consistency or Fisher consistency.
2 A margin-based loss function is a function that maps the product \(w (2y - 1)\), called the margin, to \(\mathbb{R}\).