Concentration inequalities – Part I
Introduction
Concentration inequalities, or probability bounds, are very important tools for the analysis of Machine Learning algorithms or randomized algorithms. In statistical learning theory, we often want to show that random variables, given some assumptions, are close to its expectation with high probability. This article provides an overview of the most basic inequalities in the analysis of these concentration measures.
Markov’s Inequality
The Markov’s inequality is one of the most basic bounds and it assumes almost nothing about the random variable. The assumptions that Markov’s inequality makes is that the random variable \(X\) is non-negative \(X > 0\) and has a finite expectation \(\mathbb{E}\left[X\right] < \infty\). The Markov’s inequality is given by:
$$\underbrace{P(X \geq \alpha)}_{\text{Probability of being greater than constant } \alpha} \leq \underbrace{\frac{\mathbb{E}\left[X\right]}{\alpha}}_{\text{Bounded above by expectation over constant } \alpha}$$
What this means is that the probability that the random variable \(X\) will be bounded by the expectation of \(X\) divided by the constant \(\alpha\). What is remarkable about this bound, is that it holds for any distribution with positive values and it doesn’t depend on any feature of the probability distribution, it only requires some weak assumptions and its first moment, the expectation.
Example: A grocery store sells an average of 40 beers per day (it’s summer !). What is the probability that it will sell 80 or more beers tomorrow ?
$$
\begin{align}
P(X \geq \alpha) & \leq\frac{\mathbb{E}\left[X\right]}{\alpha} \\\\
P(X \geq 80) & \leq\frac{40}{80} = 0.5 = 50\%
\end{align}
$$
The Markov’s inequality doesn’t depend on any property of the random variable probability distribution, so it’s obvious that there are better bounds to use if information about the probability distribution is available.
Chebyshev’s Inequality
When we have information about the underlying distribution of a random variable, we can take advantage of properties of this distribution to know more about the concentration of this variable. Let’s take for example a normal distribution with mean \(\mu = 0\) and unit standard deviation \(\sigma = 1\) given by the probability density function (PDF) below:
$$ f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} $$
Integrating from -1 to 1: \(\int_{-1}^{1} \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\), we know that 68% of the data is within \(1\sigma\) (one standard deviation) from the mean \(\mu\) and 95% is within \(2\sigma\) from the mean. However, when it’s not possible to assume normality, any other amount of data can be concentrated within \(1\sigma\) or \(2\sigma\).
Chebyshev’s inequality provides a way to get a bound on the concentration for any distribution, without assuming any underlying property except a finite mean and variance. Chebyshev’s also holds for any random variable, not only for non-negative variables as in Markov’s inequality.
The Chebyshev’s inequality is given by the following relation:
$$
P( \mid X – \mu \mid \geq k\sigma) \leq \frac{1}{k^2}
$$
that can also be rewritten as:
$$
P(\mid X – \mu \mid < k\sigma) \geq 1 – \frac{1}{k^2}
$$
For the concrete case of \(k = 2\), the Chebyshev’s tells us that at least 75% of the data is concentrated within 2 standard deviations of the mean. And this holds for any distribution.
Now, when we compare this result for \( k = 2 \) with the 95% concentration of the normal distribution for \(2\sigma\), we can see how conservative is the Chebyshev’s bound. However, one must not forget that this holds for any distribution and not only for a normally distributed random variable, and all that Chebyshev’s needs, is the first and second moments of the data. Something important to note is that in absence of more information about the random variable, this cannot be improved.
Chebyshev’s Inequality and the Weak Law of Large Numbers
Chebyshev’s inequality can also be used to prove the weak law of large numbers, which says that the sample mean converges in probability towards the true mean.
That can be done as follows:
- Consider a sequence of i.i.d. (independent and identically distributed) random variables \(X_1, X_2, X_3, \ldots\) with mean \(\mu\) and variance \(\sigma^2\);
- The sample mean is \(M_n = \frac{X_1 + \ldots + X_n}{n}\) and the true mean is \(\mu\);
- For the expectation of the sample mean we have: $$\mathbb{E}\left[M_n\right] = \frac{\mathbb{E}\left[X_1\right] + \ldots +\mathbb{E}\left[X_n\right]}{n} = \frac{n\mu}{n} = \mu$$
- For the variance of the sample we have: $$Var\left[M_n\right] = \frac{Var\left[X_1\right] + \ldots +Var\left[X_n\right]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
- By the application of the Chebyshev’s inequality we have: $$ P(\mid M_n – \mu \mid \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}$$ for any (fixed) \(\epsilon > 0\), as \(n\) increases, the right side of the inequality goes to zero. Intuitively, this means that for a large \(n\) the concentration of the distribution of \(M_n\) will be around \(\mu\).
Improving on Markov’s and Chebyshev’s with Chernoff Bounds
Before getting into the Chernoff bound, let’s understand the motivation behind it and how one can improve on Chebyshev’s bound. To understand it, we first need to understand the difference between a pairwise independence and mutual independence. For the pairwise independence, we have the following for A, B, and C:
$$
P(A \cap B) = P(A)P(B) \\
P(A \cap C) = P(A)P(C) \\
P(B \cap C) = P(B)P(C)
$$
Which means that any pair (any two events) are independent, but not necessarily that:
$$
P(A \cap B\cap C) = P(A)P(B)P(C)
$$
which is called “mutual independence” and it is a stronger independence. By definition, the mutual independence assumes the pairwise independence but the opposite isn’t always true. And this is the case where we can improve on Chebyshev’s bound, as it is not possible without doing these further assumptions (stronger assumptions leads to stronger bounds).
We’ll talk about the Chernoff bounds in the second part of this tutorial !
Is there is a typo in equation 9? Should be $\frac{\sigma^2}{n}$ right?
You’re right, thanks for seeing this, will fix it now.
Hi, Thanks for the nice post. Is there a simple illustration of the Concentration Inequalities may be used to analyze a commonly employed/popular ML algorithm (not related the PAC Theory, etc.)? I shall be grateful, if you could point to one. Thanks.