and, since $S_n$ is the space of possible permutations of size $n$ (of which there are $n!$), then $H(\pi) = \log_2 n!$, and then $m$ must satisfy
$$
m \gtrsim \log_2 n! \sim n\log_2 n.
$$
In other words, the number of queries (or comparisons, or whatever) must be approximately at least as large as $n\log n$, asymptotically (up to constant multiples). In fact, this proves a slightly stronger statement that no probabilistic algorithm can succeed with nonvanishing probability if the number of queries is not on the order of $n\log n$, which our original proof above does not cover!
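As a quick numerical sanity check, here is a minimal Python sketch (just an illustration; the `count_comparisons` helper is made up for this post) that counts the comparisons Python's built-in sort makes on a random permutation and compares the total against $\log_2 n!$:

```python
import math
import random

# Count the comparisons that Python's built-in sort makes on a random
# permutation and compare the total against the log2(n!) lower bound.
# (Purely illustrative sketch.)

def count_comparisons(values):
    count = 0

    class Tracked:
        def __init__(self, v):
            self.v = v

        def __lt__(self, other):
            nonlocal count
            count += 1
            return self.v < other.v

    sorted(Tracked(v) for v in values)
    return count

n = 1000
lower_bound = sum(math.log2(k) for k in range(2, n + 1))  # this is log2(n!)
observed = count_comparisons(random.sample(range(n), n))
print(f"log2(n!) is about {lower_bound:.0f}, comparisons used: {observed}")
```

For $n = 1000$, $\log_2 n!$ is roughly 8530, and any good comparison sort has to land somewhere above that floor.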
Now, the last missing piece is showing that the probability of error is bounded from below.
This is a slightly simplified version of the proof presented in Duchi's notes (see section 2.3.2) for the specific case we care about, which requires fewer additional definitions and supporting statements. Let $E$ be the 'error' random variable that is 1 if $\hat\pi \neq \pi$ and 0 otherwise, and let's look at the quantity $H(E, \pi \mid \hat\pi)$:
$$
H(E, \pi \mid \hat\pi) = H(\pi \mid \hat\pi) + H(E \mid \pi, \hat\pi)
$$
by our definition of conditional entropy, and
$$
H(E, \pi \mid \hat\pi) = H(E \mid \hat\pi) + H(\pi \mid E, \hat\pi),
$$
again by the same definition. Since, whenever there is no error, $\hat\pi = \pi$, then $H(\pi \mid \hat\pi, E = 0) = 0$ since $\pi$ is known, so we really have
$$
H(E, \pi \mid \hat\pi) = H(E \mid \hat\pi) + P(E = 1)\, H(\pi \mid \hat\pi, E = 1).
$$
Since $E$ can only take on two values, we have that $H(E \mid \hat\pi) \le H(E) \le \log_2 2 = 1$, and we also have that $H(\pi \mid \hat\pi, E = 1) \le \log_2 n!$, which gives
$$
H(E, \pi \mid \hat\pi) \le 1 + P(E = 1)\log_2 n!.
$$
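As a quick aside, the first of these bounds is just the familiar fact that a single yes/no variable carries at most one bit: writing $p = P(E = 1)$,
$$
H(E) = -p\log_2 p - (1 - p)\log_2(1 - p) \le \log_2 2 = 1,
$$
with equality exactly when $p = 1/2$.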
Now, we have that
$$
H(E, \pi \mid \hat\pi) \ge H(\pi \mid \hat\pi) \ge H(\pi \mid Y).
$$
The first inequality follows from the fact that we're removing a variable and the second follows from statement 3 in the previous section (as $\hat\pi$ is a function of the query answers $Y$). Using the definition of conditional entropy, then we have
$$
H(\pi \mid Y) = H(\pi, Y) - H(Y) \ge H(\pi) - H(Y) = \log_2 n! - H(Y) \ge \log_2 n! - m.
$$
The first inequality here follows since we're (again!) removing a variable, the equality follows from the fact that $\pi$ is uniformly randomly drawn from $S_n$, and the last inequality follows from the fact that the entropy of $Y$ is always smaller than that of the uniform distribution on $\{0, 1\}^m$. Finally, note that, if we have $m$ queries, then $Y$ can take on at most $2^m$ values (this is the number of possible values a sequence of $m$ binary queries can take on). So, $H(Y) \le \log_2 2^m = m$ (in other words, the maximum amount of information we can get with $m$ binary queries is $m$ bits), so we find
$$
1 + P(E = 1)\log_2 n! \ge H(E, \pi \mid \hat\pi) \ge \log_2 n! - m,
$$
or, after some slight rewriting:
$$
P(E = 1) \ge 1 - \frac{m + 1}{\log_2 n!},
$$
as required!
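To get a feel for the numbers, here is a tiny (purely illustrative) Python sketch that evaluates the right-hand side of the bound for a fixed $n$ and a few query budgets $m$:

```python
import math

# Evaluate the lower bound P(error) >= 1 - (m + 1) / log2(n!) for a fixed n
# and a few query budgets m. (Purely illustrative sketch.)

def error_lower_bound(n, m):
    log2_factorial = sum(math.log2(k) for k in range(2, n + 1))  # log2(n!)
    return max(0.0, 1.0 - (m + 1) / log2_factorial)

n = 1000
for m in (100, 1000, 4000, 8000, 9000):
    print(f"m = {m:5d}:  P(error) >= {error_lower_bound(n, m):.3f}")
```

For $n = 1000$, the bound only vanishes once $m$ approaches $\log_2 1000! \approx 8530$.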
I think overall that Fano's inequality is a relatively straightforward way of justifying a ton of the statements one can make about information theory without needing to invoke a large number of more complicated notions. Additionally, the proof is fairly simple (in the sense that it requires only a handful of definitions and properties of the entropy) while also matching our intuition about these problems pretty much exactly.
In particular, we see that sorting is hard not because somehow ordering elements is difficult, but because we have to decide between a bunch of different items (in fact, $n!$ of them) while only receiving a few bits of information at any point in time! In fact, this bound applies to any yes/no query that one can make of the sorted data, not just comparisons, which is interesting.
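To see why $n!$ alternatives really do translate into roughly $n\log_2 n$ bits, note that, since $\log_2$ is increasing,
$$
\log_2 n! = \sum_{k = 1}^{n} \log_2 k \ge \int_1^n \log_2 x \, dx = n\log_2 n - \frac{n - 1}{\ln 2},
$$
while also $\log_2 n! \le n\log_2 n$, so the number of bits needed to pin down the permutation grows like $n\log_2 n$.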
There are some even more powerful generalizations of Fano's inequality which can also be extended to machine learning applications: you can use them to show that, given only some small amount of data, you cannot decide between a parameter that correctly describes the data and one that does not.
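For reference, the form these generalizations usually take (and essentially the one in Duchi's notes) is stated in terms of mutual information: if $V$ is drawn uniformly from a finite set $\mathcal{V}$ and the estimate $\hat V$ depends on $V$ only through some observed data $X$, then
$$
P(\hat V \neq V) \ge 1 - \frac{I(V; X) + 1}{\log_2 |\mathcal{V}|},
$$
so if the data carry only a little information about which element of $\mathcal{V}$ generated them, no estimator can identify it with high probability.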
This is all to say that, even though entropy is a magical quantity, that doesn't mean we can't say very rigorous things about it (and make our intuitions about lower bounds even more rigorous, to boot!).
[1] In fact, the entropy is really a measure of how close a variable is to the uniform distribution, in the case of compact domains: the higher the entropy, the closer it is.