2019-06-01
Ok so this is a bit embarrassing. A friend asked me this question while we were going over some concepts from a Big Data class I was taking at the time. Although I had learned about entropy in my physics, computer engineering, and probability classes, I was unable to explain what exactly it was. Honestly, the thought of entropy, cross entropy, and KL divergence used to scare me...it just sounds so technical. However, engineers just love to make easy things seem difficult!
Ok so we are going to talk about this "entropy" term with respect to bits. Say there are 3 people (X, Y, and Z) who share a phone messaging plan. Person X is from generation X and doesn't really use messaging, so he sends around 20% of the messages. Person Y is from generation Y (millennials), so he sends around 30% of the messages. Finally, person Z is from generation Z, and of course he sends 50% of the messages. Let's call this distribution y:
distribution y: X = 0.2
Y = 0.3
Z = 0.5
Now, every time a person in this plan sends a message, the sender is identified through a unique bit sequence. The longer the bit sequence, the more costly it is per message. Let's randomly assign a bit encoding as described below:
encoding #1: X --> (0) 1 bit
Y --> (10) 2 bits
Z --> (11) 2 bits
We can calculate the average number of bits used to send messages in this phone plan based on encoding #1:
(0.2 × 1 bit) + (0.3 × 2 bits) + (0.5 × 2 bits) = 1.8 bits
But can we do better than 1.8 bits? What if we encode the sender using our knowledge of the underlying distribution? For example, Z sends a lot of messages, so let's swap his encoding with X's. Let's look at another encoding.
encoding #2: X --> (11) 2 bits
Y --> (10) 2 bits
Z --> (0) 1 bit
We can calculate the average number of bits used to send messages in this phone plan based on encoding #2:
(0.2 × 2 bits) + (0.3 × 2 bits) + (0.5 × 1 bit) = 1.5 bits
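To make this concrete, here is a minimal Python sketch of this calculation (the avg_bits helper and the dictionary names are just my own labels, not anything standard):

```python
# Probability that each person sends a message (distribution y)
y = {"X": 0.2, "Y": 0.3, "Z": 0.5}

# Bit lengths of each sender's code under the two encodings
encoding_1 = {"X": 1, "Y": 2, "Z": 2}
encoding_2 = {"X": 2, "Y": 2, "Z": 1}

def avg_bits(dist, encoding):
    """Expected number of bits per message under a given encoding."""
    return sum(dist[person] * encoding[person] for person in dist)

print(avg_bits(y, encoding_1))  # 1.8
print(avg_bits(y, encoding_2))  # 1.5
```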
We can see that encoding #2 results in a smaller average number of bits. The question now is: how do we know whether our encoding achieves the smallest possible average encoding size? The answer to this question is exactly "entropy."
We can always find the entropy given a distribution:

H(y) = Σ_i y_i · log2(1/y_i)

In the equation above, log2(1/y_i) represents the number of bits to assign to symbol i, in our case person X, Y, or Z. This makes sense intuitively because if y_i is large, then log2(1/y_i) would be small. This means that we want to use fewer bits to encode a more frequent symbol; in our case, we want to use the least number of bits for person Z.
Let's calculate the entropy for the above distribution y:

H(y) = 0.2 · log2(1/0.2) + 0.3 · log2(1/0.3) + 0.5 · log2(1/0.5) ≈ 0.464 + 0.521 + 0.5 ≈ 1.485 bits

So no encoding can average fewer than about 1.485 bits per message, which means our 1.5-bit encoding #2 is very close to optimal.
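Here is the same calculation as a small Python sketch (again, the function name is my own):

```python
import math

def entropy(dist):
    """Smallest possible average number of bits per message (lossless)."""
    return sum(p * math.log2(1 / p) for p in dist.values())

y = {"X": 0.2, "Y": 0.3, "Z": 0.5}
print(entropy(y))  # ~1.485 bits
```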
For entropy, we effectively used our knowledge of distribution y to assign the bits for each person so that the total cost would be the minimum. But what if we didn't know the correct distribution? What if we thought it was actually distribution x (defined below)? Then cross entropy calculates the average number of bits used when we encode with the wrong distribution:

H(y, x) = Σ_i y_i · log2(1/x_i)
You see that the only difference between the entropy equation and the cross entropy equation is log2(1/x_i). This makes sense since now we are encoding symbol i using log2(1/x_i) bits instead of log2(1/y_i) bits.
distribution x: X = 0.2
Y = 0.5
Z = 0.3

Let's calculate the cross entropy when the true distribution is y but we encode according to x:

H(y, x) = 0.2 · log2(1/0.2) + 0.3 · log2(1/0.5) + 0.5 · log2(1/0.3) ≈ 0.464 + 0.3 + 0.868 ≈ 1.633 bits
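In Python, a minimal sketch of cross entropy under the same assumptions might look like this:

```python
import math

def cross_entropy(true_dist, assumed_dist):
    """Average bits per message when encoding with the wrong distribution."""
    return sum(true_dist[s] * math.log2(1 / assumed_dist[s]) for s in true_dist)

y = {"X": 0.2, "Y": 0.3, "Z": 0.5}  # true distribution
x = {"X": 0.2, "Y": 0.5, "Z": 0.3}  # assumed (wrong) distribution
print(cross_entropy(y, x))  # ~1.633 bits
```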
KL divergence stands for Kullback-Leibler divergence, which is literally just the difference between cross entropy and entropy:

D_KL(y || x) = H(y, x) − H(y)

Remember, cross entropy is always greater than or equal to entropy, and it's shown in the example above: 1.633 bits ≥ 1.485 bits. Basically, KL divergence tells us how many extra bits are used if distribution x is used rather than distribution y; here that's about 1.633 − 1.485 ≈ 0.148 bits per message. In a more general sense, it is a measure of how one probability distribution differs from another probability distribution.
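And as a quick sanity check, here is a sketch of KL divergence, which should match the cross entropy minus the entropy computed above:

```python
import math

def kl_divergence(true_dist, assumed_dist):
    """Extra bits per message paid for assuming the wrong distribution."""
    return sum(true_dist[s] * math.log2(true_dist[s] / assumed_dist[s])
               for s in true_dist)

y = {"X": 0.2, "Y": 0.3, "Z": 0.5}
x = {"X": 0.2, "Y": 0.5, "Z": 0.3}
print(kl_divergence(y, x))  # ~0.148 bits
```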
I was asked by a kind viewer: why should we understand entropy/cross entropy? And what are some practical applications? Well, below is my response:
The "entropy" described in the post is really entropy relating to information theory, thanks to Claude Shannon. A practical situation would be the original problem Claude Shannon was trying to solve, which is: what is the most efficient way to send messages without losing information? The efficiency described here would be the average message length. So given a distribution with random variables (X, Y, Z, ...), the least amount of bits to encode symbol X (a type of message) would be log(1/probability(X)). Then a good feature to summarize/quantify this information would be "entropy," which is the smallest possible average size of lossless encoding of the messages. But really, I think about "entropy" as a characteristic of a distribution. It tells me something about the distribution, for example, if the entropy of a message distribution is high (the average number of bits is large), then this entropy tells me that there is uncertainty or unpredictability of the next message. Inversely, low entropy tells me that the next message is predicable, so there is less disorder.
But more importantly, I think understanding entropy this way helps me understand cross entropy, which is very important in creating and training models for machine learning. For example, if a machine learning model outputs a distribution x, but the true distribution is distribution y, then cross entropy tells me how wrong the model is. Therefore, cross entropy is often used as a loss function, and minimizing the cross entropy will help the model "learn."
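To illustrate (with made-up numbers, and using the natural log as most ML libraries do rather than log2), a confidently wrong prediction costs much more than a mostly correct one:

```python
import math

def cross_entropy_loss(true_dist, predicted_dist):
    """Cross entropy between true labels and model predictions (natural log)."""
    return -sum(t * math.log(p) for t, p in zip(true_dist, predicted_dist) if t > 0)

# Hypothetical 3-class example: the true class is the second one
true_dist = [0.0, 1.0, 0.0]
bad_prediction = [0.5, 0.2, 0.3]
good_prediction = [0.05, 0.9, 0.05]

print(cross_entropy_loss(true_dist, bad_prediction))   # ~1.61 (confidently wrong)
print(cross_entropy_loss(true_dist, good_prediction))  # ~0.11 (mostly right)
```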