And let's suppose we have good statistics in our application domain about
exactly how frequent each of these symbols is.
So, in particular, let's assume we know that A is by far the most likely symbol.
Let's say 60% of the symbols are going to be As,
whereas 25% are Bs, 10% are Cs, and 5% are Ds.
So why would you know the statistics?
Well, in some domains you're just going to have a lot of expertise.
In genomics you're going to know the usual frequencies of As, Cs, Gs and Ts.
For something like an MP3 file, well, you can literally just take an intermediate
version of the file after you've done the analog-to-digital conversion, and
just count the number of occurrences of each of the symbols.
And then you know the exact frequencies, and you're good to go.
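
To make the counting step concrete, here is a minimal Python sketch. The function name symbol_frequencies and the sample string are hypothetical, chosen only for illustration; a real file would first have to be split into symbols in whatever way the application defines them.

```python
from collections import Counter


def symbol_frequencies(symbols):
    """Return the fraction of occurrences of each distinct symbol."""
    counts = Counter(symbols)          # occurrences of each symbol
    total = sum(counts.values())       # total number of symbols seen
    return {s: c / total for s, c in counts.items()}


# Hypothetical sample data: 6 As, 3 Bs, 1 C.
sample = "AAAAAABBBC"
freqs = symbol_frequencies(sample)
for symbol in sorted(freqs):
    print(symbol, freqs[symbol])
```

One pass over the data is enough, so the counting takes time linear in the length of the input.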
So let's compare the performance of the obvious fixed-length code,
where we use 2 bits for each of the 4 characters, with that of the
variable-length, prefix-free code that we mentioned on the previous slide.
And we're going to measure the performance of these codes by looking at
how many bits, on average, you need to encode a character,
where the average is over the frequencies of the four different symbols.
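
As a rough sketch of that measurement, the Python below computes the frequency-weighted average codeword length for both codes. The frequencies are the ones given above. The specific codewords are assumptions: the previous slide's code is not reproduced here, so A=0, B=10, C=110, D=111 is used as a stand-in variable-length prefix-free code, and the 2-bit assignment for the fixed-length code is likewise arbitrary. The helper names are hypothetical.

```python
# Symbol frequencies stated in the lecture.
freqs = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}

# Assumed codeword assignments (illustrative, not taken from the slide).
fixed_code = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_code = {"A": "0", "B": "10", "C": "110", "D": "111"}


def is_prefix_free(code):
    """Check that no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def average_bits(freqs, code):
    """Average bits per character: sum of frequency times codeword length."""
    return sum(p * len(code[s]) for s, p in freqs.items())


fixed_avg = average_bits(freqs, fixed_code)
variable_avg = average_bits(freqs, variable_code)
print("prefix-free:", is_prefix_free(variable_code))
print("fixed-length average:", round(fixed_avg, 2))
print("variable-length average:", round(variable_avg, 2))
```

Under these assumptions the fixed-length code costs 2 bits per character, while the variable-length code costs 0.60·1 + 0.25·2 + 0.10·3 + 0.05·3 = 1.55 bits per character.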