Next, we'll look at Huffman encoding. This is a classic algorithm that is still widely and effectively used today, and it's definitely on the list of algorithms that everyone should know, because it's quite ingenious. We start out with the idea of variable-length codes. So far, we've talked about codes where every character is represented with the same number of bits, but there's no reason to do that.
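To put rough numbers on the potential savings, here's a toy count. Everything in it is made up for illustration: the message, the assumption that a 5-character alphabet gets 3 bits per character under a fixed-length code, and the variable-length code itself.

```java
import java.util.Map;

public class CodeLengthDemo {
    // A made-up variable-length code: the frequent 'A' gets 1 bit,
    // the rarer letters get 3 bits each.
    static final Map<Character, String> CODE = Map.of(
            'A', "0", 'B', "111", 'R', "110", 'C', "100", 'D', "101");

    // Total bits the variable-length code spends on a message.
    static int variableBits(String msg) {
        int bits = 0;
        for (char c : msg.toCharArray())
            bits += CODE.get(c).length();
        return bits;
    }

    public static void main(String[] args) {
        String msg = "ABRACADABRA";   // hypothetical message
        int fixed = msg.length() * 3; // 5 distinct characters fit in 3 bits
        System.out.println(fixed + " bits fixed-length vs "
                + variableBits(msg) + " bits variable-length");
    }
}
```

With these made-up frequencies, the fixed-length code spends 33 bits on the message and the variable-length code spends 23.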

For example, look at Morse code. The way dots and dashes are assigned to letters is that frequently used letters get shorter codes: an E, for example, is just a single dot. Letters that are used less frequently, like Q, get longer codes, with more dashes and more characters in total. So this is a compression idea: use fewer code characters for frequently used characters. Now, there's an issue with this, and that is that it can be

ambiguous. Take a message everybody knows: it looks like SOS, three dots followed by three dashes followed by three dots. But actually it could be any one of several other messages. V is dot-dot-dot-dash, and seven is dash-dash-dot-dot-dot, so it could be V7, or either of two other readings; there are lots of different things that this sequence could be. So Morse code is actually not just dots and dashes: there is a little space between the code words, to avoid the ambiguity caused by the fact that one code word can be a prefix of another, with no way to tell that it's not without separating the code words. That's the way the problem is solved in Morse code, but it's not necessarily the most efficient way to solve it, and that's what Huffman codes address. Another problem with Morse code: don't try to

transmit a lot of numbers with Morse code, because the codes for the digits are pretty long, and messages that involve a lot of numbers end up with much longer representations. But there's a really elegant way to avoid ambiguity, and that is to adopt a rule that no code word may be a prefix of another. Fixed-length codes do that: if every code is the

same number of bits and every character gets a different code, then no code word is a prefix of another; they're just all different. Another thing you can do is use a special stop character, like the space in Morse code, to end each code word; that's really what the space is doing. The code words are all different, and each ends with the special character, so none can be a prefix of another. But there's also an easy way to just make sure that you use a code that has this prefix-free property. For example, in this code here

no code word is a prefix of another, and that means that given the bits, there's a unique way to decode them. Here, the compressed bitstream starts with a zero, and the only way that can happen is if the first letter is an A. Once you're done with the A, you've got four ones, which have to be a B. Get rid of the B, and the three ones and a zero that follow have to be an R, and so forth. Since no code word is a prefix of another, you can just read the characters off the bit string, without any special stop character and without the inefficiency it would cause: all we have is the bits, and we are able to uniquely decode them. And there are numerous prefix-free codes. For example, here's

another prefix-free code for the same five characters, and it actually uses fewer bits. So here is the idea of Huffman encoding: within these rules, where we must have a prefix-free code, and given a particular message, what's amazing about Huffman encoding is that it finds the code that uses the fewest bits for that message. That's why it's so effective. One

interesting thing about prefix-free codes is that they're really easy to represent with a trie. The idea is very simple. We put the characters in the leaves; this is a binary trie, so there are only two links per node, and we imagine that every left link is labeled zero and every right link is labeled one. Then a code word is just the path from the root to a leaf: say we go right, then left, then left; that's one, zero, zero, which is D. Those bits are not stored in the trie; we keep track of them implicitly, taking a zero when we go left and a one when we go right. So a trie for a prefix-free code uniquely determines the code, and that means that, just working with the trie, we can decompress a bit string by starting at the root, reading the bits, letting them take us where they take us, and, when we get to a leaf, we've got a character. So

that's the trie corresponding to the one code, and that's the trie corresponding to the other. This one starts out with one, one, so starting at the root we go right, right, and the first letter is an A. Then the next bits are zero, zero, so we go left, left and come to a B, and so forth. It's a simple representation of a prefix-free code that makes for easy decoding: given the bit string and the trie, it's pretty easy to decode. You could also keep a symbol table of character–codeword pairs, but that's for compression; what I just described was expansion.
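The two directions just mentioned can be sketched in code: expansion walks the trie, and compression uses a table built from it. The trie here (A = 0, B = 10, C = 11) is a made-up example, not the one from the slides.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixFreeDemo {
    // Trie node: characters live only in the leaves.
    static final class Node {
        final char ch;          // unused in internal nodes
        final Node left, right; // left link = 0, right link = 1
        Node(char ch, Node left, Node right) {
            this.ch = ch; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Expansion: start at the root, go left on '0' and right on '1';
    // at a leaf, emit the character and return to the root.
    static String expand(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node x = root;
        for (char b : bits.toCharArray()) {
            x = (b == '0') ? x.left : x.right;
            if (x.isLeaf()) { out.append(x.ch); x = root; }
        }
        return out.toString();
    }

    // Compression: build a character -> codeword table from the trie
    // by recording the root-to-leaf path of each character.
    static void buildTable(Node x, String path, Map<Character, String> table) {
        if (x.isLeaf()) { table.put(x.ch, path); return; }
        buildTable(x.left, path + "0", table);
        buildTable(x.right, path + "1", table);
    }

    public static void main(String[] args) {
        // A small made-up trie: A = 0, B = 10, C = 11.
        Node root = new Node('\0',
            new Node('A', null, null),
            new Node('\0', new Node('B', null, null), new Node('C', null, null)));

        Map<Character, String> table = new HashMap<>();
        buildTable(root, "", table);

        StringBuilder bits = new StringBuilder();
        for (char c : "ABC".toCharArray()) bits.append(table.get(c));

        System.out.println(bits + " -> " + expand(root, bits.toString()));
    }
}
```

For the message ABC, the table gives the bit string 01011, and expand turns it back into ABC.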

So, for compression, we can just keep the table of codewords and use it; you could also use the trie to figure out a code word by following the path from its leaf up to the root. And for expansion, as I just said, you start out at the root, go left on a zero and right on a one, and it's very easy. We'll look at the code for both, and the first question is how to construct the trie. So let's start with the data type. This is similar to other tree structures that we've built before: we have a character, which we don't use for internal nodes, just for leaves; we've got a frequency, which we use while we're building the trie, not while we're expanding; and every node has a left and a right link. This is the constructor; you just fill in all the fields. We need a method that tells us whether the current node is a leaf, which is the case when both of its links are null. And we need to be able to compare nodes by frequency; we'll see why that's needed later on. So that's the data type for the nodes that we're going to use to build the tries.

Now let's look at expansion. This is, in code, what I just talked about. We have a method that reads in part of the bit string and converts it to a trie; that's one clever part of the algorithm, and we'll get to it. The next thing in the bit string is the number of characters in the message, so we get the trie, then we get the number of characters, and then we simply do a for loop over the number of characters, starting at the root each time. While we're not at a leaf, we read a bit; if it's zero, we go left, and if it's one, we go right. As soon as we get to a leaf, we write out the character that we found, go back to the root, and keep going. And that's it: extremely compact code for expanding. Given a bit string, the first thing in it is the trie (we still have to talk about how to read the trie in from a bit string), then the number of characters, and then the simple loop that expands according to the trie representation. That works for any prefix-free code, not just the Huffman code. The running time is obviously linear in the number of bits, because it's just a loop that chews up a bit each time through.

Okay. So, again for any prefix-free code, how are we actually going to write the trie? Well, it's a little clever, but by this time you should be ready for this kind of algorithm. What you can do is traverse the trie in pre-order, and when you come to an internal node, you write a zero.

When you come to a leaf, you write a one followed by the 8-bit character at that leaf. So it's a simple pre-order traversal. This is the recursive program to write out the trie: if you're at a leaf, you write a one followed by the 8-bit character at the leaf, and you're done; if you're not at a leaf, you simply write a zero and then recursively do the two sub-tries. That gives a unique representation of the trie, one that can be used to reconstruct the trie from the bit string. The idea is that typically we're talking about a very long message, while the trie itself is basically just the encodings of all the possible characters in the message, so it tends to be small compared to the length of the message. For, say, a genomic sequence, there would be only four characters, and hopefully the most frequently used ones would be encoded with not that many bits; in any case, the size of the trie itself is going to be much smaller than the length of the message. Reading in the trie requires a little bit of thought, but when we see the code, it is extremely simple: we just reconstruct the trie from its pre-order traversal. This is the readTrie method that we needed; it reads the bits from binary standard input, produces the trie, and is also recursive. If the bit that you read is a one, that means you have to create a leaf; otherwise, you recursively read the left sub-trie and the right sub-trie and make a new node with those two as its left and right, with the frequency initialized to zero.
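The writeTrie/readTrie pair just described can be sketched as follows. To keep the sketch self-contained, a String of '0' and '1' marks plus raw characters stands in for the binary output and input streams of the real code, and the trie is a made-up example.

```java
public class TrieSerialization {
    // Node as described: a character (leaves only), a frequency, two links.
    static final class Node {
        final char ch; final int freq; final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Pre-order write: '1' plus the character at a leaf, '0' at an
    // internal node, then the two sub-tries recursively.
    static void writeTrie(Node x, StringBuilder out) {
        if (x.isLeaf()) { out.append('1').append(x.ch); return; }
        out.append('0');
        writeTrie(x.left, out);
        writeTrie(x.right, out);
    }

    // Recursive read: a '1' means create a leaf from the next character;
    // a '0' means read the left and right sub-tries and make an internal
    // node (character unused, frequency initialized to zero).
    static int pos; // cursor into the serialized string
    static Node readTrie(String in) {
        if (in.charAt(pos++) == '1') return new Node(in.charAt(pos++), 0, null, null);
        Node left = readTrie(in);
        Node right = readTrie(in);
        return new Node('\0', 0, left, right);
    }

    public static void main(String[] args) {
        // Made-up trie: A = 0, B = 10, C = 11.
        Node root = new Node('\0', 0,
            new Node('A', 0, null, null),
            new Node('\0', 0, new Node('B', 0, null, null), new Node('C', 0, null, null)));

        StringBuilder out = new StringBuilder();
        writeTrie(root, out);
        System.out.println(out);   // pre-order encoding: 01A01B1C

        pos = 0;
        Node copy = readTrie(out.toString());
        StringBuilder again = new StringBuilder();
        writeTrie(copy, again);
        System.out.println(out.toString().equals(again.toString())); // true
    }
}
```

writeTrie on the example trie produces 01A01B1C, and readTrie rebuilds a trie that serializes back to the same string.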

In internal nodes, we don't use the character. So if you look at what this code does for this bit string: the first bit is a zero, so it's going to