[\<- 05/15](05-15.md)

---

# An Application of Binary Trees: Huffman Coding

## Encoding

- Examples
	- Text
	- Video
	- Image
	- Audio

- Computer only understands low layer bits (0s and 1s)

- Encoding is turning those examples into the low layer bits

### An Application - Encoding

- An example - using only 1 and 0s to represent following cases
	- true or false - only need 1 bit
		- 1 is true
		- 0 is false
	- 4 directions: south, north, east, and west - need 2 bits
		- 00 is south
		- 01 is north
		- 10 is east
		- 11 is west

- ASCII is a fixed width encoding. Every character requires 8 bits to encode

- Motiviation: Can we save some bits in encoding?
- Thoughts:
	- some characters (like e and n) occur frequently. whereas characters (like q and z) occur very infrequently
	- Do you still remember probability search (move to front heuristic)?

### One Example

- Ex. "the fat cat sat on the mat"
	- If we use fixed-length binary code to represent the above sentence, what will that be? 26\*8
	- Can we save some bits by using fewer bits for more frequent letters?
		- Frequency of letters:
			- a:4, c:1, e:2, f:1, h:2, m:1, n:1, o:1, s:1, t:6, " ":6
		- Bits of letters
			- t & " ", frequently occur - 1 bit
			- a, h, e, less frequently - 2bits
			- etc.

## Huffman Coding

- Huffman Coding: **variable width encoding**
	- We are likely to **save on the total number of bits** required to encode a file/document if we represented frequently occurring character with fewer bits and infrequently occurring characters with more than 8 bits

- Question: How do we know, given a variable length encoding, where characters stop and start?
	- ex: if e=11, z=1111... so what's 111111?

- Answer: We use **unique prefixes**. Each character will have its own unique prefix -> how to implement?
	- We build a binary tree. By having only leaf nodes to represent characters, we are able to guarantee unique prefixes

## Huffman Tree

- Use binary trees (i.e. Huffman tree). e.g. 'ABCA'

![example tree](05-18_img1.png)

- B: 00
- C: 01
- A: 1

- Question: How do we determine the optimal encoding given the frequencies of the characters?
- Answer: Use the leaf nodes that are farther away from the root to represent characters appear less frequently

- so now you have the "Huffman tree"
- label the left and right sub-branches of every node as 0 and 1 respectively

## Huffman Coding

- Ex. "the fat cat sat on the mat"
	- a:4, c:1, e:2, f:1, h:2, m:1, n:1, o:1, s:1, t:6, " ":6
	- create a node for each letter and sort according to their values
	- remove two minimum nodes, combine with a new node, and insert this node with a new weight - the combined value of its two child nodes
	- Is it better than the fixed length encoding?

![example tree](05-18_img2.png)

- This is our Huffman Tree
- Our letters translate to:
	- a   -> 011
	- c   -> 00000
	- e   -> 0100
	- f   -> 00001
	- h   -> 0101
	- m   -> 00010
	- n   -> 00011
	- o   -> 0010
	- s   -> 0011
	- t   -> 10
	- " " -> 11

- This encoding only requires 80 bits (which significantly saves the number of bits needed for encoding)

---

### Code Structure

- A struct node includes:
	- `struct node *parent`
	- `int count`

1. Open file
	- count frequency of each character
	- need a while loop, need to use `getc()`
	- put characters in an array, assigning them an index reflecting their ascii code
	- the last slot on the array will be the EOF
	- the length of the array will be 257 (2^8 + 1)

2. `NODE *mknode(int data, NODE *left, NODE *right);`
	- We can use this function to generate the leaf nodes as well
	- also need node comparison

```
NODE *mknode(int data, NODE *left, NODE *right){
	NODE *np = malloc(...);
	assert(...);
	np->parent = NULL;
	np->count = data;
	//connect
}
```

3. `pq = createqueue()`
	- call addEntry: `addEntry(pq, nodes[i]);`
	- `l = removeEntry(pq);`
	- `r = removeEntry(pq);`
	- `new = mknode(l->count + r->conut, l, r)`
	- `addEntry(pq, np);`

4. print 

---

[05/22 ->](05-22.md)