summaryrefslogtreecommitdiff
path: root/05-18.md
blob: d18363113469677e9342db6b76383ba45a39069f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
[\<- 05/15](05-15.md)

---

# An Application of Binary Trees: Huffman Coding

## Encoding

- Examples
	- Text
	- Video
	- Image
	- Audio

- Computer only understands low layer bits (0s and 1s)

- Encoding is turning those examples into the low layer bits

### An Application - Encoding

- An example - using only 1 and 0s to represent following cases
	- true or false - only need 1 bit
		- 1 is true
		- 0 is false
	- 4 directions: south, north, east, and west - need 2 bits
		- 00 is south
		- 01 is north
		- 10 is east
		- 11 is west

- ASCII is a fixed width encoding. Every character requires 8 bits to encode

- Motiviation: Can we save some bits in encoding?
- Thoughts:
	- some characters (like e and n) occur frequently. whereas characters (like q and z) occur very infrequently
	- Do you still remember probability search (move to front heuristic)?

### One Example

- Ex. "the fat cat sat on the mat"
	- If we use fixed-length binary code to represent the above sentence, what will that be? 26\*8
	- Can we save some bits by using fewer bits for more frequent letters?
		- Frequency of letters:
			- a:4, c:1, e:2, f:1, h:2, m:1, n:1, o:1, s:1, t:6, " ":6
		- Bits of letters
			- t & " ", frequently occur - 1 bit
			- a, h, e, less frequently - 2bits
			- etc.

## Huffman Coding

- Huffman Coding: **variable width encoding**
	- We are likely to **save on the total number of bits** required to encode a file/document if we represented frequently occurring character with fewer bits and infrequently occurring characters with more than 8 bits

- Question: How do we know, given a variable length encoding, where characters stop and start?
	- ex: if e=11, z=1111... so what's 111111?

- Answer: We use **unique prefixes**. Each character will have its own unique prefix -> how to implement?
	- We build a binary tree. By having only leaf nodes to represent characters, we are able to guarantee unique prefixes

## Huffman Tree

- Use binary trees (i.e. Huffman tree). e.g. 'ABCA'

![example tree](05-18_img1.png)

- B: 00
- C: 01
- A: 1

- Question: How do we determine the optimal encoding given the frequencies of the characters?
- Answer: Use the leaf nodes that are farther away from the root to represent characters appear less frequently

- so now you have the "Huffman tree"
- label the left and right sub-branches of every node as 0 and 1 respectively

## Huffman Coding

- Ex. "the fat cat sat on the mat"
	- a:4, c:1, e:2, f:1, h:2, m:1, n:1, o:1, s:1, t:6, " ":6
	- create a node for each letter and sort according to their values
	- remove two minimum nodes, combine with a new node, and insert this node with a new weight - the combined value of its two child nodes
	- Is it better than the fixed length encoding?

![example tree](05-18_img2.png)

- This is our Huffman Tree
- Our letters translate to:
	- a   -> 011
	- c   -> 00000
	- e   -> 0100
	- f   -> 00001
	- h   -> 0101
	- m   -> 00010
	- n   -> 00011
	- o   -> 0010
	- s   -> 0011
	- t   -> 10
	- " " -> 11

- This encoding only requires 80 bits (which significantly saves the number of bits needed for encoding)

---

### Code Structure

- A struct node includes:
	- `struct node *parent`
	- `int count`

1. Open file
	- count frequency of each character
	- need a while loop, need to use `getc()`
	- put characters in an array, assigning them an index reflecting their ascii code
	- the last slot on the array will be the EOF
	- the length of the array will be 257 (2^8 + 1)

2. `NODE *mknode(int data, NODE *left, NODE *right);`
	- We can use this function to generate the leaf nodes as well
	- also need node comparison

```
NODE *mknode(int data, NODE *left, NODE *right){
	NODE *np = malloc(...);
	assert(...);
	np->parent = NULL;
	np->count = data;
	//connect
}
```

3. `pq = createqueue()`
	- call addEntry: `addEntry(pq, nodes[i]);`
	- `l = removeEntry(pq);`
	- `r = removeEntry(pq);`
	- `new = mknode(l->count + r->conut, l, r)`
	- `addEntry(pq, np);`

4. print