From cf705f02b242ae198dcb6a76c404990937feb52e Mon Sep 17 00:00:00 2001 From: lshprung Date: Mon, 18 May 2020 11:28:37 -0700 Subject: Post-class 05/18 --- 05-15.md | 4 ++ 05-18.md | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 05-18_img1.png | Bin 0 -> 27224 bytes 05-18_img2.png | Bin 0 -> 139860 bytes 4 files changed, 143 insertions(+) create mode 100644 05-18.md create mode 100644 05-18_img1.png create mode 100644 05-18_img2.png diff --git a/05-15.md b/05-15.md index 9a34c34..61dace5 100644 --- a/05-15.md +++ b/05-15.md @@ -97,3 +97,7 @@ binary heaps work because of **no holes**, and are a great example of an implici - Generally speaking, a heap can be used whenever you need **quick access** to the largest (or smallest) item, because that item will always be the first element in the array or at the root of the tree - It's not good for searching, since the remainder of the array is kept partially unsorted - Some examples of heap: priority queue, Huffman coding, heap sort, etc. + +--- + +[05/18 ->](05-18.md) diff --git a/05-18.md b/05-18.md new file mode 100644 index 0000000..d183631 --- /dev/null +++ b/05-18.md @@ -0,0 +1,139 @@ +[\<- 05/15](05-15.md) + +--- + +# An Application of Binary Trees: Huffman Coding + +## Encoding + +- Examples + - Text + - Video + - Image + - Audio + +- Computer only understands low layer bits (0s and 1s) + +- Encoding is turning those examples into the low layer bits + +### An Application - Encoding + +- An example - using only 1 and 0s to represent following cases + - true or false - only need 1 bit + - 1 is true + - 0 is false + - 4 directions: south, north, east, and west - need 2 bits + - 00 is south + - 01 is north + - 10 is east + - 11 is west + +- ASCII is a fixed width encoding. Every character requires 8 bits to encode + +- Motiviation: Can we save some bits in encoding? +- Thoughts: + - some characters (like e and n) occur frequently. whereas characters (like q and z) occur very infrequently + - Do you still remember probability search (move to front heuristic)? + +### One Example + +- Ex. "the fat cat sat on the mat" + - If we use fixed-length binary code to represent the above sentence, what will that be? 26\*8 + - Can we save some bits by using fewer bits for more frequent letters? + - Frequency of letters: + - a:4, c:1, e:2, f:1, h:2, m:1, n:1, o:1, s:1, t:6, " ":6 + - Bits of letters + - t & " ", frequently occur - 1 bit + - a, h, e, less frequently - 2bits + - etc. + +## Huffman Coding + +- Huffman Coding: **variable width encoding** + - We are likely to **save on the total number of bits** required to encode a file/document if we represented frequently occurring character with fewer bits and infrequently occurring characters with more than 8 bits + +- Question: How do we know, given a variable length encoding, where characters stop and start? + - ex: if e=11, z=1111... so what's 111111? + +- Answer: We use **unique prefixes**. Each character will have its own unique prefix -> how to implement? + - We build a binary tree. By having only leaf nodes to represent characters, we are able to guarantee unique prefixes + +## Huffman Tree + +- Use binary trees (i.e. Huffman tree). e.g. 'ABCA' + +![example tree](05-18_img1.png) + +- B: 00 +- C: 01 +- A: 1 + +- Question: How do we determine the optimal encoding given the frequencies of the characters? +- Answer: Use the leaf nodes that are farther away from the root to represent characters appear less frequently + +- so now you have the "Huffman tree" +- label the left and right sub-branches of every node as 0 and 1 respectively + +## Huffman Coding + +- Ex. "the fat cat sat on the mat" + - a:4, c:1, e:2, f:1, h:2, m:1, n:1, o:1, s:1, t:6, " ":6 + - create a node for each letter and sort according to their values + - remove two minimum nodes, combine with a new node, and insert this node with a new weight - the combined value of its two child nodes + - Is it better than the fixed length encoding? + +![example tree](05-18_img2.png) + +- This is our Huffman Tree +- Our letters translate to: + - a -> 011 + - c -> 00000 + - e -> 0100 + - f -> 00001 + - h -> 0101 + - m -> 00010 + - n -> 00011 + - o -> 0010 + - s -> 0011 + - t -> 10 + - " " -> 11 + +- This encoding only requires 80 bits (which significantly saves the number of bits needed for encoding) + +--- + +### Code Structure + +- A struct node includes: + - `struct node *parent` + - `int count` + +1. Open file + - count frequency of each character + - need a while loop, need to use `getc()` + - put characters in an array, assigning them an index reflecting their ascii code + - the last slot on the array will be the EOF + - the length of the array will be 257 (2^8 + 1) + +2. `NODE *mknode(int data, NODE *left, NODE *right);` + - We can use this function to generate the leaf nodes as well + - also need node comparison + +``` +NODE *mknode(int data, NODE *left, NODE *right){ + NODE *np = malloc(...); + assert(...); + np->parent = NULL; + np->count = data; + //connect +} +``` + +3. `pq = createqueue()` + - call addEntry: `addEntry(pq, nodes[i]);` + - `l = removeEntry(pq);` + - `r = removeEntry(pq);` + - `new = mknode(l->count + r->conut, l, r)` + - `addEntry(pq, np);` + +4. print diff --git a/05-18_img1.png b/05-18_img1.png new file mode 100644 index 0000000..69deb3f Binary files /dev/null and b/05-18_img1.png differ diff --git a/05-18_img2.png b/05-18_img2.png new file mode 100644 index 0000000..bfc05fb Binary files /dev/null and b/05-18_img2.png differ -- cgit