From 183f93c9c02e10001275229536d789c5c304a1fd Mon Sep 17 00:00:00 2001 From: lshprung Date: Mon, 20 Apr 2020 10:02:52 -0700 Subject: Post-class 04/20 --- 04-20.md | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 04-20.md (limited to '04-20.md') diff --git a/04-20.md b/04-20.md new file mode 100644 index 0000000..7514236 --- /dev/null +++ b/04-20.md @@ -0,0 +1,135 @@ +[\<- 04/15](04-15.md) + +--- + +## Review What We Have Learned So Far + +|Basic Operation|Unsorted Array|Sorted Array| +|---------------|--------------|------------| +|**Search Find**|O(n) |O(log(n)) | +|**Insert** |O(1) |O(n) | +|**Delete** |O(1) |O(n) | +|**Min/Max** |O(n) |O(1) | + +- Search Find + - Unsorted would use sequential search + - Sorted would use binary search + +- Insert + - Unsorted: you can just insert at the end + - Sorted: need to determine where to insert so that the array stays ordered + +- Delete + - Unsorted: can just swap the last entry in for the deleted entry and make the array length minus 1 + - Sorted: need to shift all indices after deleted index down + +- Min/Max + - Unsorted: Need to traverse entire array + - Sorted: Min and max are the first and last indices + +## Making it Faster + +- Let's work on making searching in an array even faster: O(1) + +## Hashing + +- A key-to-address mapping process + +- Ex. + - We have keys + - x + - y + - We have address + - 000 + - 001 + - 002 + - ... + - 100 + - We have a function `h()` + - `h(x)` is the **address** for **key** x, `h(y)` is the **address** for **key** y + - to search: `a[h(x)]`... O(1) + +## Outline + +- Q: I want to know if someone recieved a given score x on the exam. Each score should range from 0 to 100. How could I do this quickly? +- A: Have an array in which each score has its own slot +- What is the hashing method `h(x)`? + +``` +h(x) = x; +``` + +- It's called **direct hashing**! + +## Example 2 + +- Let's try something else. We're reading words from a book, so let's ask if all the letters were used +- Q: What if our inputs were letters? How to construct `h()`? + - We have an array from index 0 to 51. The first half is for lower case letters. The second half for upper case letters + - 'a' should go to index 0 + - 'b' should go to index 1... etc + - We will take advantage of the ASCII table + +``` +if('a' <= x <= 'z') h(x) = x-'a'; +if('A' <= x <= 'Z') h(x) = x-'A'+26; //offset to the second half of the array +``` + +- The Big-O runtime of this is O(1) +- This is so-called **subtraction method** + +## Example 3 + +- Ex. What if I want to know whether someone in the university earns a certain salary? And I know that the lowest paid person is #30k and highest paid is $200k, and salaries are in increments of $1k... so... +- Q: How would you map them? + - There should be an array where index 0 is 30k, 1 is 31k, 2 is 32k, etc. The final index is 200k. + +``` +h(x) = (x - 30k)/1k +``` + +## Concepts + +- What we're doing in these examples is trying to find a function, `h(x)`, that takes as inputs some arbitrary data and maps them onto 0 to m-1, which we can then use ar array indices +- This technique is called **hashing** +- The array indexed by such a function is called a **hash table** +- The function itself is called the **hash function** +- It turns out that if we know the keys in advance then we can always derive a **perfect hash function** (each key hashes to its own unique location). **One-to-one mapping** + +## Example 4 + +- What if I wanted to use you ID numbers, but map them to no more than 100 slots? + - Using the first digit... is terrible (will not work) + - Using the last digit... a little better, but would only work with at most 10 students + - Using the last two digits... this might work, but there is **hashing collision** if two students have the same last two digits + - Multiply by 17, divide by 97 and take the remainder... maybe better, but no guaruntee that there won't be collision + +In practice, we don't know the keys in advance, it is very hard to derive a perfect hash function + +## A New Hashing Method - Modulo Arithmetic + +- The simplest way to make sure a hash function always maps within 0..m-1 is to take the result and take the remainder fo dividing by m +- Since our hash function needs to yield values between "0" and "m-1" (m is max), we usually do: `h(x) % m` +- We like to pick m as a **prime number**, since it will generate less collisions (you don't have to understand why) + +## Modulo Arithmetic + +- Let's have a simple example + - m = 7 + - k = 8, 14, 12, 2, 4 + - `h(k) = k % m` + +- h(8) -> go to index 1 +- h(14) -> go to index 2 +- h(12) -> go to index 5 +- h(2) -> go to index 2 +- h(4) -> go to index 4 + +- If we insert 11... + - h(11) -> go to index 4 + - There is multiple keys mapped to the same index now -> collision :( + +## Collision Resolutions + +- Even if we choose a good hash function that does a good job distributing the keys across the entire hash table, we will still have collisions in practice and need to deal with them +- The next step will be to cover **Collision Resolutions**... -- cgit