From b80e8d3b7de4d5bda0d20f36612279fc62d431ef Mon Sep 17 00:00:00 2001
From: lshprung
Date: Wed, 22 Apr 2020 10:41:03 -0700
Subject: Post-class 04/22

---
 04-20.md |  57 +++++++++++++++++++
 04-22.md | 193 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 04-22.md

diff --git a/04-20.md b/04-20.md
index 7514236..88e0142 100644
--- a/04-20.md
+++ b/04-20.md
@@ -133,3 +133,60 @@ In practice, we don't know the keys in advance, it is very hard to derive a perf

- Even if we choose a good hash function that does a good job distributing the keys across the entire hash table, we will still have collisions in practice and need to deal with them
- The next step will be to cover **Collision Resolutions**...

## Probing

- A simple way of resolving collisions: **probing**
- If a key 'y' collides with a key 'x' that is already in the table, 'x' is **not** displaced. Rather, 'y' must be placed somewhere else
- In probing, the colliding key tries (**probes**) another slot in the table; if that slot is empty, it is placed there, otherwise it keeps trying (**probing**) different slots
- Probing methods covered in this class:
  - Linear Probing
  - Quadratic Probing
  - Double Hashing

## Linear Probing

- We can think of our hash function as having an extra parameter `h(k, i)`, where 'i' is the **probe number** that starts at zero for each key (i.e., it's the number of the try)
- Simplest type of probing:
  - `h(k, i) = (h(k) + i) % m`
  - This is called **linear probing**

## Linear Probing (Insertion)

- If i=0, then `h(k, i) = h(k)`, i.e., the "home" location of the key. If that is occupied, we use the next location to the right, wrapping around to the front of the array
- Example: for m=11 and `h(k) = k % 11`, let's map numbers between 0 and 100 into the array
  - k=20 -> 20 % 11 = 9 -> go to index 9
  - k=37 -> 37 % 11 = 4 -> go to index 4
  - k=15 -> 15 % 11 = 4 -> go to index 4 (but that location is already taken by k=37)
    - i++ -> go to index 5

- Another example:
  - m = 9
  - k = 0, 9, 18, 27, 36, 45, 54
  - k=0 -> go to index 0
  - k=9 -> go to index 0 (but it's taken)
    - i++ -> go to index 1
  - k=18 -> go to index 0 (but it's taken)
    - i++ -> go to index 1 (but it's taken)
    - i++ -> go to index 2
  - k=27 -> go to index 0 (but it's taken)
    - i++ -> go to index 1 (but it's taken)
    - i++ -> go to index 2 (but it's taken)
    - i++ -> go to index 3
  - If all keys map to the same location -> BAD!
  - What is the big-O runtime for inserting a new key (assume there are already 'n' keys in the table)?
    - O(n) (pretty bad)

- In the **worst case**, if all keys hash to the same location, insertion is **O(n)**
- General case **linear probing** (see the code sketch after this list):
  - `h(k, i) = (h(k) + c*i) % m`
  - e.g. c=2, m=7, `h(k, i) = (h(k) + 2*i) % m`, k = 0, 2, 7
    - k=0 -> go to index 0
    - k=2 -> go to index 2
    - k=7 -> go to index 0 (but it's taken)
      - i++ -> go to index 2 (but it's taken)
      - i++ -> go to index 4
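To make the probe loop concrete, here is a minimal Python sketch of linear-probing insertion. It is an illustration, not code given in class: it assumes integer keys, the home hash `h(k) = k % m`, a table stored as a list with `None` marking empty slots, and it ignores duplicates and resizing. The names `HashTable`, `home`, and `insert` are made up for the example.

```python
class HashTable:
    """Open addressing with linear probing: h(k, i) = (h(k) + c*i) % m."""

    def __init__(self, m, c=1):
        self.m = m                # number of slots
        self.c = c                # linear probing step size
        self.slots = [None] * m   # None marks an empty slot

    def home(self, k):
        return k % self.m         # simple division-method hash for integer keys

    def insert(self, k):
        # Try h(k, 0), h(k, 1), ... until an empty slot turns up
        for i in range(self.m):
            idx = (self.home(k) + self.c * i) % self.m
            if self.slots[idx] is None:
                self.slots[idx] = k
                return idx        # the index where the key landed
        raise RuntimeError("no empty slot found (table full, or probe sequence cycled)")


# The m=9 example from above: 0, 9, 18, 27 all hash to index 0
t = HashTable(9)
for key in (0, 9, 18, 27):
    print(key, "->", t.insert(key))   # lands at indices 0, 1, 2, 3
```

With c=1 the loop is guaranteed to examine every slot before giving up; for a general step size c it only visits every slot when gcd(c, m) = 1, which is why the c=2, m=7 example above still works.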
---

[04/22 ->](04-22.md)

diff --git a/04-22.md b/04-22.md
new file mode 100644
index 0000000..58e0510
--- /dev/null
+++ b/04-22.md
@@ -0,0 +1,193 @@
[\<- 04/20](04-20.md)

---

## Linear Probing (Search)

- Okay, we know how to insert; how do we find something?
  - If the element is in the hash table:
    - (m = 11, `h(k,i) = (h(k) + i) % m`)
    - Keys 11, 13, and 22 were inserted already
    - How to search for 22?
  - If the element is not in the hash table:
    - How to search for 44?
    - When to stop?

Visual representation of the hash table:

|0|1|2|3|4|5|6|7|8|9|10|
|-|-|-|-|-|-|-|-|-|-|-|
|11|22|13| | | | | | | | |

- We use the same algorithm for searching as we did for insertion. We probe. If we come to an empty spot, we can stop and say the key is not in the table
- So what's the worst-case search cost? Why?
  - e.g. m = 11

|0|1|2|3|4|5|6|7|8|9|10|
|-|-|-|-|-|-|-|-|-|-|-|
|11|22|13|14|15|33| | | | | |

- Searching for 33
  - 33 % 11 = 0 -> not there
  - i++ -> index 1, not there
  - i++ -> index 2, not there
  - etc., until it finally reaches 33 at index 5
  - since the search may have to walk the entire run of filled slots, the big-O runtime is O(n)
- Searching for 44
  - 44 % 11 = 0 -> not there
  - i++ -> not there
  - i++ -> not there
  - etc., until it reaches the empty slot at index 6 and stops, reporting "not found"
  - since it may try every filled index (and then one more), the big-O runtime is O(n)

## Linear Probing (Problems)

- Linear probing is prone to **primary clustering** (a long run of filled locations)

# Efficiency Analysis of Linear Probing

## Assumption

- A reasonable assumption for hashing behavior: **simple uniform hashing**
- In simple uniform hashing, **keys are equally likely to hash to any location**. With a decent hash function, this is a reasonable assumption
  - For example, for a half-filled hash table, there is a 50% chance a collision will occur, and a 50% chance it will not
  - The probability of a collision is referred to as "alpha"
  - The probability of no collision is referred to as "1 - alpha"

|0|1|2|3|4|5|6|7|8|9|10|
|-|-|-|-|-|-|-|-|-|-|-|
|filled| |filled| |filled| |filled| |filled| |filled|

## Expected Insertion Cost

- Assuming **simple uniform hashing**, what is the **expected insertion cost**?
  - Show what happens for a half-full table and a 90% full table
  - Let alpha = n/m (the **load factor** of the hash table), 0 <= alpha <= 1
  - What fraction of the time will you have a collision?
  - How many times do you have to try for a successful insertion?
    - i.e., no collisions, 1 collision, 2 collisions...
  - This is a geometric series and converges to:
    - **1/(1 - alpha)**
  - Whether it is a **100-slot table** or a **100,000-slot table**, with alpha = 0.9 insertion still takes 10 attempts on average

|# of Collisions|Probability|# of Tries|
|---------------|-----------|----------|
|0 |1 - alpha |1 |
|1 |alpha(1 - alpha) |2 |
|2 |(alpha^2)(1 - alpha) |3 |
|3 |(alpha^3)(1 - alpha) |4 |
|n |(alpha^n)(1 - alpha) |n+1 |

- The expected total # of tries is:
  - (1 - alpha) + 2(alpha)(1 - alpha) + 3(alpha^2)(1 - alpha) + ... + (n+1)(alpha^n)(1 - alpha) + ...
  - It can be rewritten as (1 - alpha)(1 + 2(alpha) + 3(alpha^2) + 4(alpha^3) + ... + (n+1)(alpha^n) + ...)
  - You might recognize the second factor as the derivative of the geometric series alpha + alpha^2 + alpha^3 + ... = alpha/(1 - alpha):
    - derivative of (alpha/(1 - alpha)) -> 1/(1 - alpha)^2
    - multiplying by the (1 - alpha) in front gives 1/(1 - alpha), as claimed (a quick numerical check follows below)
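As a quick sanity check on that algebra, here is a small Python snippet (an illustration, not from the lecture) that sums the series from the table above for a half-full and a 90%-full table and compares it to the closed form 1/(1 - alpha).

```python
# Expected number of tries = sum over i >= 0 of (i + 1) * alpha^i * (1 - alpha),
# which should match the closed form 1 / (1 - alpha)
for alpha in (0.5, 0.9):   # half-full table, 90%-full table
    series = sum((i + 1) * alpha**i * (1 - alpha) for i in range(10_000))
    print(f"alpha = {alpha}: series = {series:.4f}, 1/(1-alpha) = {1 / (1 - alpha):.4f}")

# alpha = 0.5: series = 2.0000, 1/(1-alpha) = 2.0000
# alpha = 0.9: series = 10.0000, 1/(1-alpha) = 10.0000
```

Keep in mind this is the cost under the idealized model the derivation assumes, where every probe lands on a uniformly random slot; real linear probing does somewhat worse at high load factors precisely because of primary clustering.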
+ +--- + +## Quadratic Probing + +- Hash Function + - `h(k, i) = (h(k) + i^2) % 2` + - In general: + - `h(k, i) = (h(k) + (c1 * i) + (c2 * i^2)) % m` + +- Ex. + - let m=13 + - let `h(k, i) = (h(k) + i^2) % m` + - let k = 3, 4, 26, 2, 66, 0 + - k = 3 -> insert in index 3 + - k = 4 -> insert in index 4 + - k = 26 -> insert in index 0 + - k = 2 -> insert in index 2 + - k = 66 -> insert in index 1 + - k = 0 -> insert in index 0 (collision) + - i=1 -> insert in index 1 (collision) + - i=2 -> insert in index 4 (collision) + - i=3 -> insert in index 9 + +- Essentially, we are increasing the step size + +## Quadratic Probing (Problem 1) + +- ex. m = 4 + - `h(k) = (k + i^2) % m` + - k = 0, 1, 4 + - 4 will keep colliding with 0, then 1 + - k = 0 -> insert in index 0 + - k = 1 -> insert in index 1 + - k = 4 -> insert in index 0 (collision) + - i=1 -> insert in index 1 (collision) + - i=2 -> insert in index 0 (collision) + - i=3 -> insert in index 1 (collision) + - i=4 -> insert in index 0 (collision) + - :( + +- A problem with quadratic probing is that you are **not guaranteed** to find an empty slot, even though one exists +- \ + +## Quadratic Probing (Problem 2) + +- Ex. + - let m = 13 + - let `h(k, i) = (h(k) + i^2) % 2` + - let k = 3, 4, 26, 2, 66, 0, 39, 13 + - k = 3 -> insert in index 3 + - k = 4 -> insert in index 4 + - k = 26 -> insert in index 0 + - k = 2 -> insert in index 2 + - k = 66 -> insert in index 1 + - k = 0 -> insert in index 0 (collision) + - i=1 -> insert in index 1 (collision) + - i=2 -> insert in index 4 (collision) + - i=3 -> insert in index 9 + - k = 39 -> insert in index 0 (collision) + - i=1 -> insert in index 1 (collision) + - i=2 -> insert in index 4 (collision) + - i=3 -> insert in index 9 (collision) + - i=4 -> insert in index 3 (collision) + - i=5 -> insert in index 12 + - k = 13 -> insert in index 0 (collision) + - i=1 -> insert in index 1 (collision) + - i=2 -> insert in index 4 (collision) + - i=3 -> insert in index 9 (collision) + - i=4 -> insert in index 3 (collision) + - i=5 -> insert in index 12 (collision) + - i=6 -> insert in index 10 + +- Quadratic probing avoids the problem of primary clustering, but instead has **secondary clustering** +- **Secondary clustering** is a long "run" of filled locations **along a probe sequence** +- If many keys hash to the same location, those collisions will all fill the same probe sequence +- \ +- \ +- Is there a way to address the secondary clustering problem? + +## Double Hashing + +- `h(k, i) = (h1(k) + h2(k) * i) % m` + - \ + +- The **probe increment** is not fixed, but **is determined by the key itself** +- Why use double hashing? + - It avoids secondary clustering +- Again, there is no guarantee we will find an empty slot, unless we pick our constants very carefully -- cgit