[\<- 04/20](04-20.md) --- ## Linear Probing (Search) - Okay, we know how to insert, how do we find something? - If the element is in the hash table - (m = 11, `h(k,i) = (h(k) + i) % m`) - Keys 11, 13, and 22 were inserted already - How to search for 22? - If the element is not in the hash table - How to search 44? - When to stop? Visual Representation of the Hash Table: |||||||||||| |-|-|-|-|-|-|-|-|-|-|-| |11|22|13| | | | | | | | | - We use the same algorithm for searching as we did for insertion. We probe. If we come to an empty spot, we can stop and say the key is not in the table - So what's the worst-case search cost? why? - e.g. m = 11 |||||||||||| |-|-|-|-|-|-|-|-|-|-|-| |11|22|13|14|15|33| | | | | | - Searching for 33 - 33 % 11 = 0 -> not there - i++ -> not there - i++ -> not there - etc. - since it will try all the indices, the big-O runtime will be O(n) - Searching for 44 - 44 % 11 = 0 -> not there - i++ -> not there - i++ -> not there - etc. until it gets to an empty slot - since it will try all the indices (and then one more), the big-O runtime will be O(n) ## Linear Probing (problems) - Linear probing is prone to **primary clustering** (long run of filled locations) - \ # Efficiency Analysis of Linear Probing ## Assumption - A reasonable assumption for hashing behavior. **Simple Uniform Hashing** - In simple uniform hashing, **keys are equally likely to hash to any location**. With a decent hash function, this is a reasonable assumption - For example, for a hash table half filled, there is a 50% chance collision will occur, and 50% chance collision will not occur - Probability of collision is referred to as "alpha" - Probability of non-collision is referred to as "1 - alpha" |||||||||||| |-|-|-|-|-|-|-|-|-|-|-| |filled| |filled| |filled| |filled| |filled| |filled| ## Expected Insertion Cost - Assuming **simple uniform hashing** what is the **expected insertion cost**? - Show what happens for half-full table, 90% full table - Let alpha = n/m (the **load factor** of the hash table) 0 <= alpha <= 1 - Percent of time you'll have a collision? - How many times you have to try for a successful insertion? - i.e., no collisions, 1 collision, 2 collisions... - This is a geometric series and converges to: - **1/(1 - alpha)** - **100-slot table** or **100,000-slot table** with alpha = 0.9, still takes on average 10 attempts |#Collisions|Probability|#Of Tries| |-----------|-----------|---------| |0 |(1 - alpha)|1 | |1 |alpha(1 - alpha)|2 | |2 |((alpha)^2)(1 - alpha)|3| |3 |((alpha)^3)(1 - alpha)|4| |n |((alpha)^n)(1 - alpha)|n+1| - Total # of tries should be: - (1 - alpha) + 2(alpha)(1 - alpha) + 3(alpha^2)(1 - alpha) + ... + (n+1)(alpha^n)(1 - alpha) - It can be rewritten as (1 - alpha)(1 + 2(alpha) + 3(alpha^2) + 4(alpha^3) + ... + (n + 1)(alpha^n)) - You might recognize that this is a geometric series that gives us: - derivative of (alpha/(1 - alpha)) -> (1/(1-alpha)^2) - 1/(1 - alpha) ## Expected vs. Worst Case - If alpha = 0.5, we expect it to take 2 tries to insert a key. This does not (directly) depend on the number of keys present - Therefore we say that the expected case run time for isnertion is O(1). (And therefore also for search ... we're simplifying here... you'll cover more in algorithms) - Expected is good for many applications, but not e.g., missile defense... the worst case is still O(n) with linear probing |Set|Hash| |---|----| |**Search/Find**|O(n) (expected O(1))| |**Add**|O(n) (expected O(1))| ## Probing - Linear probing is simple but has the disadvantage of **primary clustering** - What can we do? - **Quadratic Probing**! --- ## Quadratic Probing - Hash Function - `h(k, i) = (h(k) + i^2) % 2` - In general: - `h(k, i) = (h(k) + (c1 * i) + (c2 * i^2)) % m` - Ex. - let m=13 - let `h(k, i) = (h(k) + i^2) % m` - let k = 3, 4, 26, 2, 66, 0 - k = 3 -> insert in index 3 - k = 4 -> insert in index 4 - k = 26 -> insert in index 0 - k = 2 -> insert in index 2 - k = 66 -> insert in index 1 - k = 0 -> insert in index 0 (collision) - i=1 -> insert in index 1 (collision) - i=2 -> insert in index 4 (collision) - i=3 -> insert in index 9 - Essentially, we are increasing the step size ## Quadratic Probing (Problem 1) - ex. m = 4 - `h(k) = (k + i^2) % m` - k = 0, 1, 4 - 4 will keep colliding with 0, then 1 - k = 0 -> insert in index 0 - k = 1 -> insert in index 1 - k = 4 -> insert in index 0 (collision) - i=1 -> insert in index 1 (collision) - i=2 -> insert in index 0 (collision) - i=3 -> insert in index 1 (collision) - i=4 -> insert in index 0 (collision) - :( - A problem with quadratic probing is that you are **not guaranteed** to find an empty slot, even though one exists - \ ## Quadratic Probing (Problem 2) - Ex. - let m = 13 - let `h(k, i) = (h(k) + i^2) % 2` - let k = 3, 4, 26, 2, 66, 0, 39, 13 - k = 3 -> insert in index 3 - k = 4 -> insert in index 4 - k = 26 -> insert in index 0 - k = 2 -> insert in index 2 - k = 66 -> insert in index 1 - k = 0 -> insert in index 0 (collision) - i=1 -> insert in index 1 (collision) - i=2 -> insert in index 4 (collision) - i=3 -> insert in index 9 - k = 39 -> insert in index 0 (collision) - i=1 -> insert in index 1 (collision) - i=2 -> insert in index 4 (collision) - i=3 -> insert in index 9 (collision) - i=4 -> insert in index 3 (collision) - i=5 -> insert in index 12 - k = 13 -> insert in index 0 (collision) - i=1 -> insert in index 1 (collision) - i=2 -> insert in index 4 (collision) - i=3 -> insert in index 9 (collision) - i=4 -> insert in index 3 (collision) - i=5 -> insert in index 12 (collision) - i=6 -> insert in index 10 - Quadratic probing avoids the problem of primary clustering, but instead has **secondary clustering** - **Secondary clustering** is a long "run" of filled locations **along a probe sequence** - If many keys hash to the same location, those collisions will all fill the same probe sequence - \ - \ - Is there a way to address the secondary clustering problem? ## Double Hashing - `h(k, i) = (h1(k) + h2(k) * i) % m` - \ - The **probe increment** is not fixed, but **is determined by the key itself** - Why use double hashing? - It avoids secondary clustering - Again, there is no guarantee we will find an empty slot, unless we pick our constants very carefully --- [04/24 ->](04-24.md)