summaryrefslogtreecommitdiff
path: root/04-22.md
blob: 8b401f76aabc9f0720d43fff98ba2379622b89d8 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
[\<- 04/20](04-20.md)

---

## Linear Probing (Search)

- Okay, we know how to insert, how do we find something?
	- If the element is in the hash table
		- (m = 11, `h(k,i) = (h(k) + i) % m`)
		- Keys 11, 13, and 22 were inserted already
		- How to search for 22?
	- If the element is not in the hash table
		- How to search 44?
		- When to stop?

Visual Representation of the Hash Table:

||||||||||||
|-|-|-|-|-|-|-|-|-|-|-|
|11|22|13| | | | | | | | |

- We use the same algorithm for searching as we did for insertion. We probe. If we come to an empty spot, we can stop and say the key is not in the table
- So what's the worst-case search cost? why?
	- e.g. m = 11

||||||||||||
|-|-|-|-|-|-|-|-|-|-|-|
|11|22|13|14|15|33| | | | | |

- Searching for 33
	- 33 % 11 = 0 -> not there
	- i++ -> not there
	- i++ -> not there
	- etc.
	- since it will try all the indices, the big-O runtime will be O(n)
- Searching for 44
	- 44 % 11 = 0 -> not there
	- i++ -> not there
	- i++ -> not there
	- etc. until it gets to an empty slot
	- since it will try all the indices (and then one more), the big-O runtime will be O(n)

## Linear Probing (problems)

- Linear probing is prone to **primary clustering** (long run of filled locations)
- \<it takes a long time to get to the end of a cluster, and then you wind up just adding to its length>

# Efficiency Analysis of Linear Probing

## Assumption

- A reasonable assumption for hashing behavior. **Simple Uniform Hashing**
- In simple uniform hashing, **keys are equally likely to hash to any location**. With a decent hash function, this is a reasonable assumption
	- For example, for a hash table half filled, there is a 50% chance collision will occur, and 50% chance collision will not occur
		- Probability of collision is referred to as "alpha"
		- Probability of non-collision is referred to as "1 - alpha"

||||||||||||
|-|-|-|-|-|-|-|-|-|-|-|
|filled| |filled| |filled| |filled| |filled| |filled|

## Expected Insertion Cost

- Assuming **simple uniform hashing** what is the **expected insertion cost**?
	- Show what happens for half-full table, 90% full table
	- Let alpha = n/m (the **load factor** of the hash table) 0 <= alpha <= 1
	- Percent of time you'll have a collision?
	- How many times you have to try for a successful insertion?
		- i.e., no collisions, 1 collision, 2 collisions...
	- This is a geometric series and converges to:
		- **1/(1 - alpha)**
		- **100-slot table** or **100,000-slot table** with alpha = 0.9, still takes on average 10 attempts

|#Collisions|Probability|#Of Tries|
|-----------|-----------|---------|
|0          |(1 - alpha)|1        |
|1          |alpha(1 - alpha)|2   |
|2          |((alpha)^2)(1 - alpha)|3|
|3          |((alpha)^3)(1 - alpha)|4|
|n          |((alpha)^n)(1 - alpha)|n+1|

- Total # of tries should be:
	- (1 - alpha) + 2(alpha)(1 - alpha) + 3(alpha^2)(1 - alpha) + ... + (n+1)(alpha^n)(1 - alpha)
	- It can be rewritten as (1 - alpha)(1 + 2(alpha) + 3(alpha^2) + 4(alpha^3) + ... + (n + 1)(alpha^n))
	- You might recognize that this is a geometric series that gives us:
		- derivative of (alpha/(1 - alpha)) -> (1/(1-alpha)^2)
	- 1/(1 - alpha)

## Expected vs. Worst Case

- If alpha = 0.5, we expect it to take 2 tries to insert a key. This does not (directly) depend on the number of keys present
- Therefore we say that the expected case run time for isnertion is O(1). (And therefore also for search ... we're simplifying here... you'll cover more in algorithms)
- Expected is good for many applications, but not e.g., missile defense... the worst case is still O(n) with linear probing

|Set|Hash|
|---|----|
|**Search/Find**|O(n) (expected O(1))|
|**Add**|O(n) (expected O(1))|

## Probing

- Linear probing is simple but has the disadvantage of **primary clustering**
- What can we do?
	- **Quadratic Probing**!

---

## Quadratic Probing

- Hash Function
	- `h(k, i) = (h(k) + i^2) % 2`
	- In general:
		- `h(k, i) = (h(k) + (c1 * i) + (c2 * i^2)) % m`

- Ex.
	- let m=13
	- let `h(k, i) = (h(k) + i^2) % m`
	- let k = 3, 4, 26, 2, 66, 0
	- k = 3  -> insert in index 3
	- k = 4  -> insert in index 4
	- k = 26 -> insert in index 0
	- k = 2  -> insert in index 2
	- k = 66 -> insert in index 1
	- k = 0  -> insert in index 0 (collision)
		- i=1 -> insert in index 1 (collision)
		- i=2 -> insert in index 4 (collision)
		- i=3 -> insert in index 9

- Essentially, we are increasing the step size

## Quadratic Probing (Problem 1)

- ex. m = 4
	- `h(k) = (k + i^2) % m`
	- k = 0, 1, 4
	- 4 will keep colliding with 0, then 1
	- k = 0 -> insert in index 0
	- k = 1 -> insert in index 1
	- k = 4 -> insert in index 0 (collision)
		 - i=1 -> insert in index 1 (collision)
		 - i=2 -> insert in index 0 (collision)
		 - i=3 -> insert in index 1 (collision)
		 - i=4 -> insert in index 0 (collision)
		 - :(

- A problem with quadratic probing is that you are **not guaranteed** to find an empty slot, even though one exists
- \<if you **make your hash table prime**, then you'll be fine>

## Quadratic Probing (Problem 2)

- Ex.
	- let m = 13
	- let `h(k, i) = (h(k) + i^2) % 2`
	- let k = 3, 4, 26, 2, 66, 0, 39, 13
	- k = 3  -> insert in index 3
	- k = 4  -> insert in index 4
	- k = 26 -> insert in index 0
	- k = 2  -> insert in index 2
	- k = 66 -> insert in index 1
	- k = 0  -> insert in index 0 (collision)
		- i=1 -> insert in index 1 (collision)
		- i=2 -> insert in index 4 (collision)
		- i=3 -> insert in index 9
	- k = 39 -> insert in index 0 (collision)
		- i=1 -> insert in index 1 (collision)
		- i=2 -> insert in index 4 (collision)
		- i=3 -> insert in index 9 (collision)
		- i=4 -> insert in index 3 (collision)
		- i=5 -> insert in index 12 
	- k = 13 -> insert in index 0 (collision)
		- i=1 -> insert in index 1 (collision)
		- i=2 -> insert in index 4 (collision)
		- i=3 -> insert in index 9 (collision)
		- i=4 -> insert in index 3 (collision)
		- i=5 -> insert in index 12 (collision)
		- i=6 -> insert in index 10

- Quadratic probing avoids the problem of primary clustering, but instead has **secondary clustering**
- **Secondary clustering** is a long "run" of filled locations **along a probe sequence**
- If many keys hash to the same location, those collisions will all fill the same probe sequence
- \<you don't "see" the cluster, but it's there>
- \<still it's much better>
- Is there a way to address the secondary clustering problem?

## Double Hashing

- `h(k, i) = (h1(k) + h2(k) * i) % m`
	- \<Ignore what the books says on double hashing... use wikipedia>

- The **probe increment** is not fixed, but **is determined by the key itself**
- Why use double hashing?
	- It avoids secondary clustering
- Again, there is no guarantee we will find an empty slot, unless we pick our constants very carefully

---

[04/24 ->](04-24.md)