summaryrefslogtreecommitdiff
path: root/04-20.md
blob: 88e01424f7032e3448c4ea8af654bd25ed0bba99 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
[\<- 04/15](04-15.md)

---

## Review What We Have Learned So Far

|Basic Operation|Unsorted Array|Sorted Array|
|---------------|--------------|------------|
|**Search Find**|O(n)          |O(log(n))   |
|**Insert**     |O(1)          |O(n)        |
|**Delete**     |O(1)          |O(n)        |
|**Min/Max**    |O(n)          |O(1)        |

- Search Find
	- Unsorted would use sequential search
	- Sorted would use binary search

- Insert
	- Unsorted: you can just insert at the end
	- Sorted: need to determine where to insert so that the array stays ordered

- Delete
	- Unsorted: can just swap the last entry in for the deleted entry and make the array length minus 1
	- Sorted: need to shift all indices after deleted index down

- Min/Max
	- Unsorted: Need to traverse entire array
	- Sorted: Min and max are the first and last indices

## Making it Faster

- Let's work on making searching in an array even faster: O(1)

## Hashing

- A key-to-address mapping process

- Ex.
	- We have keys
		- x
		- y
	- We have address
		- 000
		- 001
		- 002
		- ...
		- 100
	- We have a function `h()`
	- `h(x)` is the **address** for **key** x, `h(y)` is the **address** for **key** y
	- to search: `a[h(x)]`... O(1)

## Outline

- Q: I want to know if someone recieved a given score x on the exam. Each score should range from 0 to 100. How could I do this quickly?
- A: Have an array in which each score has its own slot
- What is the hashing method `h(x)`?

```
h(x) = x;
```

- It's called **direct hashing**!

## Example 2

- Let's try something else. We're reading words from a book, so let's ask if all the letters were used
- Q: What if our inputs were letters? How to construct `h()`?
	- We have an array from index 0 to 51. The first half is for lower case letters. The second half for upper case letters
		- 'a' should go to index 0
		- 'b' should go to index 1... etc
		- We will take advantage of the ASCII table

```
if('a' <= x <= 'z') h(x) = x-'a';
if('A' <= x <= 'Z') h(x) = x-'A'+26; //offset to the second half of the array
```

- The Big-O runtime of this is O(1)
- This is so-called **subtraction method**

## Example 3

- Ex. What if I want to know whether someone in the university earns a certain salary? And I know that the lowest paid person is #30k and highest paid is $200k, and salaries are in increments of $1k... so...
- Q: How would you map them?
	- There should be an array where index 0 is 30k, 1 is 31k, 2 is 32k, etc. The final index is 200k.

```
h(x) = (x - 30k)/1k
```

## Concepts

- What we're doing in these examples is trying to find a function, `h(x)`, that takes as inputs some arbitrary data and maps them onto 0 to m-1, which we can then use ar array indices
- This technique is called **hashing**
- The array indexed by such a function is called a **hash table**
- The function itself is called the **hash function**
- It turns out that if we know the keys in advance then we can always derive a **perfect hash function** (each key hashes to its own unique location). **One-to-one mapping**

## Example 4

- What if I wanted to use you ID numbers, but map them to no more than 100 slots?
	- Using the first digit... is terrible (will not work)
	- Using the last digit... a little better, but would only work with at most 10 students
	- Using the last two digits... this might work, but there is **hashing collision** if two students have the same last two digits
	- Multiply by 17, divide by 97 and take the remainder... maybe better, but no guaruntee that there won't be collision

In practice, we don't know the keys in advance, it is very hard to derive a perfect hash function

## A New Hashing Method - Modulo Arithmetic

- The simplest way to make sure a hash function always maps within 0..m-1 is to take the result and take the remainder fo dividing by m
- Since our hash function needs to yield values between "0" and "m-1" (m is max), we usually do: `h(x) % m`
- We like to pick m as a **prime number**, since it will generate less collisions (you don't have to understand why)

## Modulo Arithmetic

- Let's have a simple example
	- m = 7
	- k = 8, 14, 12, 2, 4
	- `h(k) = k % m`

- h(8) -> go to index 1
- h(14) -> go to index 2
- h(12) -> go to index 5
- h(2) -> go to index 2
- h(4) -> go to index 4

- If we insert 11...
	- h(11) -> go to index 4
		- There is multiple keys mapped to the same index now -> collision :(

## Collision Resolutions

- Even if we choose a good hash function that does a good job distributing the keys across the entire hash table, we will still have collisions in practice and need to deal with them
- The next step will be to cover **Collision Resolutions**...

## Probing

- A simple way of resolving collisions: **probing**
- If a key 'y' collids with a key 'x' that is already in the table, 'x' is **not** displaced. Rather 'y' must be placed somewhere else
- In probing, the colliding key tries another slot in the table (**probes**), if that slot is empty, then is is placed there, otherwise it keeps trying (**probing**) different slots
- Probing methods covered in this class:
	- Linear Probing
	- Quadratic Probing
	- Double Hashing

## Linear Probing

- We can think of our hash function as having an extra parameter `h(k, i)`, where 'i' is the **probe number** that starts at zero for each key (i.e., it's the number of the try)
- Simplest type of probing:
	- `h(k, i) = (h(k) + i) % m`
	- This is called **linear probing**

## Linear Probing (Insertion)

- If i=0, then `h(k, i) = h(k)`, i.e., the "home" location of the key. If that is occupied, we use the next location to the right, wrapping around to the front of the array
- Example, for m=11, let's have number between 0 and 100 and map them to the array
	- k=20 -> go to index 9
	- k=40 -> go to index 4
	- k=15 -> go to index 4 (but that location is already taken by k=40)
		- i++ -> go to index 5

- Another example:
	- m = 9
	- k = 0, 9, 18, 27, 36, 45, 54
	- k=0 -> go to index 0
	- k=9 -> go to index 0 (but it's taken)
		- i++ -> go to index 1
	- k=18 -> go to index 0 (but it's taken)
		- i++ -> go to index 1 (but it's taken)
		- i++ -> go to index 2
	- k=27 -> go to index 0 (but it's taken)
		- i++ -> go to index 1 (but it's taken)
		- i++ -> go to index 2 (but it's taken)
		- i++ -> go to index 3
	- If all keys map to the same location -> BAD!
	- What is the big-O runtime for inserting a new key (assume there are already 'n' keys in the table)
		- O(n) (pretty bad)

- In the **worst case**, if all keys hash to the same location, insertion is **O(n)**
- General case **linear probing**:
	- `h(k, i) = (h(k) + c*i) % m`
	- e.g. c=2, m=7, `h(k, i) = (h(k) + 2*i) % m`, k=0,2,7
		- k=0 -> go to index 0
		- k=2 -> go to index 2
		- k=7 -> go to index 0 (but it's taken)
			- i++ -> go to index 2 (but it's taken)
			- i++ -> go to index 4

---

[04/22 ->](04-22.md)