Hashing
Note for Hashing
- Hash function => The function is chosen at random.
Given a typically large universe $U$ of keys, and a positive integer $m$. A random hash function $h:U\to [m]$ is a randomly chosen function from $U\to [m]$.
My Understanding - 1
A random hash function is firstly a function that is selected from a set of hash functions randomly and it can map the keys from $U$ to a range of numbers ${0, … ,m-1}$.
Equivalently, => For each $x$, the value at $x$ is chosen at random.
It is a function $h$ such that for each $x \in U$, $h(x) \in [m]$ is a random variable.
My Understanding - 2
A random hash function is let each key in $U$ be the variable, and the result of hashing every time is random. For example, $h_i$ means $i^{th}$ hashing. $h_1(x)=a, h_2(x)=b$. $a$ $b$ are random variables.
Chinese Version
- 随机哈希函数首先是一个从一个含有多个hash functions的集合里随机挑选出来的方程,使得$U \to [m]$.
- 同样可以理解为一个哈希方程是让$U$里的每一个值作为哈希方程的自变量,每次对该自变量映射的结果都是随机的。
Cryptographic hash functions such as MD5, SHA-1, and SHA-256 are not random hash functions.
- Three things we care
- Space (seed size) needed to represent $h$. => the size of $S_h$, cannot be too big
- Time needed to calculate $h(x)$ given $x \in U$. => The inner part of a lot of algorithms is hashing.
- Properties of the random variable.
- Hash function types
Truly random
A hash function $h:U\to[m]$ is truly random if the variables $h(x)$ for $x\in U$ are independent and uniform.
一个哈希方程想要 truly random,就得满足对于 $x\in U, h(x)$ 的结果每次都是 $m$ 种可能,每次hashing的结果互不影响(独立),且概率都一样,都是$\frac{1}{m}$(统一)。一共有$|U|$个输入,对于每一个输入,需要对应$m$个输出,此时一个输入需要$\log_2 m$字节在计算机里,则一共需要$|U|\log_2 m$个空间。
Universal
A random hash function $h:U\to[m]$ is universal if, for all $x\not= y\in U:\Pr [h(x)=h(y)]\le\frac{1}{m}$. => Hash to the same value.
C-approximately universal
A random hash function $h: U\to[m]$ is c-approximately universal if, for all $x\not= y \in U:\Pr[h(x)=h(y)]\le\frac{c}{m}$.
Strongly universal
A random hash function $h:U\to[m]$ is strongly universal (a.k.a. 2-independent) if,
- Each key is hashed uniformly into $[m]$. => i.e., $\forall x\in U, q\in [m]:\Pr[h(x)=q]=\frac{1}{m}$.
- Any two distinct keys hash independently.
Equivalently, if for all $x\not= y\in U$, and $q, r\in[m]:\Pr[h(x)=q \and h(y)=r]=\frac{1}{m^2}$.
C-approximately strongly universal
A random hash function $h:U\to[m]$ is c-approximately strongly universal if,
- Each key is hashed c-approximately uniformly into $[m]$. => i.e., $\forall x \in U, q\in[m]:\Pr[h(x)=q]\le\frac{c}{m}$
- Any two distinct keys hash independently.
- Unordered sets / Hashing with chaining
Maintain a set $S$ of at most $n$ keys from some unordered universe $U$, under three operations.
INSERT(x, S)
Insert key $x$ into $S$.
DELETE(x, S)
Delete key $x$ from $S$.
MEMBER(x, S)
Return $x\in S$.
We could use some form of balanced tree to store $S$, but they usually take $O(\log n)$ time operation, and we want each operation to run in expected constant time.
The worst case for both INSERT
and DELETE
is rotating $\log_2n$ times. And the worst case of MEMBER
operation is finding the leaf node. That’s the reason why these three operations are all run in $O(\log n)$, while hashing can help us run these three operations in constant time. => Hashing with Chaining
- Hashing with Chaining => Universal Hashing
Then, we store an array where the index of $i$ in this array is a head of a linked list that contains all the elements in our sets that hashed to that element.
这三个方法所花费时间都和链表长度成正比。$\downarrow$Each operation take $O( | L[h(x)] | +1)$ time. And we need to prove the former part is a constant time. |
- Theorem - 1
For $x\notin S, \mathbb{E}[ | L[h(x)] | ]\le 1$. => 找不存在在集合里的$x$所花费的时间。某种程度上算是最差情况,如果最差情况也被bound住,那一般情况肯定在bound里。 |
Proof.
\[\begin{align*} \mathbb{E}[|L[h(x)]|] & = \mathbb{E}[|\{y\in S | h(y) = h(x) \}|] \Leftarrow By\ definition \\ &=\mathbb{E}\left[\sum_{y\in S}[h(y)=h(x)]\right] \Leftarrow Indicator\ variable \\ &= \sum_{y\in S}\mathbb{E}\left[[h(y)=h(x)]\right] \Leftarrow Linearity \ of \ expectation \\ &= \sum_{y \in S}\Pr[h(y)=h(x)] \Leftarrow Expectation \ of \ indicator \ variable \\ &\le |S|\frac{1}{m} \Leftarrow Since \ x\not=y \Rightarrow Universal \\ &=\frac{n}{m}\le 1 \end{align*}\]This actually proves that hashing with chaining and expectation you use only constant time per operation.
- Signatures => Universal Hashing
- Multiply-mod-prime (2-approximately strongly universal)
It is the most classic but not the fastest. However, it is good enough for some applications.
- Multiply-shift (2-approximately universal) => Universal Hashing
Extremely cheaper to compute.
- Strong Multiply-shift => Strongly Universal Hashing
It is a strongly universal hash function.
- Coordinated sampling => Strongly Universal Hashing
- Lemma
- How good is the unbiased estimate with Lemma?
Enjoy Reading This Article?
Here are some more articles you might like to read next: