Note for Hashing

Hash function => The function is chosen at random.

Given a typically large universe $U$ of keys, and a positive integer $m$. A random hash function $h:U\to [m]$ is a randomly chosen function from $U\to [m]$.

My Understanding - 1

A random hash function is firstly a function that is selected from a set of hash functions randomly and it can map the keys from $U$ to a range of numbers ${0, … ,m-1}$.

Equivalently, => For each $x$, the value at $x$ is chosen at random.

It is a function $h$ such that for each $x \in U$, $h(x) \in [m]$ is a random variable.

My Understanding - 2

A random hash function is let each key in $U$ be the variable, and the result of hashing every time is random. For example, $h_i$ means $i^{th}$ hashing. $h_1(x)=a, h_2(x)=b$. $a$ $b$ are random variables.

Chinese Version

随机哈希函数首先是一个从一个含有多个hash functions的集合里随机挑选出来的方程，使得$U \to [m]$.
同样可以理解为一个哈希方程是让$U$里的每一个值作为哈希方程的自变量，每次对该自变量映射的结果都是随机的。

宏观上来看，每一个值在经过随机哈希后，输出的值是随机的。

Cryptographic hash functions such as MD5, SHA-1, and SHA-256 are not random hash functions.

Three things we care

Space (seed size) needed to represent $h$. => the size of $S_h$, cannot be too big
Time needed to calculate $h(x)$ given $x \in U$. => The inner part of a lot of algorithms is hashing.
Properties of the random variable.

Hash function types

Truly random

A hash function $h:U\to[m]$ is truly random if the variables $h(x)$ for $x\in U$ are independent and uniform.

一个哈希方程想要 truly random，就得满足对于 $x\in U, h(x)$ 的结果每次都是 $m$ 种可能，每次hashing的结果互不影响（独立），且概率都一样，都是$\frac{1}{m}$（统一）。

IMG_FBAD332B7048-1

一共有$|U|$个输入，对于每一个输入，需要对应$m$个输出，此时一个输入需要$\log_2 m$字节在计算机里，则一共需要$|U|\log_2 m$个空间。

Universal

A random hash function $h:U\to[m]$ is universal if, for all $x\not= y\in U:\Pr [h(x)=h(y)]\le\frac{1}{m}$. => Hash to the same value.

C-approximately universal

A random hash function $h: U\to[m]$ is c-approximately universal if, for all $x\not= y \in U:\Pr[h(x)=h(y)]\le\frac{c}{m}$.

Strongly universal

A random hash function $h:U\to[m]$ is strongly universal (a.k.a. 2-independent) if,

Each key is hashed uniformly into $[m]$. => i.e., $\forall x\in U, q\in [m]:\Pr[h(x)=q]=\frac{1}{m}$.
Any two distinct keys hash independently.

Equivalently, if for all $x\not= y\in U$, and $q, r\in[m]:\Pr[h(x)=q \and h(y)=r]=\frac{1}{m^2}$.

C-approximately strongly universal

A random hash function $h:U\to[m]$ is c-approximately strongly universal if,

Each key is hashed c-approximately uniformly into $[m]$. => i.e., $\forall x \in U, q\in[m]:\Pr[h(x)=q]\le\frac{c}{m}$
Any two distinct keys hash independently.

Unordered sets / Hashing with chaining

Maintain a set $S$ of at most $n$ keys from some unordered universe $U$, under three operations.

INSERT(x, S) Insert key $x$ into $S$.

DELETE(x, S) Delete key $x$ from $S$.

MEMBER(x, S) Return $x\in S$.

We could use some form of balanced tree to store $S$, but they usually take $O(\log n)$ time operation, and we want each operation to run in expected constant time.

The worst case for both INSERT and DELETE is rotating $\log_2n$ times. And the worst case of MEMBER operation is finding the leaf node. That’s the reason why these three operations are all run in $O(\log n)$, while hashing can help us run these three operations in constant time. => Hashing with Chaining

Hashing with Chaining => Universal Hashing

hashing-2

Then, we store an array where the index of $i$ in this array is a head of a linked list that contains all the elements in our sets that hashed to that element.

这三个方法所花费时间都和链表长度成正比。$\downarrow$

Each operation take $O(

L[h(x)]

+1)$ time. And we need to prove the former part is a constant time.

Theorem - 1

For $x\notin S, \mathbb{E}[

L[h(x)]

]\le 1$. => 找不存在在集合里的$x$所花费的时间。某种程度上算是最差情况，如果最差情况也被bound住，那一般情况肯定在bound里。

Proof.

\[\begin{align*} \mathbb{E}[|L[h(x)]|] & = \mathbb{E}[|\{y\in S | h(y) = h(x) \}|] \Leftarrow By\ definition \\ &=\mathbb{E}\left[\sum_{y\in S}[h(y)=h(x)]\right] \Leftarrow Indicator\ variable \\ &= \sum_{y\in S}\mathbb{E}\left[[h(y)=h(x)]\right] \Leftarrow Linearity \ of \ expectation \\ &= \sum_{y \in S}\Pr[h(y)=h(x)] \Leftarrow Expectation \ of \ indicator \ variable \\ &\le |S|\frac{1}{m} \Leftarrow Since \ x\not=y \Rightarrow Universal \\ &=\frac{n}{m}\le 1 \end{align*}\]

This actually proves that hashing with chaining and expectation you use only constant time per operation.