Hashing

Note for Hashing

  • Hash function => The function is chosen at random.

Given a typically large universe $U$ of keys, and a positive integer $m$. A random hash function $h:U\to [m]$ is a randomly chosen function from $U\to [m]$.

My Understanding - 1

A random hash function is firstly a function that is selected from a set of hash functions randomly and it can map the keys from $U$ to a range of numbers ${0, … ,m-1}$.

Equivalently, => For each $x$, the value at $x$ is chosen at random.

It is a function $h$ such that for each $x \in U$, $h(x) \in [m]$ is a random variable.

My Understanding - 2

A random hash function is let each key in $U$ be the variable, and the result of hashing every time is random. For example, $h_i$ means $i^{th}$ hashing. $h_1(x)=a, h_2(x)=b$. $a$ $b$ are random variables.

Chinese Version

  1. 随机哈希函数首先是一个从一个含有多个hash functions的集合里随机挑选出来的方程,使得$U \to [m]$.
  2. 同样可以理解为一个哈希方程是让$U$里的每一个值作为哈希方程的自变量,每次对该自变量映射的结果都是随机的。
宏观上来看,每一个值在经过随机哈希后,输出的值是随机的。

Cryptographic hash functions such as MD5, SHA-1, and SHA-256 are not random hash functions.

  • Three things we care
  1. Space (seed size) needed to represent $h$. => the size of $S_h$, cannot be too big
  2. Time needed to calculate $h(x)$ given $x \in U$. => The inner part of a lot of algorithms is hashing.
  3. Properties of the random variable.
  • Hash function types

Truly random

A hash function $h:U\to[m]$ is truly random if the variables $h(x)$ for $x\in U$ are independent and uniform.

一个哈希方程想要 truly random,就得满足对于 $x\in U, h(x)$ 的结果每次都是 $m$ 种可能,每次hashing的结果互不影响(独立),且概率都一样,都是$\frac{1}{m}$(统一)。

IMG_FBAD332B7048-1

一共有$|U|$个输入,对于每一个输入,需要对应$m$个输出,此时一个输入需要$\log_2 m$字节在计算机里,则一共需要$|U|\log_2 m$个空间。

Universal

A random hash function $h:U\to[m]$ is universal if, for all $x\not= y\in U:\Pr [h(x)=h(y)]\le\frac{1}{m}$. => Hash to the same value.

C-approximately universal

A random hash function $h: U\to[m]$ is c-approximately universal if, for all $x\not= y \in U:\Pr[h(x)=h(y)]\le\frac{c}{m}$.

Strongly universal

A random hash function $h:U\to[m]$ is strongly universal (a.k.a. 2-independent) if,

  1. Each key is hashed uniformly into $[m]$. => i.e., $\forall x\in U, q\in [m]:\Pr[h(x)=q]=\frac{1}{m}$.
  2. Any two distinct keys hash independently.

Equivalently, if for all $x\not= y\in U$, and $q, r\in[m]:\Pr[h(x)=q \and h(y)=r]=\frac{1}{m^2}$.

C-approximately strongly universal

A random hash function $h:U\to[m]$ is c-approximately strongly universal if,

  1. Each key is hashed c-approximately uniformly into $[m]$. => i.e., $\forall x \in U, q\in[m]:\Pr[h(x)=q]\le\frac{c}{m}$
  2. Any two distinct keys hash independently.
  • Unordered sets / Hashing with chaining

Maintain a set $S$ of at most $n$ keys from some unordered universe $U$, under three operations.

INSERT(x, S) Insert key $x$ into $S$.

DELETE(x, S) Delete key $x$ from $S$.

MEMBER(x, S) Return $x\in S$.

We could use some form of balanced tree to store $S$, but they usually take $O(\log n)$ time operation, and we want each operation to run in expected constant time.

The worst case for both INSERT and DELETE is rotating $\log_2n$ times. And the worst case of MEMBER operation is finding the leaf node. That’s the reason why these three operations are all run in $O(\log n)$, while hashing can help us run these three operations in constant time. => Hashing with Chaining

  • Hashing with Chaining => Universal Hashing

hashing-2

Then, we store an array where the index of $i$ in this array is a head of a linked list that contains all the elements in our sets that hashed to that element.

这三个方法所花费时间都和链表长度成正比。$\downarrow$
Each operation take $O( L[h(x)] +1)$ time. And we need to prove the former part is a constant time.
  • Theorem - 1
For $x\notin S, \mathbb{E}[ L[h(x)] ]\le 1$. => 找不存在在集合里的$x$所花费的时间。某种程度上算是最差情况,如果最差情况也被bound住,那一般情况肯定在bound里。

Proof.

\[\begin{align*} \mathbb{E}[|L[h(x)]|] & = \mathbb{E}[|\{y\in S | h(y) = h(x) \}|] \Leftarrow By\ definition \\ &=\mathbb{E}\left[\sum_{y\in S}[h(y)=h(x)]\right] \Leftarrow Indicator\ variable \\ &= \sum_{y\in S}\mathbb{E}\left[[h(y)=h(x)]\right] \Leftarrow Linearity \ of \ expectation \\ &= \sum_{y \in S}\Pr[h(y)=h(x)] \Leftarrow Expectation \ of \ indicator \ variable \\ &\le |S|\frac{1}{m} \Leftarrow Since \ x\not=y \Rightarrow Universal \\ &=\frac{n}{m}\le 1 \end{align*}\]

This actually proves that hashing with chaining and expectation you use only constant time per operation.

  • Signatures => Universal Hashing

hashing-3

  • Multiply-mod-prime (2-approximately strongly universal)

It is the most classic but not the fastest. However, it is good enough for some applications.

hashing-4

  • Multiply-shift (2-approximately universal) => Universal Hashing

Extremely cheaper to compute.

  • Strong Multiply-shift => Strongly Universal Hashing

It is a strongly universal hash function.

  • Coordinated sampling => Strongly Universal Hashing

hashing-5

hashing-6

hashing-7

hashing-7

  • Lemma

hashing-8

  • How good is the unbiased estimate with Lemma?

hashing-9

hashing-10




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Line Segment Intersection
  • Graduation
  • Review of SwAV
  • Review of SimSiam
  • Extracts from BYOL