Private Set Intersection

Private Set Intersection (PSI) is a cryptographic protocol that allows two parties, each holding a set of items, to compute the intersection of those sets without revealing any information about items that are not in the intersection.

Imagine Alice and Bob want to find out which friends they have in common from their contact lists. The naive solution is for one to send their entire list to the other—a complete sacrifice of privacy. Private Set Intersection (PSI) is a cryptographic protocol that solves this problem. It allows two parties, each holding a set of items, to compute the intersection of those sets without revealing any information about items that are not in the intersection.

The protocol we will describe is a direct and powerful application of the k-out-of-N Oblivious Transfer we formalized previously. In this construction, one party (Alice) will learn the final intersection, while the other party (Bob) will learn nothing, not even the size of the intersection.

Parties, Inputs, and Primitives:

  • Alice: Holds a set of items $S_A = \{a_1, a_2, \ldots, a_n\}$. In this protocol, she will act as the Receiver in the OT subroutine.
  • Bob: Holds a set of items $S_B = \{b_1, b_2, \ldots, b_m\}$. He will act as the Sender in the OT subroutine.
  • Cryptographic Hash Function ($H$): A function $H$ that maps items from the sets to a large range of indices, i.e., $H: \{0,1\}^* \to \{0, \ldots, N-1\}$, where $N$ is a sufficiently large number. $N$ must be large enough to make accidental hash collisions unlikely.
  • Subroutine: A secure k-out-of-N Oblivious Transfer protocol ($\text{OT}^{k}_{N}$).

Protocol Steps

The protocol can be broken down into four distinct phases.

Phase 1: Bob (OT Sender) Prepares the Database

Bob's goal is to place his items into a large "database" (an array) at positions determined by their hash values. This allows Alice to query the database by hash without Bob knowing which items she is interested in.

  1. Database Initialization: Bob creates a large array, $D$, of size $N$, and initializes every slot with a distinct, random dummy value. For example, $D[i] = \text\{random\_dummy\}_i$ for all $i \in \{0, \ldots, N-1\}$.
  2. Item Insertion: For each item $b_j$ in his set $S_B$:
    a. Bob computes its hash to get an index: $h_j = H(b_j)$.
    b. He then places his original item $b_j$ at that index in the database: $D[h_j] = b_j$.

At the end of this phase, Bob has a database $D$ where a few slots contain his actual items, and the vast majority contain random data. This database $D$ will be his input to the $\text{OT}^{k}_{N}$ protocol.

Phase 2: Alice (OT Receiver) Prepares Her Choices

Alice's goal is to generate a set of indices to query from Bob's database. She will query the locations corresponding to the hashes of her own items.

  1. Index Set Generation: Alice computes a set of indices, $C_A$, by applying the same hash function $H$ to each of her items $a_i$ in her set $S_A$:
    $$C_A = \{ H(a_i) \mid a_i \in S_A \}$$
  2. The size of this choice set is $k = |C_A|$. Note that $k \le n$ since some of Alice's items might hash to the same index. This set $C_A$ will be her choice set for the $\text{OT}^{k}_{N}$ protocol.

Phase 3: k-out-of-N Oblivious Transfer

This is the core interactive phase where the private transfer of information occurs.

  1. Execution: Alice and Bob engage in a $\text{OT}^{k}_{N}$ protocol.
    • Bob acts as the Sender, and his input is the database $D$ of $N$ items he prepared in Phase 1.
    • Alice acts as the Receiver, and her input is her choice set $C_A$ containing $k$ indices.
  2. Result: At the end of the protocol, Alice receives a set, $R$, containing the $k$ items from Bob's database at the indices she requested:
    $$R = \{D[h] \mid h \in C_A\}$$
    Due to the properties of OT, Bob has no information about which indices in $C_A$ Alice queried.

Phase 4: Alice Computes the Intersection

Alice now holds a set $R$ which contains a mix of Bob's actual items (if there was a match) and random dummy values. She can now compute the final intersection locally.

  1. Initialization: Alice creates an empty set, $I$, which will store the intersection.
  2. Filtering: For each value $v$ she received in the set $R$:
    a. Alice checks if $v$ is present in her original set $S_A$.
    b. If $v \in S_A$, it means that Bob also had this item (otherwise she would have received a dummy value). She adds this item to her intersection set: $I = I \cup {v}$.
    c. If $v \notin S_A$, it must be a random dummy value, and she simply discards it.
  3. Final Result: The set $I$ now contains the exact intersection of the two sets: $I = S_A \cap S_B$.

Security: Alice's privacy is protected because she never reveals her set $S_A$ to Bob; she only uses hashes of her items as inputs to an OT protocol. Bob's privacy is protected because Alice only learns the items that are at the specific indices she queried, and these indices correspond to items she already possesses. She learns nothing about any of Bob's other items.