
5PM: Secure pattern matching

TLDR
This paper considers the problem of secure pattern matching that allows single-character wildcards and substring matching in the malicious stand-alone setting, and presents the first secure expressive pattern matching protocol designed to optimize round complexity by carefully specifying the entire protocol round by round.
Abstract
In this paper we consider the problem of secure pattern matching that allows single-character wildcards and substring matching in the malicious stand-alone setting. Our protocol, called 5PM, is executed between two parties: Server, holding a text of length n, and Client, holding a pattern of length m to be matched against the text, where our notion of matching is more general than traditionally considered and includes non-binary alphabets, non-binary Hamming distance and non-binary substring matching. 5PM is the first secure expressive pattern matching protocol designed to optimize round complexity by carefully specifying the entire protocol round by round. 5PM requires only eight rounds in the malicious static corruptions model. In the malicious model, 5PM requires $O((m+n)k^2)$ communication complexity and $O(m+n)$ encryptions, where m is the pattern length and n is the text length. Further, 5PM can hide pattern size with no asymptotic additional costs in either computation or bandwidth.


5PM: Secure Pattern Matching*
Joshua Baron$^{2}$, Karim El Defrawy$^{2}$, Kirill Minkovich$^{2}$, Rafail Ostrovsky$^{1}$, and Eric Tressler$^{2}$
$^{1}$ Departments of Mathematics and Computer Science, UCLA, Los Angeles, CA, USA 90095
$^{2}$ Information and System Sciences Laboratory, HRL Laboratories, LLC, Malibu, CA, USA, 90265
{jwbaron,kmeldefrawy,kminkovich,eptressler}@hrl.com, rafail@cs.ucla.edu
Abstract. In this paper we consider the problem of secure pattern matching that allows single-
character wildcards and substring matching in the malicious (stand-alone) setting. Our protocol, called
5PM, is executed between two parties: Server, holding a text of length n, and Client, holding a pattern
of length m to be matched against the text, where our notion of matching is more general and includes
non-binary alphabets, non-binary Hamming distance and non-binary substring matching.
5PM is the first secure expressive pattern matching protocol designed to optimize round complexity
by carefully specifying the entire protocol round by round. In the malicious model, 5PM requires
$O((m+n)k^2)$ bandwidth and $O(m+n)$ encryptions, where m is the pattern length and n is the text
length. Further, 5PM can hide pattern size with no asymptotic additional costs in either computation or
bandwidth. Finally, 5PM requires only two rounds of communication in the honest-but-curious model
and eight rounds in the malicious model. Our techniques reduce pattern matching and generalized
Hamming distance problems to a novel linear algebra formulation that allows for generic solutions
based on any additively homomorphic encryption. We believe our efficient algebraic techniques are of
independent interest.
1 Introduction
Pattern matching is fundamental to computer science. It is used in many areas, including text
processing, database search [1], networking and security applications [2] and recently in the context
of bioinformatics and DNA analysis [3,4,5]. It is a problem that has been extensively studied,
resulting in several efficient (although insecure) techniques to solve its many variations, e.g., [6,7,8,9].
The most common interpretation of the pattern matching problem is the following: given a finite
alphabet $\Sigma$, a text $T \in \Sigma^n$ and a pattern $p \in \Sigma^m$, the exact pattern matching decision problem
requires one to decide whether or not a pattern appears in the text. The exact pattern matching
search problem requires finding all indices i of T (if any) where p occurs as a substring starting
at position i. If we denote by $T_i$ the ith character of T, the output should be the set of matching
positions $MP := \{i \mid p \text{ matches } T \text{ beginning at } T_i\}$. The following generalizations of the exact
matching problem are often encountered, where the output in all cases is the set MP:
Pattern matching with single-character wildcards$^{1}$: There is a special character $* \notin \Sigma$ that
matches any single character of the alphabet, where $p \in \{\Sigma \cup \{*\}\}^m$ and $T \in \Sigma^n$. Using such
* This work was done while the first author was at UCLA. The work of the first and fourth author is supported in
part by NSF grants CCF-0916574, IIS-1065276, CCF-1016540, CNS-1118126, CNS-1136174, and by US-Israel BSF
grant 2008411. It was also supported by the OKAWA Foundation Research Award, IBM Faculty Research Award,
Xerox Faculty Research Award, B. John Garrick Foundation Award, Teradata Research Award and Lockheed-
Martin Corporation Research Award. The material contained herein is also based upon work supported by the
Defense Advanced Research Projects Agency through the U.S. Office of Naval Research under Contract N00014-
11-1-0392. The views expressed are those of the author and do not reflect the official policy or position of the
Department of Defense or the U.S. Government. The authors would like to thank Jonathan Katz, Sky Faber and
Matt Cheung for helpful discussions and comments.
$^{1}$ Such wildcards are also called “do not cares” and “mismatches” in the literature.
© 2011 HRL Laboratories, LLC. All Rights Reserved

Paper   NB Hamming   Exact      Wildcard   NB Substring   Security
        Distance     Matching   Matching   Matching
[13]    No           Yes        No         No             HBC/M
[14]    Yes*         Yes        Yes        Yes*           HBC/M
[15]    Yes          No**       No**       No**           HBC
5PM     Yes          Yes        Yes        Yes            HBC/M
Table 1. Comparison of previous protocol functionality. NB = non-binary, HBC = honest-but-curious,
M = malicious, * = using unary encoding and additional tools, ** = can be extended.
a “wildcard” character allows one pattern to be specified that could match several sequences
of characters. For example, the pattern TA∗ would match any of the following character
sequences in a text$^{2}$: TAA, TAC, TAG, and TAT.
Substring pattern matching: Fix some $l \le m$; a match for p is found whenever there exists in
T an m-length string that differs in l characters from p (i.e., has Hamming distance l from
p). For example, the pattern TAC has m = 3. If l = 1, then any of the following words will
match: ∗AC, T∗C, or TA∗; note that this is an example of non-binary substring matching.
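The three problem variants above can be pinned down with a plain (insecure) reference implementation. The following is a minimal sketch, not the protocol of this paper: the names `WILDCARD`, `chars_match` and `matching_positions` are hypothetical, the wildcard is written as `*`, and substring matching is read here as "at most l mismatches", which is an assumption (the definition can also be read as exactly l).

```python
WILDCARD = "*"  # assumed symbol for the special character outside the alphabet


def chars_match(pattern_char, text_char):
    """A pattern character matches if it is the wildcard or equals the text character."""
    return pattern_char == WILDCARD or pattern_char == text_char


def matching_positions(text, pattern, max_mismatches=0):
    """Return the set MP as a sorted list of 1-indexed start positions.

    max_mismatches=0 gives exact (or wildcard) matching; a positive value
    accepts every window within that Hamming distance of the pattern
    (assumption: "at most", not "exactly").
    """
    n, m = len(text), len(pattern)
    positions = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        distance = sum(not chars_match(pc, tc) for pc, tc in zip(pattern, window))
        if distance <= max_mismatches:
            positions.append(start + 1)  # convert to 1-indexed positions
    return positions
```

For instance, over the DNA alphabet, `matching_positions("TAGTACTAT", "TA*")` collects every window that starts with TA, while `matching_positions("TAGTACTAT", "TAC", max_mismatches=1)` accepts windows that differ from TAC in one character.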
A secure version of pattern matching has many applications. For example, secure pattern matching
can help secure databases that contain medical information such as DNA records, while still
allowing one to perform pattern matching operations on such data. The need for privacy-preserving
DNA matching has been highlighted in recent papers [10,11,12]. In addition to the case of DNA
matching, where substring matching may be particularly useful, Hamming distance-based
approximate matching has also been demonstrated in the case of secure facial recognition [3]. We note
that both of these settings require computation over non-binary alphabets.
1.1 Our Contributions
This paper presents 5ecure Pattern Matching (or 5PM), a new protocol for arbitrary alphabets that
addresses, in addition to exact matching, more expressive search queries including single-character
wildcards and substring pattern matching, and also provides the ability to hide pattern length.
5PM has communication complexity sublinear in circuit size (as opposed to general MPC, which
has communication complexity linear in circuit size) to securely compute non-binary substring
matching in the malicious model. In addition, our extension of Hamming distance computation
to substring matching has minimal overhead; our protocol makes a single computation pass per
text element, even for multiple Hamming distance values, and therefore is able to securely compute
non-binary substring matching efficiently (see Table 1 for a comparison of protocol functionality
and Tables 2 and 3 for a comparison of protocol overhead).
5PM performs exact, single-character wildcard, and substring pattern matching in the
honest-but-curious and malicious (static corruption) models. Our malicious model protocol requires
$O((m+n)k^2)$ bandwidth complexity. Further, our protocol can be specified to require two (one-way) rounds
of communication in the semi-honest model and eight (one-way) rounds of communication in the
malicious model.
We construct our protocols by reducing the problems of Hamming distance and pattern matching,
including single-character wildcards and substring matching, to a sequence of linear operations.
$^{2}$ Here and throughout, we use the DNA alphabet ($\Sigma = \{A, C, G, T\}$) for examples.

Paper   Encryptions   Exponentiations   Multiplications   Bandwidth          Rds
[16]    O(mn)         O(mn)             O(mn)             O(mnk^2)           O(1)
[14]    O(n + m)      O(n log m)        O(nm)             O((n + m)k^2)      O(1)
5PM     O(n + m)      O(nm)             O(nm)             O((n + m)k^2)      8
Table 2. Detailed comparison with [14] and [16] for single-character wildcards and substring matching
in the malicious model, with text length = n, pattern length = m, security parameter = k, rounds = Rds.
Paper   Encryptions   Exponentiations   Multiplications   Bandwidth          Rds
[15]    O(n + m)      O(nm)             O(nm)             O((nm)k)           O(1)
5PM     O(n + m)      O(n + m)          O(nm)             O((n + m)k)        2
Table 3. Detailed comparison with [15] for non-binary substring matching in the HBC model, with
text length = n, pattern length = m, security parameter = k, rounds = Rds.
We then rely on the observation that these linear operations, such as the inner products and matrix
multiplication, can be efficiently computed in the malicious model using additively homomorphic
encryption schemes.
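To make the role of additive homomorphism concrete, the sketch below computes an inner product between an encrypted vector and a plaintext vector. It uses a toy Paillier instance purely as a familiar example of an additively homomorphic scheme; this choice, the tiny primes, and the function names (`keygen`, `encrypt`, `decrypt`, `encrypted_inner_product`) are assumptions made for illustration and are not taken from this paper. The parameters are far too small to be secure, and nothing here addresses malicious behavior.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Enc(m) = (1 + m*n) * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + message * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, secret_key, ciphertext):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    lam, mu = secret_key
    n2 = n * n
    return (pow(ciphertext, lam, n2) - 1) // n * mu % n


def encrypted_inner_product(n, encrypted_vector, plain_vector):
    """Return an encryption of sum(x_i * y_i) given Enc(x_i) and plaintext y_i.

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to the
    power y multiplies its plaintext by y.
    """
    n2 = n * n
    accumulator = encrypt(n, 0)
    for ciphertext, y in zip(encrypted_vector, plain_vector):
        accumulator = accumulator * pow(ciphertext, y, n2) % n2
    return accumulator
```

A possible use: one party encrypts `[1, 0, 1, 1]` element by element, the other party calls `encrypted_inner_product` with its own plaintext vector, and only the key holder can decrypt the resulting sum. The party doing the linear algebra never sees the encrypted vector in the clear.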
The security requirements (informally) dictate that the party holding the text learns nothing
except the upper bound on the length of the pattern, while the party holding the pattern only
learns either a binary (yes/no) answer for the decision problem or the matching positions (if any),
and nothing else.
1.2 Comparison to Previous Work
Exact Matching. In the exact pattern matching setting, the algorithm of Freedman, Ishai,
Pinkas and Reingold [13] achieves polylogarithmic overhead in m and n and polynomial overhead
in security parameters in the honest-but-curious setting. Using efficient arguments [17,18] with
the modern probabilistically checkable proofs (PCPs) of proximity [19], one can extend (at least
asymptotically) their results to the malicious (static corruption) model. However, the protocol in
[13] works only for exact matching and does not address more general problems, including single-
character wildcards and substring matching, which are the main focus of our work. Other protocols
that address secure exact matching (and not wildcard or substring matching) are [12,20,21,22,23,11];
of these, only [22] obtains (full) security in the malicious setting. We note that [23] is more efficient
than [13], but only in the random oracle model; here, we are interested in standard security models.
Single-Character Wildcards and Substring Matching. Recently, Vergnaud [14] built on
the work of Hazay and Toft [16] to construct an efficient secure pattern matching scheme for wildcard
matching and substring matching (requiring t runs over the preliminary matching result to search
for t different Hamming distance values, which is also required by 5PM) in the malicious adversary
model. More specifically, [14,16] take advantage of the fact that $(p_i - t_i)^2$ equals 0 if binary values
$p_i$ and $t_i$ are equal and 1 if they are not equal; therefore, binary Hamming distance can essentially
be computed by counting the number of 1s in a particular polynomial-based computation. However,
when $p_i$ and $t_i$ are non-binary, it is unknown how to obtain 0 when $p_i$ and $t_i$ are equal, and 1 (or some
other fixed value) when they are not equal using oblivious polynomial evaluations.
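The limitation of the squared-difference trick is easy to check in the clear. The sketch below is illustrative only: the function names are hypothetical, and the numeric encoding A=0, C=1, G=2, T=3 is an assumption introduced here to have non-binary values to subtract.

```python
def squared_difference_sum(p, t):
    """Sum of (p_i - t_i)^2 over aligned positions."""
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t))


def hamming_distance(p, t):
    """Number of aligned positions where the two sequences differ."""
    return sum(pi != ti for pi, ti in zip(p, t))


# Binary inputs: every mismatch contributes exactly 1, so the two agree.
binary_p, binary_t = [1, 0, 1, 1], [1, 1, 0, 1]

# Non-binary inputs under the assumed encoding A=0, C=1, G=2, T=3:
# a single A-versus-T mismatch contributes (0 - 3)^2 = 9 rather than 1.
dna_p, dna_t = [0, 3], [3, 3]
```

On the binary pair both functions return the same count, while on the DNA pair the squared-difference sum depends on which characters differ, so it no longer counts mismatches.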
However, non-binary elements can be computed by unary encoding; that is, an element $\alpha \in \Sigma$
can be encoded as an element $\alpha' \in \{0, 1\}^{|\Sigma|}$ with all 0s except for a single 1 in the place representing
$\alpha$ (lexicographically). There are two subtleties of such an approach. The first is that if $\alpha \neq \beta$, then
$\alpha'$ and $\beta'$ will have Hamming distance 2 instead of 1; the second is, in the malicious case,
zero-knowledge proofs are needed to demonstrate that $\alpha'$ is well formed.
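Both subtleties can be seen in a few lines. This is a plain-text sketch with hypothetical helper names; `is_well_formed` only states the property that a zero-knowledge proof would have to establish and is not itself such a proof.

```python
ALPHABET = "ACGT"  # lexicographic order fixes the position of the single 1


def unary_encode(char):
    """Encode one character as a |alphabet|-bit vector with a single 1."""
    return [1 if char == symbol else 0 for symbol in ALPHABET]


def encode_string(string):
    """Concatenate the unary encodings of all characters."""
    return [bit for char in string for bit in unary_encode(char)]


def binary_hamming(u, v):
    """Hamming distance between two equal-length bit vectors."""
    return sum(a != b for a, b in zip(u, v))


def is_well_formed(block):
    """A valid encoding has the right length, only bits, and exactly one 1."""
    return (len(block) == len(ALPHABET)
            and all(bit in (0, 1) for bit in block)
            and sum(block) == 1)
```

Comparing `encode_string("TAC")` with `encode_string("TAG")` gives binary distance 2 for a single character mismatch, so a count obtained on encodings has to be halved to recover the non-binary Hamming distance.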
[14] requires O(m + n) encryptions, O(n log m) exponentiations, O(nm) multiplications (of
encrypted elements), and O(n + m) bandwidth, all in a constant number of rounds. By contrast, 5PM
has the same overhead except for O(nm) exponentiations (see Table 2). However, our work is of
interest for several reasons. The first is that we have implemented our protocol and believe it to
be more efficient (additional work is needed on this front). The second is that our techniques are
of independent interest and may be extended to additional functionalities. Finally, the protocol
presented here is fully specified; by contrast, additional work is needed to transform the work of
[14] into a protocol that can support non-binary alphabets for substring matching or to calculate
Hamming distance in the malicious case.
Non-binary Hamming Distance. Jarrous and Pinkas [15] gave the first construction of a
secure protocol for computing non-binary Hamming distances. In order to count the non-binary
mismatches, they leverage 1-out-of-2 oblivious transfers. 5PM can also compute non-binary
Hamming distance even when the text and pattern have the same length (and where the output is not
blinded to only reveal whether or not a pattern match occurred). We note that [15] can be used to
implement exact and substring matching with additional tools to blind Hamming distance output
(for instance, see [14]). [15], to compare two strings of length n, requires O(n) 1-out-of-2 OTs, O(n)
multiplications of encryptions and O(nk) bandwidth, while 5PM requires O(n) exponentiations
(which require less computation than OTs), $O(n^2)$ multiplications, and O(nk) bandwidth. The
advantage of 5PM over [15] is twofold: the first is that 5PM is proven secure in the malicious model
while [15] is not; the second is that 5PM, in both the honest-but-curious and malicious models,
amortizes well in the substring matching setting, while [15] does not amortize because it cannot
reuse OT outputs to compute substring matching (see Table 3).
Other Techniques. In the most general case, secure exact, approximate and single-character
wildcards pattern matching is an instance of general secure two-party computation techniques (for
instance, [24,25,26,27]). All of these schemes have bandwidth and computational complexity at
best linear in the circuit size. For instance, a naive implementation of Yao [24] requires bandwidth
O(mn) in the security parameter. In contrast, we aim for a protocol where circuit size is O(mn),
yet we achieve communication complexity of O(m + n).
Finally, we observe that with the construction of fully homomorphic encryption (FHE) schemes
[28], the following “folklore” construction can be executed for any pattern matching algorithm:
Client encrypts its pattern using an FHE scheme and sends it to Server. Server applies the
appropriate pattern matching circuit to the encrypted pattern (where the circuit output is a yes/no
indicating whether a match exists or not), and sends the FHE circuit output to Client. Client
decrypts to obtain the answer. Such a scheme requires O(m) bandwidth, but since FHE schemes
are not yet practical, we view the 5PM protocol outlined here as an efficient and practical solution
to secure pattern matching with single-character wildcards and substring matching.

2 Preliminaries
The rationale behind our secure 5PM protocol is based on a modification of an insecure pattern
matching algorithm (IPM) [29] that can perform exact matching, exact matching with single-
character wildcards, and substring matching within the same algorithm. In Section 3.1, we show
how our modified algorithm can be reduced to basic linear operations whose secure and efficient
evaluation allows us to obtain our 5PM protocol.
2.1 Insecure Pattern Matching (IPM) Algorithm
To illustrate how our modified algorithm works, we begin by describing how it performs exact
matching; we then show how it handles single-character wildcards and substring matching.
2.1.1 Exact Matching. IPM involves the following steps:
a. Inputs: An alphabet $\Sigma$, a text $T \in \Sigma^n$ and a pattern $p \in \Sigma^m$.
b. Initialization: For each character in $\Sigma$, the algorithm constructs a vector, here termed a
Character Delay Vector (CDV), of length equal to the pattern length, m. These vectors
are initialized with zeros. For example, if the pattern is TACT over $\Sigma = \{A, C, G, T\}$, then
the CDVs will be initialized to: CDV(A) = [0, 0, 0, 0], CDV(C) = [0, 0, 0, 0], CDV(G) =
[0, 0, 0, 0] and CDV(T) = [0, 0, 0, 0].
c. Pattern preprocessing: For each pattern character $p_i$ ($i \in \{1, ..., m\}$), a delay value, $d^r_{p_i}$, is
computed to be the number of characters from $p_i$ to the end of the pattern, i.e., $d^r_{p_i} = m - i$
for the rth occurrence of $p_i$ in p. The $d^r_{p_i}$th position of $CDV(p_i)$ is set to 1, with positions
counted from 0. For example, the CDVs of TACT would be:
CDV(A) = [0, 0, 1, 0] because $d^1_A = 4 - 2 = 2$
CDV(C) = [0, 1, 0, 0] because $d^1_C = 4 - 3 = 1$
CDV(G) = [0, 0, 0, 0] because $G \notin p$
CDV(T) = [1, 0, 0, 1] because $d^1_T = 4 - 4 = 0$ and $d^2_T = 4 - 1 = 3$
d. Matching pass and comparison with pattern length: A vector of length n called the Activation
Vector (AV) is constructed, and its elements are initialized with zeros. For each input
text character $T_j$, $CDV(T_j)$ is added element-wise to the AV from position j to position
$\min(n, j + m - 1)$. To determine if there was a pattern match in the text, after these operations
the algorithm checks (when $j \ge m$) if $AV_j = m$. If so, then the match started at position
$j - m + 1$. The value $j - m + 1$ is added to the set of matching positions (MP). Note that
$m - AV_j$ is the non-binary Hamming distance of the pattern and the text starting at position
$j - m + 1$.
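Steps a through d can be sketched directly in code. The following is a plain, insecure illustration of the exact-matching pass as described above; the function names (`build_cdvs`, `activation_vector`, `ipm_exact`) are hypothetical, and the activation vector carries an unused slot at index 0 so that positions stay 1-indexed as in the text.

```python
def build_cdvs(pattern, alphabet):
    """Steps b and c: one Character Delay Vector per alphabet character."""
    m = len(pattern)
    cdvs = {char: [0] * m for char in alphabet}
    for i, char in enumerate(pattern, start=1):
        cdvs[char][m - i] = 1  # delay m - i, positions counted from 0
    return cdvs


def activation_vector(text, pattern, alphabet):
    """Step d: add CDV(T_j) into AV at positions j .. min(n, j + m - 1)."""
    n = len(text)
    cdvs = build_cdvs(pattern, alphabet)
    av = [0] * (n + 1)  # av[0] is unused padding for 1-indexed positions
    for j, char in enumerate(text, start=1):
        for delay, bit in enumerate(cdvs[char]):
            if j + delay <= n:
                av[j + delay] += bit
    return av


def ipm_exact(text, pattern, alphabet="ACGT"):
    """Return MP: every start position j - m + 1 with AV_j equal to m."""
    m = len(pattern)
    av = activation_vector(text, pattern, alphabet)
    return [j - m + 1 for j in range(m, len(text) + 1) if av[j] == m]
```

Calling `build_cdvs("TACT", "ACGT")` reproduces the four vectors from the example in step c, and `ipm_exact("GATACTACT", "TACT")` reports the two overlapping occurrences of the pattern.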
The intuition behind the algorithm is that when an input text character matches a character
in the pattern, the algorithm optimistically assumes that the following characters will correspond
to the rest of the pattern characters. It then adds a 1 at the position in the activation vector
several steps ahead, where it would expect the pattern to end (if the character appears in multiple
positions in the pattern, it adds a 1 to all the corresponding positions where the pattern might
end). If all subsequent characters are indeed characters in the pattern, then at the position where
a pattern would end the number of added 1s will sum up to the pattern length; otherwise the sum
will be strictly less than the pattern length. This algorithm does not incur false positives and always
indicates when (and where) a pattern occurs if it exists, as shown in [29].
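The intuition above can be written as a single identity. The derivation below is a sketch under two notational assumptions that are not in the source: the Iverson bracket $[\cdot]$ (1 if the condition holds, 0 otherwise) and the symbol $d_H$ for Hamming distance. Since $CDV(c)$ has a 1 at position $m - i$ exactly when $p_i = c$, the text character $T_k$ contributes to $AV_j$ only through the entry of $CDV(T_k)$ at position $j - k$:

```latex
AV_j \;=\; \sum_{k=j-m+1}^{j} CDV(T_k)[\,j-k\,]
     \;=\; \sum_{i=1}^{m} \bigl[\, T_{j-m+i} = p_i \,\bigr],
     \qquad m \le j \le n,
\\
d_H\bigl(p,\; T_{j-m+1} \cdots T_j\bigr) \;=\; m - AV_j .
```

The second equality substitutes $i = m - (j - k)$. Each aligned position adds at most 1, so $AV_j = m$ holds exactly when all m aligned characters agree, which is why the sum never signals a match that is not there.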
