
Showing papers by "Nikos Mamoulis published in 2008"


Journal ArticleDOI
01 Aug 2008
TL;DR: A new version of the k-anonymity guarantee is defined, the k^m-anonymity, to limit the effects of the data dimensionality, and two efficient algorithms to transform the database are proposed.
Abstract: In this paper we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of transactional data that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the point of view of the adversary. We define a new version of the k-anonymity guarantee, the k^m-anonymity, to limit the effects of the data dimensionality, and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm which finds the optimal solution, however, at a high cost that makes it inapplicable to large, realistic problems. Then, we propose two greedy heuristics, which scale much better and in most cases find a solution close to the optimal. The proposed algorithms are experimentally evaluated using real datasets.
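To make the guarantee concrete, here is a minimal sketch (not the paper's code; function and variable names are illustrative) of what k^m-anonymity checks: any combination of at most m items that an adversary might know must be contained in at least k published transactions.

```python
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """Check k^m-anonymity: every itemset of size <= m that appears in some
    transaction must be contained in at least k transactions overall."""
    for t in transactions:
        for size in range(1, m + 1):
            for subset in combinations(sorted(t), size):
                support = sum(1 for other in transactions
                              if set(subset) <= set(other))
                if support < k:
                    return False
    return True

# Toy example: with k=2, m=1 every single item must occur in >= 2 transactions.
data = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}]
print(is_km_anonymous(data, k=2, m=1))  # True
print(is_km_anonymous(data, k=2, m=2))  # False: {"bread", "milk"} appears only once
```

Generalization (replacing items with coarser categories) is then applied until this predicate holds for the chosen k and m.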

324 citations


Proceedings ArticleDOI
27 Apr 2008
TL;DR: It is shown that partial trajectory knowledge can serve as a quasi-identifier for the remaining locations in the sequence, and a data suppression technique is devised that prevents this type of breach while keeping the published data as accurate as possible.
Abstract: We study the problem of protecting privacy in the publication of location sequences. Consider a database of trajectories, corresponding to movements of people, captured by their transactions when they use credit or RFID debit cards. We show that, if such trajectories are published exactly (by only hiding the identities of persons that followed them), there is a high risk of privacy breach by adversaries who hold partial information about them (e.g., shop owners). In particular, we show that one can use partial trajectory knowledge as a quasi-identifier for the remaining locations in the sequence. We devise a data suppression technique which prevents this type of breach while keeping the published data as accurate as possible.
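The attack model can be made concrete with a hypothetical sketch (names are illustrative, not from the paper): an adversary who knows a partial subsequence of a victim's trajectory counts how many published trajectories are consistent with it; if fewer than k match, the remaining locations are effectively disclosed.

```python
def is_subsequence(partial, full):
    """True if `partial` appears in `full` in order (not necessarily contiguously)."""
    it = iter(full)
    return all(loc in it for loc in partial)

def matching_trajectories(published, known_partial):
    """Trajectories consistent with an adversary's partial knowledge."""
    return [t for t in published if is_subsequence(known_partial, t)]

published = [
    ["home", "cafe", "office", "gym"],
    ["home", "cafe", "mall"],
    ["cafe", "office", "gym"],
]
# An adversary (e.g., the cafe owner) knows the victim visited the cafe and later the gym.
candidates = matching_trajectories(published, ["cafe", "gym"])
print(len(candidates))  # 2 -> the adversary narrows the victim down to 2 trajectories
```

Suppression removes locations from published trajectories until such partial knowledge never pins a victim down to too few candidates.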

306 citations


Journal ArticleDOI
01 May 2008
TL;DR: It is shown, with theoretical evidence, that the B^dual-tree indeed outperforms the B^x-tree in most circumstances, and the technique can effectively answer progressive spatiotemporal queries, which are poorly supported by B^x-trees.
Abstract: Existing spatiotemporal indexes suffer from either large update cost or poor query performance, except for the B^x-tree (the state-of-the-art), which consists of multiple B+-trees indexing the 1D values transformed from the (multi-dimensional) moving objects based on a space-filling curve (Hilbert, in particular). This curve, however, does not consider object velocities, and as a result, query processing with a B^x-tree retrieves a large number of false hits, which seriously compromises its efficiency. It is natural to wonder "can we obtain better performance by capturing also the velocity information, using a Hilbert curve of a higher dimensionality?". This paper provides a positive answer by developing the B^dual-tree, a novel spatiotemporal access method leveraging pure relational methodology. We show, with theoretical evidence, that the B^dual-tree indeed outperforms the B^x-tree in most circumstances. Furthermore, our technique can effectively answer progressive spatiotemporal queries, which are poorly supported by B^x-trees.
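A rough illustration of the core idea (not the authors' code): discretize both location and velocity and interleave their bits into a single 1D key that an ordinary B+-tree can index. A Morton (Z-order) interleaving is used below purely as a stand-in for the Hilbert curve the B^dual-tree actually employs.

```python
def interleave_bits(coords, bits=8):
    """Map a tuple of small non-negative integers to one Morton (Z-order) key."""
    key = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return key

def dual_key(x, y, vx, vy, bits=8):
    # Index a moving object by location *and* velocity (4D -> 1D), so that
    # queries can prune objects whose velocities make them irrelevant.
    return interleave_bits((x, y, vx, vy), bits)

print(dual_key(3, 5, 1, 0))  # a single integer usable as a B+-tree key
```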

118 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: Efficient algorithms for optimal assignment are proposed that employ novel edge-pruning strategies based on the spatial properties of the problem, along with approximate solutions that trade result accuracy for computation cost while abiding by theoretical quality guarantees.
Abstract: Given a point set P of customers (e.g., WiFi receivers) and a point set Q of service providers (e.g., wireless access points), where each q ∈ Q has a capacity q.k, the capacity constrained assignment (CCA) is a matching M ⊆ Q × P such that (i) each point q ∈ Q (p ∈ P) appears at most k times (at most once) in M, (ii) the size of M is maximized (i.e., it comprises min{|P|, ∑q∈Qq.k} pairs), and (iii) the total assignment cost (i.e., the sum of Euclidean distances within all pairs) is minimized. Thus, the CCA problem is to identify the assignment with the optimal overall quality; intuitively, the quality of q's service to p in a given (q, p) pair is inversely proportional to their distance. Although max-flow algorithms are applicable to this problem, they require the complete distance-based bipartite graph between Q and P. For large spatial datasets, this graph is expensive to compute and it may be too large to fit in main memory. Motivated by this fact, we propose efficient algorithms for optimal assignment that employ novel edge-pruning strategies, based on the spatial properties of the problem. Additionally, we develop approximate (i.e., suboptimal) CCA solutions that provide a trade-off between result accuracy and computation cost, abiding by theoretical quality guarantees. A thorough experimental evaluation demonstrates the efficiency and practicality of the proposed techniques.
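As a hedged baseline (not the paper's pruning algorithm), CCA can be solved exactly on small inputs by replicating each provider according to its capacity and running the Hungarian algorithm on the resulting rectangular cost matrix; the sketch below uses scipy's linear_sum_assignment for that step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cca_bruteforce(P, Q, capacities):
    """Optimal capacity-constrained assignment on small inputs.
    P, Q: lists of 2D points; capacities[j] = capacity of provider Q[j]."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    # Replicate each provider as many times as its capacity.
    cols = [j for j, c in enumerate(capacities) for _ in range(c)]
    cost = np.linalg.norm(P[:, None, :] - Q[cols][None, :, :], axis=2)
    rows, picks = linear_sum_assignment(cost)            # minimizes total distance
    return [(p, cols[c]) for p, c in zip(rows, picks)]   # (customer, provider) pairs

customers = [(0, 0), (1, 0), (5, 5)]
providers = [(0, 1), (5, 4)]
print(cca_bruteforce(customers, providers, capacities=[2, 1]))
# [(0, 0), (1, 0), (2, 1)] -> the two nearby customers share provider 0
```

This materializes the full bipartite graph, which is exactly what becomes infeasible for large spatial datasets and what the paper's edge-pruning strategies avoid.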

62 citations


Journal ArticleDOI
TL;DR: This article formally proves that the CHH, in its default binary-hierarchy form, is a simplified variant of a Haar+ tree, and confirms the theoretically expected superiority of Haar+.
Abstract: Hierarchical synopsis structures offer a viable alternative in terms of efficiency and flexibility in relation to traditional summarization techniques such as histograms. Previous research on such structures has mostly focused on a single model, based on the Haar wavelet decomposition. In previous work, we have introduced a more refined, wavelet-inspired hierarchical index structure for synopsis construction: the Haar+ tree. The chief advantages of this structure are twofold. First, it achieves higher synopsis quality at the task of summarizing data sets with sharp discontinuities than state-of-the-art histogram and Haar wavelet techniques. Second, thanks to its search space delimitation capacity, Haar+ synopsis construction operates in time linear in the size of the data set for any monotonic distributive error metric. Contemporaneous research has introduced another hierarchical synopsis structure, the compact hierarchical histogram (CHH). In this article, we elaborate on both these structures. First, we formally prove that the CHH, in its default binary-hierarchy form, is a simplified variant of a Haar+ tree. We then focus on the summarization problem, with both these hierarchical synopsis structures, in which an error guarantee expressed by a maximum-error metric is required. We show that this problem is most efficiently solved through its dual, space-minimization counterpart, which can also achieve optimal quality. In this case, there is a benefit to be gained by specializing the algorithm for each structure; hence, our algorithm for optimal-quality maximum-error CHH requires low polynomial time; on the other hand, optimal-quality Haar+ synopses for maximum-error metrics are constructed in exponential time; hence, we also develop a low-polynomial-time approximation scheme for the maximum-error Haar+ case. Furthermore, we extend our approach for both general-error and maximum-error Haar+ synopses to arbitrary dimensionality. In our experimental study, (i) we confirm the theoretically expected superiority of Haar+ synopses over Haar wavelet methods in both construction time and achieved quality for representative error metrics; (ii) we demonstrate that Haar+ synopses are also constructed faster than optimal plain histograms, and, moreover, achieve higher synopsis quality with highly discontinuous data sets; such an advantage of a hierarchical synopsis structure over a histogram had been intuitively expressed, but never experimentally verified; and (iii) we show that Haar+ synopsis quality supersedes that of a CHH.
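For readers unfamiliar with the baseline that Haar+ refines, here is a plain (unnormalized) Haar wavelet decomposition in a few lines; keeping only the largest coefficients yields the classic wavelet synopsis that Haar+ trees and CHHs improve upon. This is illustrative background, not the authors' algorithm.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition of a list whose length is a power of two.
    Returns [overall average] + detail coefficients, coarse to fine."""
    details = []
    level = list(values)
    while len(level) > 1:
        averages = [(a + b) / 2 for a, b in zip(level[::2], level[1::2])]
        # Detail = half the pairwise difference; dropping small details gives a synopsis.
        details = [(a - b) / 2 for a, b in zip(level[::2], level[1::2])] + details
        level = averages
    return level + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```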

30 citations


01 Jan 2008
TL;DR: A new version of the k-anonymity guarantee is defined, the k^m-anonymity, to limit the effects of the data dimensionality, and an algorithm which finds the optimal solution is developed, albeit at a high cost that makes it inapplicable to large, realistic problems.
Abstract: In this paper we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of transactional data that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the point of view of the adversary. We define a new version of the k-anonymity guarantee, the k^m-anonymity, to limit the effects of the data dimensionality, and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm which finds the optimal solution, however, at a high cost that makes it inapplicable to large, realistic problems. Then, we propose two greedy heuristics, which scale much better and in most cases find a solution close to the optimal. The proposed algorithms are experimentally evaluated using real datasets.

17 citations


Journal ArticleDOI
TL;DR: This paper introduces the problem and proposes several solutions that solve it in main-memory, exploiting space partitioning, and studies an extended form of the query, where objects in one of the two joined sets have a capacity constraint, allowing them to match with multiple objects from the other set.
Abstract: Given two datasets A and B, their exclusive closest pairs (ECP) join is a one-to-one assignment of objects from the two datasets, such that (i) the closest pair (a,b) in A times B is in the result and (ii) the remaining pairs are determined by removing objects a,b from A,B respectively, and recursively searching for the next closest pair. A real application of exclusive closest pairs is the computation of (car, parking slot) assignments. This paper introduces the problem and proposes several solutions that solve it in main-memory, exploiting space partitioning. In addition, we define a dynamic version of the problem, where the objective is to continuously monitor the ECP join solution, in an environment where the joined datasets change positions and content. Finally, we study an extended form of the query, where objects in one of the two joined sets (e.g., parking slots) have a capacity constraint, allowing them to match with multiple objects from the other set (e.g., cars). We show how our techniques can be extended for this variant and compare them with a previous solution to this problem. Experimental results on a system prototype demonstrate the efficiency and applicability of the proposed algorithms.
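The definition translates directly into a (quadratic-time) greedy procedure; the sketch below is only a naive reference implementation, not the space-partitioning algorithms the paper proposes.

```python
from math import dist

def ecp_join(A, B):
    """Exclusive closest pairs: repeatedly take the globally closest (a, b)
    pair and remove both points, until one set is exhausted."""
    A, B = list(A), list(B)
    result = []
    while A and B:
        a, b = min(((a, b) for a in A for b in B), key=lambda ab: dist(*ab))
        result.append((a, b))
        A.remove(a)
        B.remove(b)
    return result

cars = [(0, 0), (2, 0), (9, 9)]
slots = [(1, 0), (9, 8)]
print(ecp_join(cars, slots))  # [((0, 0), (1, 0)), ((9, 9), (9, 8))]
```

Note the one-to-one nature of the result: once slot (1, 0) is taken by the first car, the second car is left unmatched even though that slot is equally close to it.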

14 citations


Proceedings ArticleDOI
07 Apr 2008
TL;DR: The experimental results show that a non-blocking algorithm, which computes intersecting pairs of Voronoi cells on-demand, is very efficient in practice, incurring only slightly higher I/O cost than the theoretical lower bound cost for the problem.
Abstract: We identify and formalize a novel join operator for two spatial pointsets P and Q. The common influence join (CIJ) returns the pairs of points (p, q), p ∈ P, q ∈ Q, such that there exists a location in space being closer to p than to any other point in P and at the same time closer to q than to any other point in Q. In contrast to existing join operators between pointsets (i.e., ε-distance joins and k-closest pairs), CIJ is parameter-free, providing a natural join result that finds application in marketing and decision support. We propose algorithms for the efficient evaluation of CIJ, for pointsets indexed by hierarchical multi-dimensional indexes. We validate the effectiveness and the efficiency of these methods via experimentation with synthetic and real spatial datasets. The experimental results show that a non-blocking algorithm, which computes intersecting pairs of Voronoi cells on-demand, is very efficient in practice, incurring only slightly higher I/O cost than the theoretical lower bound cost for the problem.
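One way to see what qualifies a pair: (p, q) is in the CIJ result exactly when the Voronoi cell of p (with respect to P) overlaps the Voronoi cell of q (with respect to Q). The sampling-based sketch below only approximates that test over an assumed bounded extent and is for intuition; the paper's algorithms intersect the cells exactly through the index.

```python
import random
from math import dist

def nearest(point, points):
    return min(points, key=lambda x: dist(point, x))

def cij_approx(P, Q, samples=20000, extent=10.0):
    """Approximate common influence join: a pair (p, q) qualifies if some sampled
    location has p as its nearest P-point and q as its nearest Q-point."""
    pairs = set()
    for _ in range(samples):
        loc = (random.uniform(0, extent), random.uniform(0, extent))
        pairs.add((nearest(loc, P), nearest(loc, Q)))
    return pairs

P = [(1, 1), (8, 8)]
Q = [(2, 1), (8, 9)]
print(sorted(cij_approx(P, Q)))  # pairs whose influence regions overlap in the sampled extent
```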

13 citations


Proceedings Article
01 Jan 2008

8 citations


Proceedings ArticleDOI
07 Apr 2008
TL;DR: The lattice histogram is introduced: a novel data reduction method that discovers and exploits any arbitrary hierarchy in the data, and achieves approximation quality provably at least as high as an optimal histogram for any data reduction problem.
Abstract: Despite the surge of interest in data reduction techniques over the past years, no method has been proposed to date that can always achieve approximation quality preferable to that of the optimal plain histogram for a target error metric. In this paper, we introduce the lattice histogram (LH): a novel data reduction method that discovers and exploits any arbitrary hierarchy in the data, and achieves approximation quality provably at least as high as an optimal histogram for any data reduction problem. We formulate LH construction techniques with approximation guarantees for general error metrics. We show that the case of minimizing a maximum-error metric can be solved by a specialized, memory-sparing approach; we exploit this solution to design reduced-space heuristics for the general-error case. We develop a mixed synopsis approach, applicable to the space-efficient high-quality summarization of very large data sets. We experimentally corroborate the superiority of LHs in approximation quality over previous techniques with representative error metrics and diverse data sets.

7 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This work designs the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes and develops the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions.
Abstract: We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is "find the 10 combinations of product-type and month with the largest sum of sales". The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing, under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), which exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.
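For orientation, when all group aggregates fit in memory the query reduces to a hash aggregation followed by a heap selection, as in the hypothetical sketch below; the paper's WMSA, RHA, and CGA algorithms address exactly the case where this simple approach breaks down because the number of groups exceeds available memory.

```python
from collections import defaultdict
from heapq import nlargest

def topk_groups(rows, k):
    """rows: iterable of ((group key, ...), measure); returns the k groups with the largest sums."""
    sums = defaultdict(float)
    for key, measure in rows:
        sums[key] += measure          # hash aggregation (assumes all groups fit in memory)
    return nlargest(k, sums.items(), key=lambda kv: kv[1])

sales = [
    (("laptop", "Jan"), 900), (("laptop", "Feb"), 400),
    (("phone", "Jan"), 700), (("phone", "Jan"), 350),
]
print(topk_groups(sales, k=2))
# [(('phone', 'Jan'), 1050.0), (('laptop', 'Jan'), 900.0)]
```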


Proceedings ArticleDOI
25 Mar 2008
TL;DR: Efficient R-tree based algorithms for computing the ring-constrained join are developed by exploiting the characteristics of the geometric constraint, and the results show that the proposed algorithms scale well with data size and have robust performance across different data distributions.
Abstract: We introduce a novel spatial join operator, the ring-constrained join (RCJ). Given two sets P and Q of spatial points, the result of RCJ consists of pairs (p, q) (where p ∈ P, q ∈ Q) satisfying an intuitive geometric constraint: the smallest circle enclosing p and q contains no other points in P, Q. This new operation has important applications in decision support, e.g., placing recycling stations at fair locations between restaurants and residential complexes. Clearly, RCJ is defined based on a geometric constraint and not on distances between points. Thus, our operation is fundamentally different from the conventional distance joins and closest pairs problems. We are not aware of efficient processing algorithms for RCJ in the literature. A brute-force solution requires computational cost quadratic to the input size and does not scale well for large datasets. In view of this, we develop efficient R-tree based algorithms for computing RCJ, by exploiting the characteristics of the geometric constraint. We evaluate experimentally the efficiency of our methods on synthetic and real spatial datasets. The results show that our proposed algorithms scale well with the data size and have robust performance across different data distributions.
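The geometric constraint is easy to state in code: the smallest circle enclosing p and q is the circle whose diameter is the segment pq, so a pair qualifies when no other point of either set lies strictly inside that circle. The naive sketch below is only the brute-force baseline that the paper's R-tree algorithms improve on; names are illustrative.

```python
from math import dist

def rcj_bruteforce(P, Q):
    """Ring-constrained join: keep (p, q) if the circle with diameter pq
    (the smallest circle enclosing both) contains no other point of P or Q."""
    others = list(P) + list(Q)
    result = []
    for p in P:
        for q in Q:
            center = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
            radius = dist(p, q) / 2
            if all(o in (p, q) or dist(o, center) >= radius for o in others):
                result.append((p, q))
    return result

restaurants = [(0, 0), (4, 0)]
residences = [(1, 0), (9, 9)]
print(rcj_bruteforce(restaurants, residences))
# [((0, 0), (1, 0)), ((4, 0), (1, 0)), ((4, 0), (9, 9))]
```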

Proceedings Article
01 Jan 2008

BookDOI
14 Aug 2008

Proceedings Article
01 Jan 2008