
Showing papers by "Nikos Mamoulis published in 2008"


Journal ArticleDOI
01 Aug 2008
TL;DR: A new version of the k-anonymity guarantee is defined, the k^m-anonymity, to limit the effects of the data dimensionality, and two efficient algorithms to transform the database are proposed.
Abstract: In this paper we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of transactional data that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the point of view of the adversary. We define a new version of the k-anonymity guarantee, the k^m-anonymity, to limit the effects of the data dimensionality, and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm which finds the optimal solution, however, at a high cost that makes it inapplicable to large, realistic problems. Then, we propose two greedy heuristics, which scale much better and in most cases find a solution close to the optimal. The proposed algorithms are experimentally evaluated using real datasets.
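To make the guarantee concrete, here is a minimal sketch (not the paper's code; function and variable names are illustrative) of what k^m-anonymity checks: any combination of at most m items that an adversary might know must be contained in at least k published transactions.

```python
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """Check k^m-anonymity: every itemset of size <= m that appears in some
    transaction must be contained in at least k transactions overall."""
    for t in transactions:
        for size in range(1, m + 1):
            for subset in combinations(sorted(t), size):
                support = sum(1 for other in transactions
                              if set(subset) <= set(other))
                if support < k:
                    return False
    return True

# Toy example: with k=2, m=1 every single item must occur in >= 2 transactions.
data = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}]
print(is_km_anonymous(data, k=2, m=1))  # True
print(is_km_anonymous(data, k=2, m=2))  # False: {"bread", "milk"} appears only once
```

Generalization (replacing items with coarser categories) is then applied until this predicate holds for the chosen k and m.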

324 citations


Proceedings ArticleDOI
27 Apr 2008
TL;DR: It is shown that partial trajectory knowledge can serve as a quasi-identifier for the remaining locations in the sequence, and a data suppression technique is devised that prevents this type of breach while keeping the published data as accurate as possible.
Abstract: We study the problem of protecting privacy in the publication of location sequences. Consider a database of trajectories, corresponding to movements of people, captured by their transactions when they use credit or RFID debit cards. We show that, if such trajectories are published exactly (by only hiding the identities of persons that followed them), there is a high risk of privacy breach by adversaries who hold partial information about them (e.g., shop owners). In particular, we show that one can use partial trajectory knowledge as a quasi-identifier for the remaining locations in the sequence. We devise a data suppression technique which prevents this type of breach while keeping the published data as accurate as possible.
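The attack model can be made concrete with a hypothetical sketch (names are illustrative, not from the paper): an adversary who knows a partial subsequence of a victim's trajectory counts how many published trajectories are consistent with it; if fewer than k match, the remaining locations are effectively disclosed.

```python
def is_subsequence(partial, full):
    """True if `partial` appears in `full` in order (not necessarily contiguously)."""
    it = iter(full)
    return all(loc in it for loc in partial)

def matching_trajectories(published, known_partial):
    """Trajectories consistent with an adversary's partial knowledge."""
    return [t for t in published if is_subsequence(known_partial, t)]

published = [
    ["home", "cafe", "office", "gym"],
    ["home", "cafe", "mall"],
    ["cafe", "office", "gym"],
]
# An adversary (e.g., the cafe owner) knows the victim visited the cafe and later the gym.
candidates = matching_trajectories(published, ["cafe", "gym"])
print(len(candidates))  # 2 -> the adversary narrows the victim down to 2 trajectories
```

Suppression removes locations from published trajectories until such partial knowledge never pins a victim down to too few candidates.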

306 citations


Journal ArticleDOI
01 May 2008
TL;DR: It is shown, with theoretical evidence, that the B^dual-tree indeed outperforms the B^x-tree in most circumstances, and the technique can effectively answer progressive spatiotemporal queries, which are poorly supported by B^x-trees.
Abstract: Existing spatiotemporal indexes suffer from either large update cost or poor query performance, except for the B^x-tree (the state-of-the-art), which consists of multiple B+-trees indexing the 1D values transformed from the (multi-dimensional) moving objects based on a space-filling curve (Hilbert, in particular). This curve, however, does not consider object velocities, and as a result, query processing with a B^x-tree retrieves a large number of false hits, which seriously compromises its efficiency. It is natural to wonder "can we obtain better performance by capturing also the velocity information, using a Hilbert curve of a higher dimensionality?". This paper provides a positive answer by developing the B^dual-tree, a novel spatiotemporal access method leveraging pure relational methodology. We show, with theoretical evidence, that the B^dual-tree indeed outperforms the B^x-tree in most circumstances. Furthermore, our technique can effectively answer progressive spatiotemporal queries, which are poorly supported by B^x-trees.
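A rough illustration of the core idea (not the authors' code): discretize both location and velocity and interleave their bits into a single 1D key that an ordinary B+-tree can index. A Morton (Z-order) interleaving is used below purely as a stand-in for the Hilbert curve the B^dual-tree actually employs.

```python
def interleave_bits(coords, bits=8):
    """Map a tuple of small non-negative integers to one Morton (Z-order) key."""
    key = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return key

def dual_key(x, y, vx, vy, bits=8):
    # Index a moving object by location *and* velocity (4D -> 1D), so that
    # queries can prune objects whose velocities make them irrelevant.
    return interleave_bits((x, y, vx, vy), bits)

print(dual_key(3, 5, 1, 0))  # a single integer usable as a B+-tree key
```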

118 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: Efficient algorithms for optimal assignment are proposed that employ novel edge-pruning strategies based on the spatial properties of the problem, along with approximate solutions that trade result accuracy for computation cost while abiding by theoretical quality guarantees.
Abstract: Given a point set P of customers (e.g., WiFi receivers) and a point set Q of service providers (e.g., wireless access points), where each q ∈ Q has a capacity q.k, the capacity constrained assignment (CCA) is a matching M ⊆ Q × P such that (i) each point q ∈ Q (p ∈ P) appears at most k times (at most once) in M, (ii) the size of M is maximized (i.e., it comprises min{|P|, ∑q∈Qq.k} pairs), and (iii) the total assignment cost (i.e., the sum of Euclidean distances within all pairs) is minimized. Thus, the CCA problem is to identify the assignment with the optimal overall quality; intuitively, the quality of q's service to p in a given (q, p) pair is inversely proportional to their distance. Although max-flow algorithms are applicable to this problem, they require the complete distance-based bipartite graph between Q and P. For large spatial datasets, this graph is expensive to compute and it may be too large to fit in main memory. Motivated by this fact, we propose efficient algorithms for optimal assignment that employ novel edge-pruning strategies, based on the spatial properties of the problem. Additionally, we develop approximate (i.e., suboptimal) CCA solutions that provide a trade-off between result accuracy and computation cost, abiding by theoretical quality guarantees. A thorough experimental evaluation demonstrates the efficiency and practicality of the proposed techniques.
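As a hedged baseline (not the paper's pruning algorithm), CCA can be solved exactly on small inputs by replicating each provider according to its capacity and running the Hungarian algorithm on the resulting rectangular cost matrix; the sketch below uses scipy's linear_sum_assignment for that step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cca_bruteforce(P, Q, capacities):
    """Optimal capacity-constrained assignment on small inputs.
    P, Q: lists of 2D points; capacities[j] = capacity of provider Q[j]."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    # Replicate each provider as many times as its capacity.
    cols = [j for j, c in enumerate(capacities) for _ in range(c)]
    cost = np.linalg.norm(P[:, None, :] - Q[cols][None, :, :], axis=2)
    rows, picks = linear_sum_assignment(cost)            # minimizes total distance
    return [(p, cols[c]) for p, c in zip(rows, picks)]   # (customer, provider) pairs

customers = [(0, 0), (1, 0), (5, 5)]
providers = [(0, 1), (5, 4)]
print(cca_bruteforce(customers, providers, capacities=[2, 1]))
# [(0, 0), (1, 0), (2, 1)] -> the two nearby customers share provider 0
```

This materializes the full bipartite graph, which is exactly what becomes infeasible for large spatial datasets and what the paper's edge-pruning strategies avoid.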

62 citations


Journal ArticleDOI
TL;DR: This article formally proves that the CHH, in its default binary-hierarchy form, is a simplified variant of a Haar+ tree, and confirms the theoretically expected superiority of Haar+.
Abstract: Hierarchical synopsis structures offer a viable alternative in terms of efficiency and flexibility in relation to traditional summarization techniques such as histograms. Previous research on such structures has mostly focused on a single model, based on the Haar wavelet decomposition. In previous work, we have introduced a more refined, wavelet-inspired hierarchical index structure for synopsis construction: the Haar+ tree. The chief advantages of this structure are twofold. First, it achieves higher synopsis quality at the task of summarizing data sets with sharp discontinuities than state-of-the-art histogram and Haar wavelet techniques. Second, thanks to its search space delimitation capacity, Haar+ synopsis construction operates in time linear in the size of the data set for any monotonic distributive error metric. Contemporaneous research has introduced another hierarchical synopsis structure, the compact hierarchical histogram (CHH). In this article, we elaborate on both these structures. First, we formally prove that the CHH, in its default binary-hierarchy form, is a simplified variant of a Haar+ tree. We then focus on the summarization problem, with both these hierarchical synopsis structures, in which an error guarantee expressed by a maximum-error metric is required. We show that this problem is most efficiently solved through its dual, space-minimization counterpart, which can also achieve optimal quality. In this case, there is a benefit to be gained by specializing the algorithm for each structure; hence, our algorithm for optimal-quality maximum-error CHH requires low polynomial time; on the other hand, optimal-quality Haar+ synopses for maximum-error metrics are constructed in exponential time; hence, we also develop a low-polynomial-time approximation scheme for the maximum-error Haar+ case. Furthermore, we extend our approach for both general-error and maximum-error Haar+ synopses to arbitrary dimensionality. In our experimental study, (i) we confirm the theoretically expected superiority of Haar+ synopses over Haar wavelet methods in both construction time and achieved quality for representative error metrics; (ii) we demonstrate that Haar+ synopses are also constructed faster than optimal plain histograms, and, moreover, achieve higher synopsis quality with highly discontinuous data sets; such an advantage of a hierarchical synopsis structure over a histogram had been intuitively expressed, but never experimentally verified; and (iii) we show that Haar+ synopsis quality supersedes that of a CHH.
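For readers unfamiliar with the baseline that Haar+ refines, here is a plain (unnormalized) Haar wavelet decomposition in a few lines; keeping only the largest coefficients yields the classic wavelet synopsis that Haar+ trees and CHHs improve upon. This is illustrative background, not the authors' algorithm.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition of a list whose length is a power of two.
    Returns [overall average] + detail coefficients, coarse to fine."""
    details = []
    level = list(values)
    while len(level) > 1:
        averages = [(a + b) / 2 for a, b in zip(level[::2], level[1::2])]
        # Detail = half the pairwise difference; dropping small details gives a synopsis.
        details = [(a - b) / 2 for a, b in zip(level[::2], level[1::2])] + details
        level = averages
    return level + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```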

30 citations


01 Jan 2008
TL;DR: A new version of the k-anonymity guarantee is defined, the k^m-anonymity, to limit the effects of the data dimensionality, and an algorithm which finds the optimal solution is developed, albeit at a high cost that makes it inapplicable to large, realistic problems.
Abstract: In this paper we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of transactional data that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the point of view of the adversary. We define a new version of the k-anonymity guarantee, the k^m-anonymity, to limit the effects of the data dimensionality, and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm which finds the optimal solution, however, at a high cost that makes it inapplicable to large, realistic problems. Then, we propose two greedy heuristics, which scale much better and in most cases find a solution close to the optimal. The proposed algorithms are experimentally evaluated using real datasets.

17 citations


Journal ArticleDOI
TL;DR: This paper introduces the problem and proposes several solutions that solve it in main-memory, exploiting space partitioning, and studies an extended form of the query, where objects in one of the two joined sets have a capacity constraint, allowing them to match with multiple objects from the other set.
Abstract: Given two datasets A and B, their exclusive closest pairs (ECP) join is a one-to-one assignment of objects from the two datasets, such that (i) the closest pair (a,b) in A times B is in the result and (ii) the remaining pairs are determined by removing objects a,b from A,B respectively, and recursively searching for the next closest pair. A real application of exclusive closest pairs is the computation of (car, parking slot) assignments. This paper introduces the problem and proposes several solutions that solve it in main-memory, exploiting space partitioning. In addition, we define a dynamic version of the problem, where the objective is to continuously monitor the ECP join solution, in an environment where the joined datasets change positions and content. Finally, we study an extended form of the query, where objects in one of the two joined sets (e.g., parking slots) have a capacity constraint, allowing them to match with multiple objects from the other set (e.g., cars). We show how our techniques can be extended for this variant and compare them with a previous solution to this problem. Experimental results on a system prototype demonstrate the efficiency and applicability of the proposed algorithms.
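The definition translates directly into a (quadratic-time) greedy procedure; the sketch below is only a naive reference implementation, not the space-partitioning algorithms the paper proposes.

```python
from math import dist

def ecp_join(A, B):
    """Exclusive closest pairs: repeatedly take the globally closest (a, b)
    pair and remove both points, until one set is exhausted."""
    A, B = list(A), list(B)
    result = []
    while A and B:
        a, b = min(((a, b) for a in A for b in B), key=lambda ab: dist(*ab))
        result.append((a, b))
        A.remove(a)
        B.remove(b)
    return result

cars = [(0, 0), (2, 0), (9, 9)]
slots = [(1, 0), (9, 8)]
print(ecp_join(cars, slots))  # [((0, 0), (1, 0)), ((9, 9), (9, 8))]
```

Note the one-to-one nature of the result: once slot (1, 0) is taken by the first car, the second car is left unmatched even though that slot is equally close to it.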

14 citations


Proceedings ArticleDOI
07 Apr 2008
TL;DR: The experimental results show that a non-blocking algorithm, which computes intersecting pairs of Voronoi cells on-demand, is very efficient in practice, incurring only slightly higher I/O cost than the theoretical lower bound cost for the problem.
Abstract: We identify and formalize a novel join operator for two spatial pointsets P and Q. The common influence join (CIJ) returns the pairs of points (p, q), p ∈ P, q ∈ Q, such that there exists a location in space being closer to p than to any other point in P and at the same time closer to q than to any other point in Q. In contrast to existing join operators between pointsets (i.e., ε-distance joins and k-closest pairs), CIJ is parameter-free, providing a natural join result that finds application in marketing and decision support. We propose algorithms for the efficient evaluation of CIJ, for pointsets indexed by hierarchical multi-dimensional indexes. We validate the effectiveness and the efficiency of these methods via experimentation with synthetic and real spatial datasets. The experimental results show that a non-blocking algorithm, which computes intersecting pairs of Voronoi cells on-demand, is very efficient in practice, incurring only slightly higher I/O cost than the theoretical lower bound cost for the problem.
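One way to see what qualifies a pair: (p, q) is in the CIJ result exactly when the Voronoi cell of p (with respect to P) overlaps the Voronoi cell of q (with respect to Q). The sampling-based sketch below only approximates that test over an assumed bounded extent and is for intuition; the paper's algorithms intersect the cells exactly through the index.

```python
import random
from math import dist

def nearest(point, points):
    return min(points, key=lambda x: dist(point, x))

def cij_approx(P, Q, samples=20000, extent=10.0):
    """Approximate common influence join: a pair (p, q) qualifies if some sampled
    location has p as its nearest P-point and q as its nearest Q-point."""
    pairs = set()
    for _ in range(samples):
        loc = (random.uniform(0, extent), random.uniform(0, extent))
        pairs.add((nearest(loc, P), nearest(loc, Q)))
    return pairs

P = [(1, 1), (8, 8)]
Q = [(2, 1), (8, 9)]
print(sorted(cij_approx(P, Q)))  # pairs whose influence regions overlap in the sampled extent
```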

13 citations


Proceedings Article
01 Jan 2008

8 citations


Proceedings ArticleDOI
07 Apr 2008
TL;DR: The lattice histogram is introduced: a novel data reduction method that discovers and exploits any arbitrary hierarchy in the data, and achieves approximation quality provably at least as high as an optimal histogram for any data reduction problem.
Abstract: Despite the surge of interest in data reduction techniques over the past years, no method has been proposed to date that can always achieve approximation quality preferable to that of the optimal plain histogram for a target error metric. In this paper, we introduce the lattice histogram (LH): a novel data reduction method that discovers and exploits any arbitrary hierarchy in the data, and achieves approximation quality provably at least as high as an optimal histogram for any data reduction problem. We formulate LH construction techniques with approximation guarantees for general error metrics. We show that the case of minimizing a maximum-error metric can be solved by a specialized, memory-sparing approach; we exploit this solution to design reduced-space heuristics for the general-error case. We develop a mixed synopsis approach, applicable to the space-efficient high-quality summarization of very large data sets. We experimentally corroborate the superiority of LHs in approximation quality over previous techniques with representative error metrics and diverse data sets.

7 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This work designs the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes and develops the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions.
Abstract: We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is "find the 10 combinations of product-type and month with the largest sum of sales". The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing, under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), which exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.
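For orientation, when all group aggregates fit in memory the query reduces to a hash aggregation followed by a heap selection, as in the hypothetical sketch below; the paper's WMSA, RHA, and CGA algorithms address exactly the case where this simple approach breaks down because the number of groups exceeds available memory.

```python
from collections import defaultdict
from heapq import nlargest

def topk_groups(rows, k):
    """rows: iterable of ((group key, ...), measure); returns the k groups with the largest sums."""
    sums = defaultdict(float)
    for key, measure in rows:
        sums[key] += measure          # hash aggregation (assumes all groups fit in memory)
    return nlargest(k, sums.items(), key=lambda kv: kv[1])

sales = [
    (("laptop", "Jan"), 900), (("laptop", "Feb"), 400),
    (("phone", "Jan"), 700), (("phone", "Jan"), 350),
]
print(topk_groups(sales, k=2))
# [(('phone', 'Jan'), 1050.0), (('laptop', 'Jan'), 900.0)]
```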


Proceedings ArticleDOI
25 Mar 2008
TL;DR: Efficient R-tree based algorithms for computing the ring-constrained join are developed by exploiting the characteristics of the geometric constraint, and the results show that the proposed algorithms scale well with data size and have robust performance across different data distributions.
Abstract: We introduce a novel spatial join operator, the ring-constrained join (RCJ). Given two sets P and Q of spatial points, the result of RCJ consists of pairs (p, q) (where p ∈ P, q ∈ Q) satisfying an intuitive geometric constraint: the smallest circle enclosing p and q contains no other points in P, Q. This new operation has important applications in decision support, e.g., placing recycling stations at fair locations between restaurants and residential complexes. Clearly, RCJ is defined based on a geometric constraint and not on distances between points. Thus, our operation is fundamentally different from the conventional distance joins and closest pairs problems. We are not aware of efficient processing algorithms for RCJ in the literature. A brute-force solution requires computational cost quadratic to the input size and does not scale well for large datasets. In view of this, we develop efficient R-tree based algorithms for computing RCJ, by exploiting the characteristics of the geometric constraint. We evaluate experimentally the efficiency of our methods on synthetic and real spatial datasets. The results show that our proposed algorithms scale well with the data size and have robust performance across different data distributions.
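The geometric constraint is easy to state in code: the smallest circle enclosing p and q is the circle whose diameter is the segment pq, so a pair qualifies when no other point of either set lies strictly inside that circle. The naive sketch below is only the brute-force baseline that the paper's R-tree algorithms improve on; names are illustrative.

```python
from math import dist

def rcj_bruteforce(P, Q):
    """Ring-constrained join: keep (p, q) if the circle with diameter pq
    (the smallest circle enclosing both) contains no other point of P or Q."""
    others = list(P) + list(Q)
    result = []
    for p in P:
        for q in Q:
            center = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
            radius = dist(p, q) / 2
            if all(o in (p, q) or dist(o, center) >= radius for o in others):
                result.append((p, q))
    return result

restaurants = [(0, 0), (4, 0)]
residences = [(1, 0), (9, 9)]
print(rcj_bruteforce(restaurants, residences))
# [((0, 0), (1, 0)), ((4, 0), (1, 0)), ((4, 0), (9, 9))]
```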

Proceedings Article
01 Jan 2008

BookDOI
14 Aug 2008

Proceedings Article
01 Jan 2008