
Showing papers by "Nikos Mamoulis" published in 2010


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This work provides a methodology for verifying whether a non-homogeneous generalization violates k-anonymity, and proposes a randomization method that prevents attacks by adversaries who know the anonymization algorithm, showing that k-anonymity is not compromised by it.
Abstract: Most previous research on privacy-preserving data publishing, based on the k-anonymity model, has followed the simplistic approach of homogeneously giving the same generalized value in all quasi-identifiers within a partition. We observe that the anonymization error can be reduced if we follow a non-homogeneous generalization approach for groups of size larger than k. Such an approach would allow tuples within a partition to take different generalized quasi-identifier values. Anonymization following this model is not trivial, as its direct application can easily violate k-anonymity. In addition, non-homogeneous generalization allows for additional types of attack, which should be considered in the process. We provide a methodology for verifying whether a non-homogeneous generalization violates k-anonymity. Then, we propose a technique that generates a non-homogeneous generalization for a partition and show that its result satisfies k-anonymity; however, if the technique is applied straightforwardly, privacy can be compromised when the attacker knows the anonymization algorithm. Based on this, we propose a randomization method that prevents this type of attack and show that k-anonymity is not compromised by it. Non-homogeneous generalization can be used on top of any existing partitioning approach to improve its utility. In addition, we show that a new partitioning technique tailored for non-homogeneous generalization can further improve quality. A thorough experimental evaluation demonstrates that our methodology greatly improves the utility of anonymized data in practice.
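As a point of reference only (not the paper's non-homogeneous technique), the sketch below shows the baseline homogeneous k-anonymity check that the abstract contrasts against: tuples sharing the same generalized quasi-identifier values are grouped, and every group must contain at least k records. The table, attribute names, and generalized values are illustrative.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check the homogeneous k-anonymity condition: every combination of
    (generalized) quasi-identifier values must occur in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Example: ages generalized to ranges, ZIP codes truncated (made-up data).
table = [
    {"age": "20-30", "zip": "104**"},
    {"age": "20-30", "zip": "104**"},
    {"age": "30-40", "zip": "537**"},
    {"age": "30-40", "zip": "537**"},
]
print(is_k_anonymous(table, ["age", "zip"], k=2))  # True
```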

76 citations


Journal ArticleDOI
TL;DR: A scalable approach for probabilistic top-k similarity ranking on uncertain vector data that reduces the computation of rank probabilities from quadratic to linear time, with the same memory requirements, by incrementally accessing the uncertain vector instances in increasing order of their distance to a reference object.
Abstract: This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes, for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution by applying the Poisson binomial recurrence technique of quadratic complexity. In this paper, we show, both theoretically and experimentally, that our framework reduces this to linear-time complexity while having the same memory requirements, facilitated by incremental accessing of the uncertain vector instances in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking for the objects, according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach.
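The abstract names the Poisson binomial recurrence as the existing quadratic-time way to obtain the rank probability distribution; the sketch below implements that standard recurrence (not the paper's linear-time framework). The interpretation of the input probabilities as "each closer instance beats the current object" and the numbers used are illustrative assumptions.

```python
def poisson_binomial(probs):
    """Probability distribution of the number of successes among independent
    Bernoulli trials with success probabilities `probs` (quadratic recurrence)."""
    dist = [1.0]  # dist[j] = P(exactly j successes among the trials seen so far)
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1.0 - p)   # trial fails: success count stays at j
            new[j + 1] += q * p       # trial succeeds: count becomes j + 1
        dist = new
    return dist

# P(an object lands at ranking position j+1) if the three closer objects
# "beat" it with probabilities 0.9, 0.5 and 0.2 (illustrative values).
print(poisson_binomial([0.9, 0.5, 0.2]))  # [0.04, 0.41, 0.46, 0.09]
```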

41 citations


Journal ArticleDOI
TL;DR: Efficient algorithms for optimal assignment are proposed that employ novel edge-pruning strategies based on the spatial properties of the problem, together with approximate solutions that provide a tunable trade-off between result accuracy and computation cost while abiding by theoretical quality guarantees.
Abstract: Consider a set of customers (e.g., WiFi receivers) and a set of service providers (e.g., wireless access points), where each provider has a capacity and the quality of service offered to its customers is inversely proportional to their distance. The Capacity Constrained Assignment (CCA) is a matching between the two sets such that (i) each customer is assigned to at most one provider, (ii) every provider serves no more customers than its capacity, (iii) the maximum possible number of customers are served, and (iv) the sum of Euclidean distances within the assigned provider-customer pairs is minimized. Although max-flow algorithms are applicable to this problem, they require the complete distance-based bipartite graph between the customer and provider sets. For large spatial datasets, this graph is expensive to compute and it may be too large to fit in main memory. Motivated by this fact, we propose efficient algorithms for optimal assignment that employ novel edge-pruning strategies, based on the spatial properties of the problem. Additionally, we develop incremental techniques that maintain an optimal assignment (in the presence of updates) with a processing cost several times lower than CCA recomputation from scratch. Finally, we present approximate (i.e., suboptimal) CCA solutions that provide a tunable trade-off between result accuracy and computation cost, abiding by theoretical quality guarantees. A thorough experimental evaluation demonstrates the efficiency and practicality of the proposed techniques.
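As the abstract notes, max-flow algorithms apply once the complete bipartite distance graph is materialized; the sketch below shows that baseline min-cost max-flow formulation (not the paper's edge-pruning, incremental, or approximate algorithms) using networkx. Coordinates, capacities, and the scaling factor are illustrative, and distances are rounded because the network simplex solver expects integral weights.

```python
import math
import networkx as nx

providers = {"p1": {"pos": (0, 0), "cap": 2}, "p2": {"pos": (5, 5), "cap": 1}}
customers = {"c1": (1, 0), "c2": (0, 2), "c3": (4, 5)}

G = nx.DiGraph()
for p, info in providers.items():
    G.add_edge("source", p, capacity=info["cap"], weight=0)
for c, cpos in customers.items():
    G.add_edge(c, "sink", capacity=1, weight=0)
    for p, info in providers.items():
        d = math.dist(info["pos"], cpos)
        # Scale and round: min-cost flow in networkx expects integer weights.
        G.add_edge(p, c, capacity=1, weight=round(d * 1000))

flow = nx.max_flow_min_cost(G, "source", "sink")
assignment = {c: p for p in providers for c in customers if flow[p].get(c, 0) > 0}
print(assignment)  # e.g. {'c1': 'p1', 'c2': 'p1', 'c3': 'p2'}
```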

36 citations


Proceedings ArticleDOI
06 Jun 2010
TL;DR: A new ranking problem, durable top-k search, is proposed for databases of versioned objects that have different valid instances along a history; the proposed solutions include a technique based on a shared execution paradigm that is more efficient than an NRA-based adaptation, as well as a special indexing technique for archived data.
Abstract: We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval. Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
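To make the problem definition concrete, here is a brute-force baseline (not the NRA-based or shared-execution methods proposed in the paper): compute the top-k at every version in the query interval and keep only the objects present in all of them. The scores and object ids are made up.

```python
def durable_topk(scores_per_version, k):
    """scores_per_version: list of {object_id: score} dicts, one per version in
    the query interval. Returns the objects in the top-k of every version."""
    durable = None
    for scores in scores_per_version:
        topk = set(sorted(scores, key=scores.get, reverse=True)[:k])
        durable = topk if durable is None else durable & topk
    return durable or set()

versions = [
    {"a": 0.9, "b": 0.8, "c": 0.3},
    {"a": 0.7, "b": 0.9, "c": 0.6},
    {"a": 0.8, "c": 0.9, "b": 0.5},
]
print(durable_topk(versions, k=2))  # {'a'}: in the top-2 of all three versions
```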

31 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: This work develops preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases in ad-hoc subsets of a corpus, and evaluates them on New York Times news articles.
Abstract: Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset but relatively infrequent in the overall corpus. We develop preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases in ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.
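One plausible, simplified reading of "interesting" as described above is a relative-frequency ratio: frequent in the ad-hoc subset, relatively infrequent in the overall corpus. The sketch below scores phrases that way; it is not the paper's indexing framework, and the scoring function and data are assumptions.

```python
from collections import Counter

def top_interesting_phrases(subset_docs, corpus_docs, k=5):
    """Rank phrases by relative frequency in the subset vs. the full corpus.
    Assumes subset documents come from the corpus, so every subset phrase
    also has a nonzero corpus count."""
    subset = Counter(p for doc in subset_docs for p in doc)
    corpus = Counter(p for doc in corpus_docs for p in doc)
    total_sub, total_all = sum(subset.values()), sum(corpus.values())
    score = {p: (subset[p] / total_sub) / (corpus[p] / total_all) for p in subset}
    return sorted(score, key=score.get, reverse=True)[:k]

# Documents are pre-tokenized into candidate phrases (illustrative data).
corpus = [["stock market", "interest rate"], ["credit crunch", "stock market"],
          ["world cup", "stock market"]]
subset = [["credit crunch", "stock market"], ["credit crunch", "interest rate"]]
print(top_interesting_phrases(subset, corpus, k=2))
```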

30 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: Two new methods for skyline evaluation in multidimensional data with partially ordered attribute domains are proposed, one inspired by the lattice theorem combined with an off-the-shelf skyline algorithm and one by column stores; both are up to an order of magnitude more efficient than previous work and scale well with different problem parameters.
Abstract: Although there has been a considerable body of work on skyline evaluation in multidimensional data with totally ordered attribute domains, there are only a few methods that consider attributes with partially ordered domains. Existing work maps each partially ordered domain to a total order and then adapts algorithms for totally ordered domains to solve the problem. Nevertheless, these methods either use stronger notions of dominance, which generate false positives, or require expensive dominance checks. In this paper, we propose two new methods which do not have these drawbacks. The first method uses an appropriate mapping of a partial order to a total order, inspired by the lattice theorem, together with an off-the-shelf skyline algorithm. The second technique uses an appropriate storage and indexing approach, inspired by column stores, which enables efficient verification of whether a pair of objects is incomparable. We demonstrate that both our methods are up to an order of magnitude more efficient than previous work and scale well with different problem parameters, such as the complexity of the partial orders.
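Neither of the paper's two methods is shown here; the sketch below is only a plain block-nested-loops skyline with an explicit dominance test over partially ordered attribute domains, to make the setting (and the role of incomparable values) concrete. The attributes and partial orders are illustrative.

```python
def dominates(a, b, better):
    """a dominates b if it is at least as good on every attribute (equal or
    strictly better under that attribute's partial order) and strictly better
    on at least one; incomparable value pairs never contribute to dominance."""
    strictly = False
    for i, rel in enumerate(better):
        if a[i] == b[i]:
            continue
        if (a[i], b[i]) in rel:
            strictly = True
        else:
            return False  # b is better here, or the two values are incomparable
    return strictly

def skyline(points, better):
    return [p for p in points
            if not any(dominates(q, p, better) for q in points if q != p)]

# Attribute 0: price (total order, lower is better). Attribute 1: brand
# preference, a partial order where only 'A' beats 'C'; 'B' is incomparable.
price_better = {(x, y) for x in range(1, 6) for y in range(1, 6) if x < y}
brand_better = {("A", "C")}
hotels = [(1, "C"), (2, "A"), (2, "B"), (3, "C")]
print(skyline(hotels, [price_better, brand_better]))  # drops only (3, 'C')
```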

25 citations


Book ChapterDOI
13 Dec 2010
TL;DR: This paper proposes a new AC-OT construction, secure in the standard model, that supports policies in disjunctive form directly, without the duplication issue of the previous construction.
Abstract: Oblivious Transfer with Access Control (AC-OT) is a protocol which allows a user to obtain a database record with a credential satisfying the access policy of the record, while the database server learns nothing about the record or the credential. The only AC-OT construction that supports policies in disjunctive form requires duplication of records in the database, each with a different conjunction of attributes (representing one possible criterion for accessing the record). In this paper, we propose a new AC-OT construction secure in the standard model. It supports policies in disjunctive form directly, without the above duplication issue. Due to the duplication issue in the previous construction, the size of an encrypted record is O(∏_{i=1}^{t} n_i) for a CNF policy (A_{1,1} ∨ ... ∨ A_{1,n_1}) ∧ ... ∧ (A_{t,1} ∨ ... ∨ A_{t,n_t}), and O(C(n, k)) (n choose k) for a k-of-n threshold gate. In our construction, the encrypted record size is reduced to O(∑_{i=1}^{t} n_i) for the CNF case and O(n) for the threshold case.
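A small worked illustration of the size reduction stated above, comparing the product of clause sizes (record duplication in the previous construction) with their sum (the proposed construction); the clause sizes are made up.

```python
from math import prod

# CNF policy with clause sizes n_1..n_t, e.g. (3 attributes OR'd) AND (4) AND (2).
clause_sizes = [3, 4, 2]

duplication_cost = prod(clause_sizes)  # previous construction: one copy of the
                                       # record per conjunction of attributes = 24
new_cost = sum(clause_sizes)           # proposed construction: 3 + 4 + 2 = 9
print(duplication_cost, new_cost)
```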

23 citations


Proceedings ArticleDOI
23 May 2010
TL;DR: To the authors' knowledge, this is the first optimal and distributed algorithm to solve the 1-median (Fermat node) problem; it saves 30%-85% of the energy compared to previously proposed techniques.
Abstract: We present an optimal distributed algorithm to adapt the placement of a single operator in high communication cost networks, such as a wireless sensor network. Our parameter-free algorithm finds the optimal node to host the operator with minimum communication cost overhead. Three techniques, proposed here, make this feature possible: 1) identifying the special, and most frequent, case where no flooding is needed; otherwise, 2) limiting the neighborhood to be flooded; and 3) variable-speed flooding and eavesdropping. When no flooding is needed, the communication cost overhead for adapting the operator placement is negligible. In addition, our algorithm does not require any extra communication cost while the query is executed. In our experiments we show that for the remaining cases our algorithm saves 30%-85% of the energy compared to previously proposed techniques. To our knowledge, this is the first optimal and distributed algorithm to solve the 1-median (Fermat node) problem.
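For contrast only, the sketch below computes the 1-median (Fermat node) objective by centralized brute force over a small sensor graph, rather than with the distributed, parameter-free algorithm described above. The graph, data rates, and the assumption that the query result is shipped to a sink at unit rate are illustrative.

```python
import networkx as nx

def best_operator_host(G, sources, sink, rate):
    """Pick the node minimizing the sum over input streams of (stream rate x hop
    distance to the candidate host) plus the hop distance from the host to the
    sink; a brute-force 1-median computation over all nodes."""
    best, best_cost = None, float("inf")
    for host in G.nodes:
        dist = nx.single_source_shortest_path_length(G, host)
        cost = sum(rate[s] * dist[s] for s in sources) + dist[sink]
        if cost < best_cost:
            best, best_cost = host, cost
    return best, best_cost

G = nx.grid_2d_graph(4, 4)  # a 4x4 sensor grid (illustrative topology)
sources = [(0, 0), (3, 0), (0, 3)]
rate = {(0, 0): 2.0, (3, 0): 1.0, (0, 3): 1.0}
print(best_operator_host(G, sources, sink=(3, 3), rate=rate))
```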

15 citations


Book ChapterDOI
15 Dec 2010
TL;DR: This paper considers predicate encryption for testing whether the Hamming distance between an attribute X and a target V of length m is equal to (or less than) a threshold t; the proposed schemes achieve ciphertext and token sizes of O(m) for the equality version and a ciphertext size of O(m^{t_max}) for the inequality version.
Abstract: In this paper, we consider the problem of predicate encryption and focus on the predicate for testing whether the Hamming distance between the attribute X of a data item and a target V is equal to (or less than) a threshold t, where X and V are of length m. Existing solutions either do not provide attribute protection or produce a big ciphertext of size O(2^m). For the equality version of the problem, we provide a scheme which is match-concealing (MC) secure and the sizes of the ciphertext and token are both O(m). For the inequality version of the problem, we give a practical scheme, also achieving MC security, which produces a ciphertext with size O(m^{t_max}) if the maximum value of t, t_max, is known in advance and is a constant. We also show how to update the ciphertext if the user wants to increase t_max without constructing the ciphertext from scratch.
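The sketch below spells out only the predicate being tested (Hamming distance equal to, or at most, a threshold t over strings of length m), not the encryption scheme or its security properties; the bit strings are illustrative.

```python
def hamming_distance(x, v):
    """Number of positions at which the two equal-length strings differ."""
    assert len(x) == len(v)
    return sum(a != b for a, b in zip(x, v))

def predicate_equal(x, v, t):
    return hamming_distance(x, v) == t     # the "equality" version

def predicate_at_most(x, v, t):
    return hamming_distance(x, v) <= t     # the "inequality" version

x, v = "10110", "10011"
print(hamming_distance(x, v), predicate_equal(x, v, 2), predicate_at_most(x, v, 3))
# 2 True True
```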

14 citations


Journal ArticleDOI
01 Apr 2010
TL;DR: This paper considers the continuous assignment problem (CAP), where an optimal assignment must be constantly maintained between mobile users and a set of servers, and proposes an algorithm that utilizes the geometric characteristics of the problem and significantly accelerates the initial assignment computation and its subsequent maintenance.
Abstract: Consider a set of servers and a set of users, where each server has a coverage region (i.e., an area of service) and a capacity (i.e., a maximum number of users it can serve). Our task is to assign every user to one server subject to the coverage and capacity constraints. To offer the highest quality of service, we wish to minimize the average distance between users and their assigned server. This is an instance of a well-studied problem in operations research, termed optimal assignment. Even though there exist several solutions for the static case (where user locations are fixed), there is currently no method for dynamic settings. In this paper, we consider the continuous assignment problem (CAP), where an optimal assignment must be constantly maintained between mobile users and a set of servers. The fact that the users are mobile necessitates real-time reassignment so that the quality of service remains high (i.e., their distance from their assigned servers is minimized). The large scale and the time-critical nature of targeted applications require fast CAP solutions. We propose an algorithm that utilizes the geometric characteristics of the problem and significantly accelerates the initial assignment computation and its subsequent maintenance. Our method applies to different cost functions (e.g., average squared distance) and to any Minkowski distance metric (e.g., Euclidean, L1 norm, etc.).
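The last sentence above mentions support for different cost functions and any Minkowski distance metric; the snippet below just spells out those ingredients (an L_p distance and a total assignment cost, optionally squared) for made-up user/server pairs. It is not the paper's assignment-maintenance algorithm.

```python
def minkowski(u, v, p):
    """L_p distance between two points (p=2: Euclidean, p=1: Manhattan)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def assignment_cost(pairs, p=2, squared=False):
    """Total cost of a user -> server assignment under an L_p metric,
    optionally using squared distances as the per-pair cost."""
    costs = (minkowski(u, s, p) for u, s in pairs)
    return sum(c * c for c in costs) if squared else sum(costs)

pairs = [((0, 0), (1, 1)), ((2, 0), (2, 3))]  # (user location, assigned server)
print(assignment_cost(pairs, p=2), assignment_cost(pairs, p=1, squared=True))
```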

11 citations


Journal ArticleDOI
TL;DR: This paper models the structural alignment of proteins as a combinatorial problem and proposes a data-mining approach that treats each geometric-hashing bin as a coincidence group and mines it for frequent patterns, a well-studied technique in data mining.
Abstract: Comparing the 3D structures of proteins is an important but computationally hard problem in bioinformatics. In this paper, we propose studying the problem when much less information is available and fewer assumptions are made. We model the structural alignment of proteins as a combinatorial problem. In the problem, each protein is simply a set of points in the 3D space, without sequence order information, and the objective is to discover all large enough alignments for any subset of the input. We propose a data-mining approach for this problem. We first perform geometric hashing of the structures such that points with similar locations in the 3D space are hashed into the same bin in the hash table. The novelty is that we consider each bin as a coincidence group and mine for frequent patterns, which is a well-studied technique in data mining. We observe that these frequent patterns are already potentially large alignments. Then a simple heuristic is used to extend the alignments if possible. We implemented the algorithm and tested it using real protein structures. The results were compared with those of existing tools. They showed that the algorithm is capable of finding conserved substructures that do not preserve sequence order, especially those existing in protein interfaces. The algorithm can also identify conserved substructures of functionally similar structures within a mixture with dissimilar ones. The running time of the program was smaller than or comparable to that of the existing tools.
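A minimal sketch of the geometric-hashing step described above: 3D points are bucketed by discretizing their coordinates so that points with similar locations fall into the same bin. The cell size and coordinates are illustrative, and the frequent-pattern mining and alignment-extension steps are not shown.

```python
from collections import defaultdict

def geometric_hash(points, cell_size):
    """Bucket 3D points into grid cells; nearby points share a bin."""
    bins = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        bins[key].append(idx)
    return bins

# Two atoms close together and one far away (coordinates are made up).
atoms = [(1.2, 0.4, 3.1), (1.4, 0.5, 3.3), (9.8, 7.7, 0.2)]
for cell, members in geometric_hash(atoms, cell_size=2.0).items():
    print(cell, members)  # the first two points land in the same bin
```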