
Showing papers by "Aristides Gionis" published in 2016


Proceedings ArticleDOI
08 Feb 2016
TL;DR: This paper performs a systematic methodological study of controversy detection using social media network structure and content, and finds that a new random-walk-based measure outperforms existing ones in capturing the intuitive notion of controversy.
Abstract: Which topics spark the most heated debates in social media? Identifying these topics is a first step towards creating systems which pierce echo chambers. In this paper, we perform a systematic methodological study of controversy detection using social media network structure and content. Unlike previous work, which identifies controversy in a single hand-picked topic using domain-specific knowledge, we focus on comparing topics in any domain. Our approach to quantifying controversy is a graph-based three-stage pipeline, which involves (i) building a conversation graph about a topic, which represents alignment of opinion among users; (ii) partitioning the conversation graph to identify potential sides of the controversy; and (iii) measuring the amount of controversy from characteristics of the graph. We perform an extensive comparison of controversy measures, as well as graph building approaches and data sources. We use both controversial and non-controversial topics on Twitter, as well as other external datasets. We find that our new random-walk-based measure outperforms existing ones in capturing the intuitive notion of controversy, and show that content features are vastly less helpful in this task.
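A minimal Monte Carlo sketch of how such a random-walk controversy score can be computed, assuming the conversation graph has already been partitioned into two sides X and Y. The function name rwc_score, the absorption at the k highest-degree nodes of each side, and the combination P_XX * P_YY - P_XY * P_YX are illustrative assumptions, not necessarily the paper's exact measure:

    import random
    import networkx as nx

    def rwc_score(G, X, Y, k=10, walks=2000, max_steps=10000):
        # High-degree "authoritative" nodes on each side act as absorbing states.
        top_x = set(sorted(X, key=G.degree, reverse=True)[:k])
        top_y = set(sorted(Y, key=G.degree, reverse=True)[:k])

        def absorption_probs(start_side):
            start_side = list(start_side)
            hits_x = hits_y = 0
            for _ in range(walks):
                node = random.choice(start_side)
                for _ in range(max_steps):
                    if node in top_x:
                        hits_x += 1
                        break
                    if node in top_y:
                        hits_y += 1
                        break
                    nbrs = list(G.neighbors(node))
                    if not nbrs:
                        break
                    node = random.choice(nbrs)
            total = hits_x + hits_y
            return (hits_x / total, hits_y / total) if total else (0.0, 0.0)

        p_xx, p_xy = absorption_probs(X)
        p_yx, p_yy = absorption_probs(Y)
        # Scores near 1 indicate two well-separated sides; near 0, no controversy.
        return p_xx * p_yy - p_xy * p_yx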

123 citations


Journal ArticleDOI
TL;DR: This paper reformulates the problem definition in a way that it is able to obtain an algorithm with constant-factor approximation guarantee, and presents a new approach that improves over the existing techniques, both in theory and practice.
Abstract: Finding dense subgraphs is an important problem in graph mining and has many practical applications. At the same time, while large real-world networks are known to have many communities that are not well-separated, the majority of the existing work focuses on the problem of finding a single densest subgraph. Hence, it is natural to consider the question of finding the top-k densest subgraphs. One major challenge in addressing this question is how to handle overlaps: eliminating overlaps completely is one option, but this may lead to extracting subgraphs not as dense as would be possible by allowing a limited amount of overlap. Furthermore, overlaps are desirable as in most real-world graphs there are vertices that belong to more than one community, and thus, to more than one densest subgraph. In this paper we study the problem of finding top-k overlapping densest subgraphs, and we present a new approach that improves over the existing techniques, both in theory and practice. First, we reformulate the problem definition in a way that we are able to obtain an algorithm with constant-factor approximation guarantee. Our approach relies on using techniques for solving the max-sum diversification problem, which, however, we need to extend in order to make them applicable to our setting. Second, we evaluate our algorithm on a collection of benchmark datasets and show that it convincingly outperforms the previous methods, both in terms of quality and efficiency.
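For context, a brief sketch of the single densest-subgraph primitive that top-k variants build on: Charikar's greedy peeling, which repeatedly removes a minimum-degree vertex and keeps the densest intermediate subgraph. This is only the basic building block, not the paper's overlapping top-k algorithm.

    import networkx as nx

    def densest_subgraph(G):
        """Greedy peeling: 2-approximation for the maximum average-degree subgraph."""
        H = G.copy()
        best_density, best_nodes = 0.0, set(H.nodes())
        while H.number_of_nodes() > 0:
            density = H.number_of_edges() / H.number_of_nodes()  # = average degree / 2
            if density > best_density:
                best_density, best_nodes = density, set(H.nodes())
            v = min(H.nodes(), key=H.degree)   # peel a minimum-degree vertex
            H.remove_node(v)
        return best_nodes, best_density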

56 citations


Proceedings ArticleDOI
13 Aug 2016
TL;DR: In this paper, a prediction model is developed to identify the next line of existing lyrics from a set of candidate next lines; the model is then employed to combine lines from existing songs, producing lyrics with rhyme and meaning.
Abstract: Writing rap lyrics requires both creativity to construct a meaningful, interesting story and lyrical skills to produce complex rhyme patterns, which form the cornerstone of good flow. We present a rap lyrics generation method that captures both of these aspects. First, we develop a prediction model to identify the next line of existing lyrics from a set of candidate next lines. This model is based on two machine-learning techniques: the RankSVM algorithm and a deep neural network model with a novel structure. Results show that the prediction model can identify the true next line among 299 randomly selected lines with an accuracy of 17%, i.e., over 50 times more likely than random chance. Second, we employ the prediction model to combine lines from existing songs, producing lyrics with rhyme and meaning. An evaluation of the produced lyrics shows that in terms of quantitative rhyme density, the method outperforms the best human rappers by 21%. The rap lyrics generator has been deployed as an online tool called DeepBeat, and the performance of the tool has been assessed by analyzing its usage logs. This analysis shows that machine-learned rankings correlate with user preferences.
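As a rough illustration of the rhyme-density idea mentioned above: for each line, take the length of the longest vowel-sequence suffix it shares with one of the preceding lines, and average over lines. The paper works on phonetic transcriptions; the plain letter vowels and the two-line lookback window below are simplifying assumptions for illustration only.

    VOWELS = set("aeiouy")

    def vowel_seq(line):
        return [c for c in line.lower() if c in VOWELS]

    def suffix_match(a, b):
        # Length of the longest common suffix of two vowel sequences.
        n = 0
        while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
            n += 1
        return n

    def rhyme_density(lines, window=2):
        seqs = [vowel_seq(l) for l in lines]
        scores = [max((suffix_match(seqs[i], seqs[j])
                       for j in range(max(0, i - window), i)), default=0)
                  for i in range(1, len(seqs))]
        return sum(scores) / len(scores) if scores else 0.0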

55 citations


Proceedings ArticleDOI
13 Aug 2016
TL;DR: CulT, a scalable and effective algorithm for reconstructing epidemics that is also suited to online settings, is developed by formulating the problem as a temporal Steiner-tree computation, for which a fast algorithm leveraging the specific problem structure is designed.
Abstract: We consider the problem of reconstructing an epidemic over time, or, more generally, reconstructing the propagation of an activity in a network. Our input consists of a temporal network, which contains information about when two nodes interacted, and a sample of nodes that have been reported as infected. The goal is to recover the flow of the spread, including discovering the starting nodes, and identifying other likely-infected nodes that are not reported. The problem we consider has multiple applications, from public health to social media and viral marketing purposes. Previous work explicitly factors in many unrealistic assumptions: it is assumed that (a) the underlying network does not change; (b) we have access to perfect noise-free data; or (c) we know the exact propagation model. In contrast, we avoid these simplifications: we take into account the temporal network, we require only a small sample of reported infections, and we do not make any restrictive assumptions about the propagation model. We develop CulT, a scalable and effective algorithm to reconstruct epidemics that is also suited for online settings. CulT works by formulating the problem as that of a temporal Steiner-tree computation, for which we design a fast algorithm leveraging the specific problem structure. We demonstrate the efficacy of the proposed approach through extensive experiments on diverse datasets.
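The temporal aspect can be made concrete with a small sketch of time-respecting reachability, the primitive underlying any temporal-tree reconstruction: an interaction (u, v, t) can pass the activity from u to v only if u was already reached by time t. This is only the building block, not the CulT algorithm itself.

    def temporal_reach(interactions, seed, start_time=0):
        """interactions: list of (u, v, t) tuples, assumed sorted by time t."""
        reached = {seed: start_time}  # node -> earliest possible infection time
        for u, v, t in interactions:
            if u in reached and reached[u] <= t and t < reached.get(v, float("inf")):
                reached[v] = t
            if v in reached and reached[v] <= t and t < reached.get(u, float("inf")):
                reached[u] = t
        return reached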

52 citations


Posted Content
TL;DR: Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.
Abstract: We introduce and study the problem of computing the similarity self-join in a streaming context (SSSJ), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.
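A small sketch of the time-dependent similarity idea: the raw similarity of two items is damped by the gap between their arrival times, so sufficiently old items can never exceed the threshold and can be evicted from the index. The exponential decay and the cosine kernel below are illustrative assumptions; the paper's exact decay function may differ.

    import math

    def time_similarity(x, y, tx, ty, lam=0.01):
        # Cosine similarity of vectors x, y damped by their arrival-time gap.
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        cos = dot / (norm_x * norm_y) if norm_x and norm_y else 0.0
        return cos * math.exp(-lam * abs(tx - ty))

    def max_useful_gap(threshold, lam=0.01):
        # Pairs further apart in time than this can never pass the threshold,
        # which is what allows a streaming index to forget old items.
        return -math.log(threshold) / lam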

22 citations


Journal ArticleDOI
01 Jun 2016
TL;DR: In this article, the authors introduce the notion of time-dependent similarity, where the similarity of two items decreases with the difference in their arrival time, and design two algorithmic frameworks to solve the similarity self-join in a streaming context.
Abstract: We introduce and study the problem of computing the similarity self-join in a streaming context (SSSJ), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.

21 citations


Posted Content
TL;DR: The main contribution is an efficient algorithm for learning a kernel matrix using the log-determinant divergence subject to a set of relative-distance constraints; the learned kernel matrix can be employed by many different kernel methods in a wide range of applications.
Abstract: We consider the problem of metric learning subject to a set of constraints on relative-distance comparisons between the data items. Such constraints are meant to reflect side-information that is not expressed directly in the feature vectors of the data items. The relative-distance constraints used in this work are particularly effective in expressing structures at a finer level of detail than must-link (ML) and cannot-link (CL) constraints, which are most commonly used for semi-supervised clustering. Relative-distance constraints are thus useful in settings where providing an ML or a CL constraint is difficult because the granularity of the true clustering is unknown. Our main contribution is an efficient algorithm for learning a kernel matrix using the log-determinant divergence, a variant of the Bregman divergence, subject to a set of relative-distance constraints. The learned kernel matrix can then be employed by many different kernel methods in a wide range of applications. In our experimental evaluations, we consider a semi-supervised clustering setting and show empirically that kernels found by our algorithm yield clusterings of higher quality than existing approaches that either use ML/CL constraints or a different means to implement the supervision using relative comparisons.
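For reference, the standard form of the log-determinant (Burg) matrix divergence between positive-definite matrices, together with a kernelized relative-distance constraint; the precise margins and slack variables used in the paper may differ:

    D_{\mathrm{ld}}(K, K_0) = \operatorname{tr}(K K_0^{-1}) - \log\det(K K_0^{-1}) - n,
    \qquad d_K(i,j) = K_{ii} + K_{jj} - 2K_{ij},
    \qquad \text{``$i$ is closer to $j$ than to $k$'': } d_K(i,j) \le d_K(i,k).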

10 citations


Proceedings ArticleDOI
27 Feb 2016
TL;DR: Among the topics discussed on social media, some spark more heated debate than others; experience suggests that major political events, such as a vote for healthcare law in the US, spark more debate between opposing sides than other events, such as a concert of a popular music band.
Abstract: Among the topics discussed on social media, some spark more heated debate than others. For example, experience suggests that major political events, such as a vote for healthcare law in the US, would spark more debate between opposing sides than other events, such as a concert of a popular music band. Exploring the topics of discussion on Twitter and understanding which ones are controversial is extremely useful for a variety of purposes, such as for journalists to understand what issues divide the public, or for social scientists to understand how controversy is manifested in social interactions.

9 citations


Posted Content
01 Nov 2016
TL;DR: This paper presents a simple model, based on a recently-developed user-level controversy score, that is competitive with state-of-the-art link-prediction algorithms, and proposes an efficient algorithm that considers only a fraction of all the possible edge combinations.

7 citations


Proceedings ArticleDOI
01 Dec 2016
TL;DR: It is demonstrated that, when fixing the number of roles to be used, the proposed role-discovery problem is NP-hard, while another (seemingly easier) version of the problem is NP-hard to approximate.
Abstract: We provide a new formulation for the problem of role discovery in graphs. Our definition is structural: two vertices should be assigned to the same role if the roles of their neighbors, when viewed as multi-sets, are similar enough. An attractive characteristic of our approach is that it is based on optimizing a well-defined objective function, and thus, contrary to previous approaches, the role-discovery task can be studied with the tools of combinatorial optimization. We demonstrate that, when fixing the number of roles to be used, the proposed role-discovery problem is NP-hard, while another (seemingly easier) version of the problem is NP-hard to approximate. On the positive side, despite the recursive nature of our objective function, we can show that finding a perfect (zero-cost) role assignment with the minimum number of roles can be solved in polynomial time. We do this by connecting the zero-cost role assignment with the notion of equitable partition. For the more practical version of the problem with a fixed number of roles we present two natural heuristic methods, and discuss how to make them scalable to large graphs.
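The connection to equitable partitions can be illustrated with a colour-refinement-style iteration: vertices receive the same role exactly when the multisets of their neighbours' roles coincide, and iterating from a single initial role converges to the coarsest equitable partition. This sketch covers the zero-cost case only; the paper's heuristics for a fixed number of roles are more involved.

    def refine_roles(adj, max_iters=100):
        """adj: dict mapping each node to a list of its neighbours."""
        roles = {v: 0 for v in adj}                 # start with a single role
        for _ in range(max_iters):
            # Signature = own role plus the multiset of neighbour roles.
            sig = {v: (roles[v], tuple(sorted(roles[u] for u in adj[v])))
                   for v in adj}
            relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
            new_roles = {v: relabel[sig[v]] for v in adj}
            if new_roles == roles:                  # fixed point reached
                break
            roles = new_roles
        return roles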

7 citations


Journal ArticleDOI
TL;DR: The BackboneDiscovery problem is introduced, which encapsulates both functional and structural aspects of network analysis; it is observed that, for many real-world networks, the algorithm produces a backbone with a small subset of the edges that supports a large percentage of the network activity.
Abstract: We introduce a new computational problem, the BackboneDiscovery problem, which encapsulates both functional and structural aspects of network analysis. While the topology of a typical road network has been available for a long time (e.g., through maps), it is only recently that fine-granularity functional (activity and usage) information about the network (such as source-destination traffic information) is being collected and is readily available. The combination of functional and structural information provides an efficient way to explore and understand usage patterns of networks and aid in design and decision making. We propose efficient algorithms for the BackboneDiscovery problem including a novel use of edge centrality. We observe that for many real-world networks, our algorithm produces a backbone with a small subset of the edges that support a large percentage of the network activity.
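An illustrative sketch of the centrality ingredient: rank edges by betweenness centrality and keep the top fraction as a candidate backbone. The paper's algorithm also uses the observed source-destination traffic, which is omitted here; the function below is only a simplified stand-in.

    import networkx as nx

    def centrality_backbone(G, keep_fraction=0.1):
        ebc = nx.edge_betweenness_centrality(G)
        ranked = sorted(ebc, key=ebc.get, reverse=True)
        k = max(1, int(keep_fraction * len(ranked)))
        return ranked[:k]   # edges forming the candidate backbone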

Book ChapterDOI
19 Sep 2016
TL;DR: The problem of mining online communication data and finding top-k temporal events is considered, where a temporal event is a coherent topic that is discussed frequently in a relatively short time span, while its information flow respects the underlying network.
Abstract: With the increasing use of online communication platforms, such as email, Twitter, and messaging applications, we are faced with a growing amount of data that combine content (what is said), time (when), and user (by whom) information. Discovering meaningful patterns and understanding what is happening in these data is an important challenge. We consider the problem of mining online communication data and finding top-k temporal events. A temporal event is a coherent topic that is discussed frequently in a relatively short time span, while its information flow respects the underlying network.

Book ChapterDOI
19 Apr 2016
TL;DR: This work proposes efficient algorithms for the BackboneDiscovery problem including a novel use of edge centrality and observes that for many real-world networks, the algorithm produces a backbone with a small subset of the edges that support a large percentage of the network activity.
Abstract: We introduce a new computational problem, the BackboneDiscovery problem, which encapsulates both functional and structural aspects of network analysis. While the topology of a typical road network has been available for a long time (e.g., through maps), it is only recently that fine-granularity functional (activity and usage) information about the network (like source-destination traffic information) is being collected and is readily available. The combination of functional and structural information provides an efficient way to explore and understand usage patterns of networks and aid in design and decision making. We propose efficient algorithms for the BackboneDiscovery problem including a novel use of edge centrality. We observe that for many real-world networks, our algorithm produces a backbone with a small subset of the edges that support a large percentage of the network activity.

Posted Content
TL;DR: In this article, the authors proposed an edge recommendation algorithm to reduce controversy in an endorsement graph, which considers only a fraction of all the combinations of possible edges and considers the acceptance probability of the recommended edge, which represents how likely the edge is to materialize in the graph.
Abstract: Society is often polarized by controversial issues that split the population into groups of opposing views. When such issues emerge on social media, we often observe the creation of 'echo chambers', i.e., situations where like-minded people reinforce each other's opinion, but do not get exposed to the views of the opposing side. In this paper we study algorithmic techniques for bridging these chambers, and thus, reducing controversy. Specifically, we represent the discussion on a controversial issue with an endorsement graph, and cast our problem as an edge-recommendation problem on this graph. The goal of the recommendation is to reduce the controversy score of the graph, which is measured by a recently-developed metric based on random walks. At the same time, we take into account the acceptance probability of the recommended edge, which represents how likely the edge is to materialize in the endorsement graph. We propose a simple model, based on a recently-developed user-level controversy score, that is competitive with state-of-the-art link-prediction algorithms. We thus aim at finding the edges that produce the largest reduction in the controversy score, in expectation. To solve this problem, we propose an efficient algorithm, which considers only a fraction of all the combinations of possible edges. Experimental results show that our algorithm is more efficient than a simple greedy heuristic, while producing comparable score reduction. Finally, a comparison with other state-of-the-art edge-addition algorithms shows that this problem is fundamentally different from what has been studied in the literature.
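The objective above can be sketched as a small scoring loop: among a set of candidate edges, pick the one with the largest expected reduction of the controversy score, i.e. acceptance probability times the drop in the score if the edge is added. Here controversy and accept_prob are placeholders for the paper's random-walk metric and user-level acceptance model, and candidates stands for the restricted set of edge combinations the algorithm considers.

    def best_edge(G, candidates, controversy, accept_prob):
        """candidates: iterable of non-existing edges (u, v) to evaluate."""
        base = controversy(G)
        best, best_gain = None, 0.0
        for u, v in candidates:
            G.add_edge(u, v)
            gain = accept_prob(u, v) * (base - controversy(G))
            G.remove_edge(u, v)
            if gain > best_gain:
                best, best_gain = (u, v), gain
        return best, best_gain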

Proceedings ArticleDOI
24 Oct 2016
TL;DR: This paper mines personal communication data with the goal of generating skill endorsements of the type "person A endorses person B on skill X" and studies two different approaches, one based on building a skill graph, and one based on information retrieval techniques.
Abstract: People are increasingly communicating and collaborating via digital platforms, such as email and messaging applications. Data exchanged on these digital communication platforms can be a treasure trove of information on people who participate in the discussions: who they are collaborating with, what they are working on, what their expertise is, and so on. Yet, personal communication data is very rarely analyzed due to the sensitivity of the information it contains. In this paper, we mine personal communication data with the goal of generating skill endorsements of the type "person A endorses person B on skill X." To address privacy concerns, we consider that each person has access only to their own data (i.e., conversations with their peers). By using our method, they can generate endorsements for their peers, which they can inspect and opt to publish. To identify meaningful skills we use a knowledge base created from the StackExchange Q&A forum. We study two different approaches, one based on building a skill graph, and one based on information retrieval techniques. We find that the latter approach outperforms the graph-based algorithms when tested on a dataset of user profiles from StackOverflow. We also conduct a user study on email data from nine volunteers, and we find that the information retrieval-based approach achieves a MAP@10 score of 0.617.
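A rough sketch of the information-retrieval flavour of the approach: treat each peer's messages as one document and score skills (e.g. drawn from a StackExchange-derived vocabulary) by TF-IDF. Function names and scoring details are illustrative assumptions, not the paper's exact model; multi-word skills would additionally need an n-gram analyzer.

    from sklearn.feature_extraction.text import TfidfVectorizer

    def endorse(messages_by_peer, skills, top_n=10):
        peers = list(messages_by_peer)
        docs = [" ".join(messages_by_peer[p]) for p in peers]
        vec = TfidfVectorizer(vocabulary=[s.lower() for s in skills])
        scores = vec.fit_transform(docs)            # peers x skills matrix
        vocab = vec.get_feature_names_out()
        endorsements = {}
        for i, p in enumerate(peers):
            row = scores[i].toarray().ravel()
            top = row.argsort()[::-1][:top_n]
            endorsements[p] = [(vocab[j], row[j]) for j in top if row[j] > 0]
        return endorsements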

Posted Content
TL;DR: In this paper, the authors introduce two relative query strategies, TopMatchings and GibbsMatchings, which can be applied on top of any network alignment method that constructs and solves a bipartite matching problem.
Abstract: Network alignment is the problem of matching the nodes of two graphs, maximizing the similarity of the matched nodes and the edges between them. This problem is encountered in a wide array of applications, from biological networks to social networks to ontologies, where multiple networked data sources need to be integrated. Due to the difficulty of the task, an accurate alignment can rarely be found without human assistance. Thus, it is of great practical importance to develop network alignment algorithms that can optimally leverage experts who are able to provide the correct alignment for a small number of nodes. Yet, only a handful of existing works address this active network alignment setting. The majority of the existing active methods focus on absolute queries ("are nodes $a$ and $b$ the same or not?"), whereas we argue that it is generally easier for a human expert to answer relative queries ("which node in the set $\{b_1, \ldots, b_n\}$ is the most similar to node $a$?"). This paper introduces two novel relative-query strategies, TopMatchings and GibbsMatchings, which can be applied on top of any network alignment method that constructs and solves a bipartite matching problem. Our methods identify the most informative nodes to query by sampling the matchings of the bipartite graph associated to the network-alignment instance. We compare the proposed approaches to several commonly-used query strategies and perform experiments on both synthetic and real-world datasets. Our sampling-based strategies yield the highest overall performance, outperforming all the baseline methods by more than 15 percentage points in some cases. In terms of accuracy, TopMatchings and GibbsMatchings perform comparably. GibbsMatchings, however, is significantly more scalable, although it requires tuning a temperature hyperparameter.
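A toy sketch of query selection in this setting: given a node-similarity matrix between the two networks, ask the expert about the node whose best candidate wins by the smallest margin over the runner-up, offering its top candidates as a relative ("which of these?") question. This is only a simple uncertainty baseline for illustration; TopMatchings and GibbsMatchings instead sample matchings of the bipartite graph.

    import numpy as np

    def next_relative_query(sim, n_candidates=5):
        """sim: (n x m) similarity matrix between nodes of the two networks, m >= 2."""
        order = np.argsort(-sim, axis=1)             # candidate ranking per node
        rows = np.arange(sim.shape[0])
        margins = sim[rows, order[:, 0]] - sim[rows, order[:, 1]]
        q = int(np.argmin(margins))                  # most ambiguous node
        return q, order[q, :n_candidates].tolist()   # node and its candidate set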

Posted Content
TL;DR: This paper studies the top-k densest subgraph problem in the sliding-window model and proposes an efficient fully-dynamic algorithm that profits from the observation that updates only affect a limited region of the graph.
Abstract: Given a large graph, the densest-subgraph problem asks to find a subgraph with maximum average degree. When considering the top-k version of this problem, a naive solution is to iteratively find the densest subgraph and remove it in each iteration. However, such a solution is impractical due to high processing cost. The problem is further complicated when dealing with dynamic graphs, since adding or removing an edge requires re-running the algorithm. In this paper, we study the top-k densest subgraph problem in the sliding-window model and propose an efficient fully-dynamic algorithm. The input of our algorithm consists of an edge stream, and the goal is to find the node-disjoint subgraphs that maximize the sum of their densities. In contrast to existing state-of-the-art solutions that require iterating over the entire graph upon any update, our algorithm profits from the observation that updates only affect a limited region of the graph. Therefore, the top-k densest subgraphs are maintained by only applying local updates. We provide a theoretical analysis of the proposed algorithm and show empirically that the algorithm often generates denser subgraphs than state-of-the-art competitors. Experiments show an improvement in efficiency of three to five orders of magnitude compared to state-of-the-art solutions.

Posted Content
TL;DR: In this paper, the problem of mining online communication data and finding top-k temporal events is considered, where a temporal event is defined as a coherent topic that is discussed frequently, in a relatively short time span, while the information flow of the event respects the underlying network structure.
Abstract: With the increasing use of online communication platforms, such as email, Twitter, and messaging applications, we are faced with a growing amount of data that combine content (what is said), time (when), and user (by whom) information. An important computational challenge is to analyze these data, discover meaningful patterns, and understand what is happening. We consider the problem of mining online communication data and finding top-k temporal events. We define a temporal event to be a coherent topic that is discussed frequently, in a relatively short time span, while the information flow of the event respects the underlying network structure. We construct our model for detecting temporal events in two steps. We first introduce the notion of interaction meta-graph, which connects associated interactions. Using this notion, we define a temporal event to be a subset of interactions that (i) are topically and temporally close and (ii) correspond to a tree that captures the information flow. Finding the best temporal event leads to a budgeted version of the prize-collecting Steiner-tree (PCST) problem, which we solve using three different methods: a greedy approach, a dynamic-programming algorithm, and an adaptation of an existing approximation algorithm. The problem of finding the top-k events among a set of candidate events maps to the maximum set-cover problem, and is thus solved greedily. We compare and analyze our algorithms on both synthetic and real datasets, such as Twitter and email communication. The results show that our methods are able to detect meaningful temporal events.
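The final top-k selection step can be sketched as greedy maximum coverage: repeatedly pick the candidate event that covers the most interactions not yet covered. The candidate events themselves come from the Steiner-tree step, which is not shown here; representing an event as a set of interaction ids is an assumption for illustration.

    def top_k_events(candidates, k):
        """candidates: list of (event_id, set_of_interaction_ids) pairs."""
        covered, chosen = set(), []
        for _ in range(min(k, len(candidates))):
            best = max(candidates, key=lambda c: len(c[1] - covered))
            if not best[1] - covered:                # nothing new covered; stop
                break
            chosen.append(best[0])
            covered |= best[1]
        return chosen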