Showing papers in "ACM Transactions on The Web in 2019"

PDF

Open Access

Journal Article•DOI•

Polarization and Fake News: Early Warning of Potential Misinformation Targets

[...]

Michela Del Vicario¹, Walter Quattrociocchi², Antonio Scala³, Fabiana Zollo²•Institutions (3)

IMT Institute for Advanced Studies Lucca¹, Ca' Foscari University of Venice², Sapienza University of Rome³

27 Mar 2019-ACM Transactions on The Web

TL;DR: In this article, the authors introduce a framework for promptly identifying polarizing content on social media and thus predicting future fake news topics, based on a series of characteristics related to users' behavior on online social media.

...read moreread less

Abstract: Users’ polarization and confirmation bias play a key role in misinformation spreading on online social media. Our aim is to use this information to determine in advance potential targets for hoaxes and fake news. In this article, we introduce a framework for promptly identifying polarizing content on social media and, thus, “predicting” future fake news topics. We validate the performances of the proposed methodology on a massive Italian Facebook dataset, showing that we are able to identify topics that are susceptible to misinformation with 77% accuracy. Moreover, such information may be embedded as a new feature in an additional classifier able to recognize fake news with 91% accuracy. The novelty of our approach consists in taking into account a series of characteristics related to users’ behavior on online social media such as Facebook, making a first, important step towards the mitigation of misinformation phenomena by supporting the identification of potential misinformation targets and thus the design of tailored counter-narratives.

...read moreread less

185 citations

Journal Article•DOI•

Detecting Cyberbullying and Cyberaggression in Social Media

[...]

Despoina Chatzakou, Ilias Leontiadis¹, Jeremy Blackburn², Emiliano De Cristofaro³, Gianluca Stringhini⁴, Athena Vakali⁵, Nicolas Kourtellis⁶ - Show less +3 more•Institutions (6)

Samsung¹, Binghamton University², University College London³, Boston University⁴, Aristotle University of Thessaloniki⁵, Telefónica⁶

14 Oct 2019-ACM Transactions on The Web

TL;DR: This work presents a robust methodology to distinguish bullies and aggressors from normal Twitter users by considering text, user, and network-based attributes, and discusses the current status of Twitter user accounts marked as abusive by the methodology and the performance of potential mechanisms that can be used by Twitter to suspend users in the future.

...read moreread less

Abstract: Cyberbullying and cyberaggression are increasingly worrisome phenomena affecting people across all demographics. More than half of young social media users worldwide have been exposed to such prolonged and/or coordinated digital harassment. Victims can experience a wide range of emotions, with negative consequences such as embarrassment, depression, isolation from other community members, which embed the risk to lead to even more critical consequences, such as suicide attempts.In this work, we take the first concrete steps to understand the characteristics of abusive behavior in Twitter, one of today’s largest social media platforms. We analyze 1.2 million users and 2.1 million tweets, comparing users participating in discussions around seemingly normal topics like the NBA, to those more likely to be hate-related, such as the Gamergate controversy, or the gender pay inequality at the BBC station. We also explore specific manifestations of abusive behavior, i.e., cyberbullying and cyberaggression, in one of the hate-related communities (Gamergate). We present a robust methodology to distinguish bullies and aggressors from normal Twitter users by considering text, user, and network-based attributes. Using various state-of-the-art machine-learning algorithms, we classify these accounts with over 90% accuracy and AUC. Finally, we discuss the current status of Twitter user accounts marked as abusive by our methodology and study the performance of potential mechanisms that can be used by Twitter to suspend users in the future.

...read moreread less

80 citations

Journal Article•DOI•

Cashtag Piggybacking: Uncovering Spam and Bot Activity in Stock Microblogs on Twitter

[...]

Stefano Cresci, Fabrizio Lillo¹, Daniele Regoli, Serena Tardelli, Maurizio Tesconi - Show less +1 more•Institutions (1)

University of Bologna¹

03 Apr 2019-ACM Transactions on The Web

TL;DR: In this article, the authors investigated the presence and impact of fake stock microblogs on the stock market and found that as much as 71% of the authors of suspicious financial tweets are classified as bots by a state-of-the-art spambot-detection algorithm.

...read moreread less

Abstract: Microblogs are increasingly exploited for predicting prices and traded volumes of stocks in financial markets. However, it has been demonstrated that much of the content shared in microblogging platforms is created and publicized by bots and spammers. Yet, the presence (or lack thereof) and the impact of fake stock microblogs has never been systematically investigated before. Here, we study 9M tweets related to stocks of the five main financial markets in the US. By comparing tweets with financial data from Google Finance, we highlight important characteristics of Twitter stock microblogs. More importantly, we uncover a malicious practice—referred to as cashtag piggybacking—perpetrated by coordinated groups of bots and likely aimed at promoting low-value stocks by exploiting the popularity of high-value ones. Among the findings of our study is that as much as 71% of the authors of suspicious financial tweets are classified as bots by a state-of-the-art spambot-detection algorithm. Furthermore, 37% of them were suspended by Twitter a few months after our investigation. Our results call for the adoption of spam- and bot-detection techniques in all studies and applications that exploit user-generated content for predicting the stock market.

...read moreread less

64 citations

Journal Article•DOI•

A Large-scale Behavioural Analysis of Bots and Humans on Twitter

[...]

Zafar Gilani¹, Reza Farahbakhsh², Gareth Tyson³, Jon Crowcroft¹•Institutions (3)

University of Cambridge¹, Institut Mines-Télécom², Queen Mary University of London³

05 Feb 2019-ACM Transactions on The Web

TL;DR: A comparative analysis of the usage and impact of bots and humans on Twitter—one of the largest OSNs in the world— draws clear differences and interesting similarities between the two entities.

...read moreread less

Abstract: Recent research has shown a substantial active presence of bots in online social networks (OSNs). In this article, we perform a comparative analysis of the usage and impact of bots and humans on Twitter—one of the largest OSNs in the world. We collect a large-scale Twitter dataset and define various metrics based on tweet metadata. Using a human annotation task, we assign “bot” and “human” ground-truth labels to the dataset and compare the annotations against an online bot detection tool for evaluation. We then ask a series of questions to discern important behavioural characteristics of bots and humans using metrics within and among four popularity groups. From the comparative analysis, we draw clear differences and interesting similarities between the two entities.

...read moreread less

41 citations

Journal Article•DOI•

“The Enemy Among Us”: Detecting Cyber Hate Speech with Threats-based Othering Language Embeddings

[...]

Wafa Alorainy¹, Peter Burnap¹, Han Liu¹, Matthew Leighton Williams¹•Institutions (1)

Cardiff University¹

26 Jul 2019-ACM Transactions on The Web

TL;DR: A novel "othering" feature set is proposed that utilizes language use around the concept of “othering” and intergroup threat theory to identify these subtleties, and a wide range of classification methods are implemented using embedding learning to compute semantic distances between parts of speech considered to be part of an “ othering’ narrative.

...read moreread less

Abstract: Offensive or antagonistic language targeted at individuals and social groups based on their personal characteristics (also known as cyber hate speech or cyberhate) has been frequently posted and widely circulated via the World Wide Web. This can be considered as a key risk factor for individual and societal tension surrounding regional instability. Automated Web-based cyberhate detection is important for observing and understanding community and regional societal tension—especially in online social networks where posts can be rapidly and widely viewed and disseminated. While previous work has involved using lexicons, bags-of-words, or probabilistic language parsing approaches, they often suffer from a similar issue, which is that cyberhate can be subtle and indirect—thus, depending on the occurrence of individual words or phrases, can lead to a significant number of false negatives, providing inaccurate representation of the trends in cyberhate. This problem motivated us to challenge thinking around the representation of subtle language use, such as references to perceived threats from “the other” including immigration or job prosperity in a hateful context. We propose a novel “othering” feature set that utilizes language use around the concept of “othering” and intergroup threat theory to identify these subtleties, and we implement a wide range of classification methods using embedding learning to compute semantic distances between parts of speech considered to be part of an “othering” narrative. To validate our approach, we conducted two sets of experiments. The first involved comparing the results of our novel method with state-of-the-art baseline models from the literature. Our approach outperformed all existing methods. The second tested the best performing models from the first phase on unseen datasets for different types of cyberhate, namely religion, disability, race, and sexual orientation. The results showed F-measure scores for classifying hateful instances obtained through applying our model of 0.81, 0.71, 0.89, and 0.72, respectively, demonstrating the ability of the “othering” narrative to be an important part of model generalization.

...read moreread less

34 citations

Journal Article•DOI•

Exploiting Usage to Predict Instantaneous App Popularity: Trend Filters and Retention Rates

[...]

Stephan Sigg¹, Eemil Lagerspetz², Ella Peltonen³, Petteri Nurmi², Sasu Tarkoma² - Show less +1 more•Institutions (3)

Aalto University¹, University of Helsinki², University College Cork³

02 Apr 2019-ACM Transactions on The Web

TL;DR: This study conducts the first independent and large-scale study of retention rates and usage trends on a dataset of app-usage data from a community of 339,842 users and more than 213,667 apps, and develops a novel app- usage trend measure which provides instantaneous information about the popularity of an application.

...read moreread less

Abstract: Popularity of mobile apps is traditionally measured by metrics such as the number of downloads, installations, or user ratings. A problem with these measures is that they reflect usage only indirectly. Indeed, retention rates, i.e., the number of days users continue to interact with an installed app, have been suggested to predict successful app lifecycles. We conduct the first independent and large-scale study of retention rates and usage trends on a dataset of app-usage data from a community of 339,842 users and more than 213,667 apps. Our analysis shows that, on average, applications lose 65% of their users in the first week, while very popular applications (top 100) lose only 35%. It also reveals, however, that many applications have more complex usage behaviour patterns due to seasonality, marketing, or other factors. To capture such effects, we develop a novel app-usage trend measure which provides instantaneous information about the popularity of an application. Analysis of our data using this trend filter shows that roughly 40% of all apps never gain more than a handful of users (Marginal apps). Less than 0.1% of the remaining 60% are constantly popular (Dominant apps), 1% have a quick drain of usage after an initial steep rise (Expired apps), and 6% continuously rise in popularity (Hot apps). From these, we can distinguish, for instance, trendsetters from copycat apps. We conclude by demonstrating that usage behaviour trend information can be used to develop better mobile app recommendations.

...read moreread less

19 citations

Journal Article•DOI•

Web Portals for High-performance Computing: A Survey

[...]

Patrice Calegari¹, Marc Levrier¹, Paweł Balczyński¹•Institutions (1)

Atos¹

05 Feb 2019-ACM Transactions on The Web

TL;DR: This article addresses web interfaces for High-performance Computing (HPC) simulation software by introducing HPC web-based portal use cases and identifies and discusses the key features, among functional and non-functional requirements, that characterize such portals.

...read moreread less

Abstract: This article addresses web interfaces for High-performance Computing (HPC) simulation software. First, it presents a brief history, starting in the 1990s with Java applets, of web interfaces used for accessing and making best possible use of remote HPC resources. It introduces HPC web-based portal use cases. Then it identifies and discusses the key features, among functional and non-functional requirements, that characterize such portals. A brief state of the art is then presented. The design and development of Bull extreme factory Computing Studio v3 (XCS3) is chosen as a common thread for showing how the identified key features can all be implemented in one software: multi-tenancy, multi-scheduler compatibility, complete control through an HTTP RESTful API, customizable user interface with Responsive Web Design, HPC application template framework, remote visualization, and access through the Authentication, Authorization, and Accounting security framework with the Role-Based Access Control permission model. Non-functional requirements (security, usability, performance, reliability) are discussed, and the article concludes by giving perspective for future work.

...read moreread less

18 citations

Journal Article•DOI•

Social Networks under Stress: Specialized Team Roles and Their Communication Structure

[...]

Daniel M. Romero¹, Brian Uzzi², Jon Kleinberg³•Institutions (3)

University of Michigan¹, Northwestern University², Cornell University³

08 Feb 2019-ACM Transactions on The Web

TL;DR: Analysis of a complete dataset of millions of instant messages among the decision-makers with different roles in a large hedge fund and their network of outside contacts finds changes in network structure predict shifts in cognitive and affective processes, execution of new transactions, and local optimality of transactions better than prices.

...read moreread less

Abstract: Social network research has begun to take advantage of fine-grained communications regarding coordination, decision-making, and knowledge sharing. These studies, however, have not generally analyzed how external events are associated with a social network’s structure and communicative properties. Here, we study how external events are associated with a network’s change in structure and communications. Analyzing a complete dataset of millions of instant messages among the decision-makers with different roles in a large hedge fund and their network of outside contacts, we investigate the link between price shocks, network structure, and change in the affect and cognition of decision-makers embedded in the network. We also analyze the communication dynamics among specialized teams in the organization. When price shocks occur the communication network tends not to display structural changes associated with adaptiveness such as the activation of weak ties to obtain novel information. Rather, the network “turtles up.” It displays a propensity for higher clustering, strong tie interaction, and an intensification of insider vs. outsider and within-role vs. between-role communication. Further, we find changes in network structure predict shifts in cognitive and affective processes, execution of new transactions, and local optimality of transactions better than prices, revealing the important predictive relationship between network structure and collective behavior within a social network.

...read moreread less

13 citations

Journal Article•DOI•

Learning Linear Influence Models in Social Networks from Transient Opinion Dynamics

[...]

Abir De¹, Sourangshu Bhattacharya², Parantapa Bhattacharya³, Niloy Ganguly², Soumen Chakrabarti⁴ - Show less +1 more•Institutions (4)

Max Planck Society¹, Indian Institute of Technology Kharagpur², University of Virginia³, Indian Institute of Technology Bombay⁴

11 Nov 2019-ACM Transactions on The Web

TL;DR: This article begins an investigation into a family of novel data-driven influence models that accurately learn and fit realistic observations that are robust to missing observations for several timesteps after an actor has changed its opinion.

...read moreread less

Abstract: Social networks, forums, and social media have emerged as global platforms for forming and shaping opinions on a broad spectrum of topics like politics, sports, and entertainment. Users (also called actors) often update their evolving opinions, influenced through discussions with other users. Theoretical models and their analysis on understanding opinion dynamics in social networks abound in the literature. However, these models are often based on concepts from statistical physics. Their goal is to establish specific phenomena like steady state consensus or bifurcation. Analysis of transient effects is largely avoided. Moreover, many of these studies assume that actors’ opinions are observed globally and synchronously, which is rarely realistic. In this article, we initiate an investigation into a family of novel data-driven influence models that accurately learn and fit realistic observations. We estimate and do not presume edge strengths from observed opinions at nodes. Our influence models are linear but not necessarily positive or row stochastic in nature. As a consequence, unlike the previous studies, they do not depend on system stability or convergence during the observation period. Furthermore, our models take into account a wide variety of data collection scenarios. In particular, they are robust to missing observations for several timesteps after an actor has changed its opinion. In addition, we consider scenarios where opinion observations may be available only for aggregated clusters of nodes—a practical restriction often imposed to ensure privacy. Finally, to provide a conceptually interpretable design of edge influence, we offer a relatively frugal variant of our influence model, where the strength of influence between two connecting nodes depends on the node attributes (demography, personality, expertise, etc.). Such an approach reduces the number of model parameters, reduces overfitting, and offers a tractable and explicable sketch of edge influences in the context of opinion dynamics. With six real-life datasets crawled from Twitter and Reddit, as well as three more datasets collected from in-house experiments (with 102 volunteers), our proposed system gives a significant accuracy boost over four state-of-the-art baselines.

...read moreread less

12 citations

Journal Article•DOI•

What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors

[...]

Julián Alarte¹, Josep Silva¹, Salvador Tamarit•Institutions (1)

Polytechnic University of Valencia¹

27 Mar 2019-ACM Transactions on The Web

TL;DR: This work implemented and evaluated five of the most advanced template extractors in the literature and implemented a workbench, which can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.

...read moreread less

Abstract: A Web template is a resource that implements the structure and format of a website, making it ready for plugging content into already formatted and prepared pages. For this reason, templates are one of the main development resources for website engineers, because they increase productivity. Templates are also useful for the final user, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information, such as advertisements, menus, and banners. Processing and storing this information leads to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. There exist many techniques and tools for template extraction, but, unfortunately, it is not clear at all which template extractor should a user/system use, because they have never been compared, and because they present different (complementary) features such as precision, recall, and efficiency. In this work, we compare the most advanced template extractors. We implemented and evaluated five of the most advanced template extractors in the literature. To compare all of them, we implemented a workbench, where they have been integrated and evaluated. Thanks to this workbench, we can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.

...read moreread less

9 citations

Journal Article•DOI•

FusE: Entity-Centric Data Fusion on Linked Data

[...]

Steffen Thoma¹, Andreas Thalhammer¹, Andreas Harth², Rudi Studer¹•Institutions (2)

Karlsruhe Institute of Technology¹, University of Erlangen-Nuremberg²

17 Feb 2019-ACM Transactions on The Web

TL;DR: This work presents FusE, an approach that identifies similar entity-specific data across sources, independent of the vocabulary and data modeling choices, and conducts experiments to underline the advantages of the presented entity-centric data fusion approach.

...read moreread less

Abstract: Many current web pages include structured data which can directly be processed and used. Search engines, in particular, gather that structured data and provide question answering capabilities over the integrated data with an entity-centric presentation of the results. Due to the decentralized nature of the web, multiple structured data sources can provide similar information about an entity. But data from different sources may involve different vocabularies and modeling granularities, which makes integration difficult. We present FusE, an approach that identifies similar entity-specific data across sources, independent of the vocabulary and data modeling choices. We apply our method along the scenario of a trustable knowledge panel, conduct experiments in which we identify and process entity data from web sources, and compare the output to a competing system. The results underline the advantages of the presented entity-centric data fusion approach.

...read moreread less

Journal Article•DOI•

User Studies on End-User Service Composition: A Literature Review and a Design Framework

[...]

Liping Zhao¹, Pericles Loucopoulos, Evangelia Kavakli², Keletso J. Letsholo³•Institutions (3)

University of Manchester¹, University of the Aegean², Botswana International University of Science and Technology³

26 Jul 2019-ACM Transactions on The Web

TL;DR: There is a gap in understanding of what constitutes a user study and how a good user study should be designed, conducted, and reported, so the development of a design framework and a set of questions for the design, reporting, and review of good user studies for EUSC are addressed.

...read moreread less

Abstract: Context: End-user service composition (EUSC) is a service-oriented paradigm that aims to empower end users and allow them to compose their own web applications from reusable service components. User studies have been used to evaluate EUSC tools and processes. Such an approach should benefit software development, because incorporating end users’ feedback into software development should make software more useful and usable. Problem: There is a gap in our understanding of what constitutes a user study and how a good user study should be designed, conducted, and reported. Goal: This article aims to address this gap. Method: The article presents a systematic review of 47 selected user studies for EUSC. Guided by a review framework, the article systematically and consistently assesses the focus, methodology and cohesion of each of these studies. Results: The article concludes that the focus of these studies is clear, but their methodology is incomplete and inadequate, their overall cohesion is poor. The findings lead to the development of a design framework and a set of questions for the design, reporting, and review of good user studies for EUSC. The detailed analysis and the insights obtained from the analysis should be applicable to the design of user studies for service-oriented systems as well and indeed for any user studies related to software artifacts.

...read moreread less

Journal Article•DOI•

Fast and Practical Snippet Generation for RDF Datasets

[...]

Daxin Liu¹, Gong Cheng², Qingxia Liu², Yuzhong Qu²•Institutions (2)

RWTH Aachen University¹, Nanjing University²

20 Dec 2019-ACM Transactions on The Web

TL;DR: A new algorithm is designed for the reuse of datasets that are partially accessible via online query services and adapted to trade off quality of snippet for feasibility and efficiency in the Web environment and develops an anytime algorithm that can generate empirically better solutions using additional time.

...read moreread less

Abstract: Triple-structured open data creates value in many ways. However, the reuse of datasets is still challenging. Users feel difficult to assess the usefulness of a large dataset containing thousands or millions of triples. To satisfy the needs, existing abstractive methods produce a concise high-level abstraction of data. Complementary to that, we adopt the extractive strategy and aim to select the optimum small subset of data from a dataset as a snippet to compactly illustrate the content of the dataset. This has been formulated as a combinatorial optimization problem in our previous work. In this article, we design a new algorithm for the problem, which is an order of magnitude faster than the previous one but has the same approximation ratio. We also develop an anytime algorithm that can generate empirically better solutions using additional time. To suit datasets that are partially accessible via online query services (e.g., SPARQL endpoints for RDF data), we adapt our algorithms to trade off quality of snippet for feasibility and efficiency in the Web environment. We carry out extensive experiments based on real RDF datasets and SPARQL endpoints for evaluating quality and running time. The results demonstrate the effectiveness and practicality of our proposed algorithms.

...read moreread less

Journal Article•DOI•

Long-term Measurement and Analysis of the Free Proxy Ecosystem

[...]

Diego Perino¹, Matteo Varvello, Claudio Soriente•Institutions (1)

Telefónica¹

20 Dec 2019-ACM Transactions on The Web

TL;DR: Through the analysis of more than 14TB of proxied traffic, it is shown that web browsing is the primary user activity and only half of these proxies have decent performance and can be used reliably.

...read moreread less

Abstract: Free web proxies promise anonymity and censorship circumvention at no cost. Several websites publish lists of free proxies organized by country, anonymity level, and performance. These lists index hundreds of thousands of hosts discovered via automated tools and crowd-sourcing. A complex free proxy ecosystem has been forming over the years, of which very little is known. In this article, we shed light on this ecosystem via a distributed measurement platform that leverages both active and passive measurements. Active measurements are carried out by an infrastructure we name ProxyTorrent, which discovers free proxies, assesses their performance, and detects potential malicious activities. Passive measurements focus on proxy performance and usage in the wild, and are accomplished by means of a Chrome extension named Ciao. ProxyTorrent has been running since January 2017, monitoring up to 230K free proxies. Ciao was launched in March 2017 and has thus far served roughly 9.7K users and generated 14TB of traffic. Our analysis shows that less than 2% of the proxies announced on the Web indeed proxy traffic on behalf of users; further, only half of these proxies have decent performance and can be used reliably. Every day, around 5%--10% of the active proxies exhibit malicious behaviors, e.g., advertisement injection, TLS interception, and cryptojacking, and these proxies are also the ones providing the best performance. Through the analysis of more than 14TB of proxied traffic, we show that web browsing is the primary user activity. Geo-blocking avoidance—allegedly a popular use case for free web proxies—accounts for 30% or less of the traffic, and it mostly involves countries hosting popular geo-blocked content.

...read moreread less

Journal Article•DOI•

Layout Cross-Platform and Cross-Browser Incompatibilities Detection using Classification of DOM Elements

[...]

Willian Massami Watanabe, Giovana Lázaro Amêndola, Fagner Christian Paes

22 Mar 2019-ACM Transactions on The Web

TL;DR: The proposed approach classifies each DOM element which composes a web application as an incompatibility or not, based on its attributes, position, alignment, screenshot, and the viewport width of the browser, which is an extension of previous Cross-browser incompatibility detection approaches.

...read moreread less

Abstract: Web applications can be accessed through a variety of user agent configurations, in which the browser, platform, and device capabilities are not under the control of developers. In order to grant the compatibility of a web application in each environment, developers must manually inspect their web application in a wide variety of devices, platforms, and browsers. Web applications can be rendered inconsistently depending on the browser, the platform, and the device capabilities which are used. Furthermore, the devices’ different viewport widths impact the way web applications are rendered in them, in which elements can be resized and change their absolute positions in the display. These adaptation strategies must also be considered in automatic incompatibility detection approaches in the state of the art. Hence, we propose a classification approach for detecting Layout Cross-platform and Cross-browser incompatibilities, which considers the adaptation strategies used in responsive web applications. Our approach is an extension of previous Cross-browser incompatibility detection approaches and has the goal of reducing the cost associated with manual inspections in different devices, platforms, and browsers, by automatically detecting Layout incompatibilities in this scenario. The proposed approach classifies each DOM element which composes a web application as an incompatibility or not, based on its attributes, position, alignment, screenshot, and the viewport width of the browser. We report the results of an experiment conducted with 42 Responsive Web Applications, rendered in three devices (Apple iPhone SE, Apple iPhone 8 Plus, and Motorola Moto G4) and browsers (Google Chrome and Apple Safari). The results (with F-measure of 0.70) showed evidence which quantify the effectiveness of our classification approach, and it could be further enhanced for detecting Cross-platform and Cross-browser incompatibilities. Furthermore, in the experiment, our approach also performed better when compared to a former state-of-the-art classification technique for Cross-browser incompatibilities detection.

...read moreread less

Journal Article•DOI•

Combining URL and HTML Features for Entity Discovery in the Web

[...]

Edimar Manica¹, Carina F. Dorneles², Renata Galante³•Institutions (3)

Federal Institute of Rio Grande do Sul¹, Universidade Federal de Santa Catarina², Universidade Federal do Rio Grande do Sul³

20 Dec 2019-ACM Transactions on The Web

TL;DR: The novelty of the method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, and thus the efficacy of the entity-page discovery task is increased.

...read moreread less

Abstract: The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in the entity-pages can be used for many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on the websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, and thus the efficacy of the entity-page discovery task is increased. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved a 95% rate of precision and 85% recall rate. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.

...read moreread less

Journal Article•DOI•

Efficient Pairwise Penetrating-rank Similarity Retrieval

[...]

Weiren Yu¹, Julie A. McCann², Chengyuan Zhang³•Institutions (3)

Nanjing University of Science and Technology¹, Imperial College London², Central South University³

20 Dec 2019-ACM Transactions on The Web

TL;DR: This article considers the optimization techniques for P-Rank search that encompasses its accuracy, stability, and computational efficiency and proposes two matrix-based algorithms, applicable to digraphs and undirected graphs, which improve the computational time of P- Rank computation.

...read moreread less

Abstract: Many web applications demand a measure of similarity between two entities, such as collaborative filtering, web document ranking, linkage prediction, and anomaly detection. P-Rank (Penetrating-Rank) has been accepted as a promising graph-based similarity measure, as it provides a comprehensive way of encoding both incoming and outgoing links into assessment. However, the existing method to compute P-Rank is iterative in nature and rather cost-inhibitive. Moreover, the accuracy estimate and stability issues for P-Rank computation have not been addressed. In this article, we consider the optimization techniques for P-Rank search that encompasses its accuracy, stability, and computational efficiency. (1) The accuracy estimation is provided for P-Rank iterations, with the aim to find out the number of iterations, k, required to guarantee a desired accuracy. (2) A rigorous bound on the condition number of P-Rank is obtained for stability analysis. Based on this bound, it can be shown that P-Rank is stable and well-conditioned when the damping factors are chosen to be suitably small. (3) Two matrix-based algorithms, applicable to digraphs and undirected graphs, are, respectively, devised for efficient P-Rank computation, which improves the computational time from O(kn3) to O(υ n2+υ6) for digraphs, and to O(υn2) for undirected graphs, where n is the number of vertices in the graph, and υ (L n) is the target rank of the graph. Moreover, our proposed algorithms can significantly reduce the memory space of P-Rank computations from O(n2) to O(υn+υ4) for digraphs, and to O(υ n) for undirected graphs, respectively. Finally, extensive experiments on real-world and synthetic datasets demonstrate the usefulness and efficiency of the proposed techniques for P-Rank similarity assessment on various networks.

...read moreread less