Showing papers on "Web page published in 2018"

PDF

Open Access

Journal Article•DOI•

EzMol: A Web Server Wizard for the Rapid Visualization and Image Production of Protein and Nucleic Acid Structures.

[...]

Christopher R. Reynolds¹, Suhail A. Islam¹, Michael J.E. Sternberg¹•Institutions (1)

20 Jul 2018-Journal of Molecular Biology

TL;DR: EzMol allows the upload of molecular structure files in PDB format to generate a Web page including a representation of the structure that the user can manipulate, and provides intuitive options for chain display, adjusting the color/transparency of residues, side chains and protein surfaces, and for adding labels to residues.

...read moreread less

132 citations

Proceedings Article•DOI•

How You Get Shot in the Back: A Systematical Study about Cryptojacking in the Real World

[...]

Geng Hong¹, Zhemin Yang¹, Sen Yang¹, Lei Zhang¹, Yuhong Nan¹, Zhibo Zhang¹, Min Yang¹, Yuan Zhang¹, Zhiyun Qian², Haixin Duan³ - Show less +6 more•Institutions (3)

Fudan University¹, University of California, Riverside², Tsinghua University³

15 Oct 2018

TL;DR: CMTracker is built, a behavior-based detector with two runtime profilers for automatically tracking Cryptocurrency Mining scripts and their related domains and gains a more comprehensive picture of the cryptojacking attacks, including their impact, distribution mechanisms, obfuscation, and attempts to evade detection.

...read moreread less

Abstract: As a new mechanism to monetize web content, cryptocurrency mining is becoming increasingly popular. The idea is simple: a webpage delivers extra workload (JavaScript) that consumes computational resources on the client machine to solve cryptographic puzzles, typically without notifying users or having explicit user consent. This new mechanism, often heavily abused and thus considered a threat termed "cryptojacking", is estimated to affect over 10 million web users every month; however, only a few anecdotal reports exist so far and little is known about its severeness, infrastructure, and technical characteristics behind the scene. This is likely due to the lack of effective approaches to detect cryptojacking at a large-scale (e.g., VirusTotal). In this paper, we take a first step towards an in-depth study over cryptojacking. By leveraging a set of inherent characteristics of cryptojacking scripts, we build CMTracker, a behavior-based detector with two runtime profilers for automatically tracking Cryptocurrency Mining scripts and their related domains. Surprisingly, our approach successfully discovered 2,770 unique cryptojacking samples from 853,936 popular web pages, including 868 among top 100K in Alexa list. Leveraging these samples, we gain a more comprehensive picture of the cryptojacking attacks, including their impact, distribution mechanisms, obfuscation, and attempts to evade detection. For instance, a diverse set of organizations benefit from cryptojacking based on the unique wallet ids. In addition, to stay under the radar, they frequently update their attack domains (fastflux) on the order of days. Many attackers also apply evasion techniques, including limiting the CPU usage, obfuscating the code, etc.

...read moreread less

94 citations

Journal Article•DOI•

Using Psychophysiological Sensors to Assess Mental Workload During Web Browsing.

[...]

Angel Jimenez-Molina¹, Cristian Retamal¹, Hernan Lira¹•Institutions (1)

University of Chile¹

03 Feb 2018-Sensors

TL;DR: The correlation between stimuli and physiological responses, which are measured with high-frequency, non-invasive psychophysiological sensors during very short span windows, are leveraged to identify levels of mental workload through the analysis of pupil dilation measured by an eye-tracking sensor.

...read moreread less

Abstract: Knowledge of the mental workload induced by a Web page is essential for improving users’ browsing experience. However, continuously assessing the mental workload during a browsing task is challenging. To address this issue, this paper leverages the correlation between stimuli and physiological responses, which are measured with high-frequency, non-invasive psychophysiological sensors during very short span windows. An experiment was conducted to identify levels of mental workload through the analysis of pupil dilation measured by an eye-tracking sensor. In addition, a method was developed to classify mental workload by appropriately combining different signals (electrodermal activity (EDA), electrocardiogram, photoplethysmo-graphy (PPG), electroencephalogram (EEG), temperature and pupil dilation) obtained with non-invasive psychophysiological sensors. The results show that the Web browsing task involves four levels of mental workload. Also, by combining all the sensors, the efficiency of the classification reaches 93.7%.

...read moreread less

63 citations

Proceedings Article•DOI•

Rousillon: Scraping Distributed Hierarchical Web Data

[...]

Sarah Chasins¹, Maria Mueller², Rastislav Bodik²•Institutions (2)

University of California, Berkeley¹, University of Washington²

11 Oct 2018

TL;DR: This work presents Rousillon, a programming system for writing complex web automation scripts by demonstration, and developed novel relation selection and generalization algorithms that can be used to write hierarchically-structured data from across many different webpages.

...read moreread less

Abstract: Programming by Demonstration (PBD) promises to enable data scientists to collect web data. However, in formative interviews with social scientists, we learned that current PBD tools are insufficient for many real-world web scraping tasks. The missing piece is the capability to collect hierarchically-structured data from across many different webpages. We present Rousillon, a programming system for writing complex web automation scripts by demonstration. Users demonstrate how to collect the first row of a 'universal table' view of a hierarchical dataset to teach Rousillon how to collect all rows. To offer this new demonstration model, we developed novel relation selection and generalization algorithms. In a within-subject user study on 15 computer scientists, users can write hierarchical web scrapers 8 times more quickly with Rousillon than with traditional programming.

...read moreread less

60 citations

Proceedings Article•DOI•

Challenges and Innovations in Building a Product Knowledge Graph

[...]

Xin Luna Dong¹•Institutions (1)

Amazon.com¹

19 Jul 2018

TL;DR: Three advanced extraction technologies to harvest product knowledge from semi-structured sources on the web and from text product profiles are developed, and the OpenTag technique extends state-of-the-art techniques such as Recursive Neural Network and Conditional Random Field with attention and active learning.

...read moreread less

Abstract: Knowledge graphs have been used to support a wide range of applications and enhance search results for multiple major search engines, such as Google and Bing. At Amazon we are building a Product Graph, an authoritative knowledge graph for all products in the world. The thousands of product verticals we need to model, the vast number of data sources we need to extract knowledge from, the huge volume of new products we need to handle every day, and the various applications in Search, Discovery, Personalization, Voice, that we wish to support, all present big challenges in constructing such a graph. In this talk we describe four scientific directions we are investigating in building and using such a knowledge graph. First, we have been developing advanced extraction technologies to harvest product knowledge from semi-structured sources on the web and from text product profiles. Our annotation-based extraction tool selects a few webpages (typically below 10 pages) from a website for annotations, and can derive XPaths to extract from the whole website with average precision and recall of 97% [1]. Our distantly supervised extraction tool, CERES, uses an existing knowledge graph to automatically generate (noisy) training labels, and can obtain a precision over 90% when extracting from long-tail websites in various languages [1]. Our OpenTag technique extends state-of-the-art techniques such as Recursive Neural Network (RNN) and Conditional Random Field with attention and active learning, to achieve over 90% precision and recall in extracting attribute values (including values unseen in training data) from product titles, descriptions, and bullets [3].

...read moreread less

59 citations

Proceedings Article•DOI•

Towards a Context-Aware IDE-Based Meta Search Engine for Recommendation about Programming Errors and Exceptions

[...]

Mohammad Masudur Rahman¹, Shamima Yeasmin¹, Chanchal K. Roy¹•Institutions (1)

University of Saskatchewan¹

06 Jul 2018-arXiv: Software Engineering

TL;DR: This paper proposes an Eclipse IDE-based web search solution that exploits the APIs provided by three popular web search engines-Google, Yahoo, Bing and a popular programming Q & A site, StackOverflow, and captures the content-relevance, context-relevant, popularity and search engine confidence of each candidate result against the encountered programming problems.

...read moreread less

Abstract: Study shows that software developers spend about 19% of their time looking for information in the web during software development and maintenance. Traditional web search forces them to leave the working environment (e.g., IDE) and look for information in the web browser. It also does not consider the context of the problems that the developers search solutions for. The frequent switching between web browser and the IDE is both time-consuming and distracting, and the keyword-based traditional web search often does not help much in problem solving. In this paper, we propose an Eclipse IDE-based web search solution that exploits the APIs provided by three popular web search engines-- Google, Yahoo, Bing and a popular programming Q & A site, Stack Overflow, and captures the content-relevance, context-relevance, popularity and search engine confidence of each candidate result against the encountered programming problems. Experiments with 75 programming errors and exceptions using the proposed approach show that inclusion of different types of context information associated with a given exception can enhance the recommendation accuracy of a given exception. Experiments both with two existing approaches and existing web search engines confirm that our approach can perform better than them in terms of recall, mean precision and other performance measures with little computational cost.

...read moreread less

59 citations

Proceedings Article•DOI•

JustIoT Internet of Things based on the Firebase real-time database

[...]

Wu-Jeng Li¹, Chiaming Yen¹, You-Sheng Lin¹, Shu-Chu Tung², Shih-Miao Huang¹ - Show less +1 more•Institutions (2)

National Formosa University¹, Kun Shan University²

01 Feb 2018

TL;DR: This paper designs an Internet of Things system (called JustIoT) which is mainly divided into four parts; back-end Google Firebase real-time database, front-end SPA (Single Page Application) web monitoring program, controller software-hardware, and intelligence server that support MQTT connection and condition control.

...read moreread less

Abstract: This paper designs an Internet of Things system (called JustIoT) which is mainly divided into four parts; back-end Google Firebase real-time database, front-end SPA (Single Page Application) web monitoring program (including mobile monitoring App), controller software-hardware, and intelligence server that support MQTT connection and condition control. JustIoT receives data from all kinds of controllers, allowing users to set the control rules and remote monitoring and control. JustIOT distinguishes users from managers, vendors, customers, registrants, and visitors. Users can build applications in the above system, to serve customers, and to run a business. In the JustIoT, management web page based on Angular front-end technology is connected to the Firebase real-time database. The event of data modification of Firebase database can trigger Angular's two-way data binding to achieve Three-way data binding effect to implement server-less architecture easily. The data in Firebase database is read and written by the front-end devices (web apps, mobile apps, and controllers) directly. The intelligence server is an MQTT server that supports the connections of relatively weak embedded controllers such as the Arduino controller. The intelligent server can be considered as an intermediary between the Firebase real-time database and weak controllers, which performs the transfer of data and remote commands. The intelligent server is also the intelligent computing center of the JustIoT. It performs condition control.

...read moreread less

57 citations

Journal Article•DOI•

CERES: distantly supervised relation extraction from the semi-structured web

[...]

Colin Lockard¹, Xin Luna Dong², Arash Einolghozati³, Prashant Shiralkar²•Institutions (3)

University of Washington¹, Amazon.com², Facebook³

01 Jun 2018

TL;DR: In this article, a method for automatic extraction from semi-structured websites based on distant supervision is presented. But this method is not suitable for settings with complex schemas and information-rich websites.

...read moreread less

Abstract: The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites.In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a website and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.

...read moreread less

56 citations

Journal Article•DOI•

Challenges to Assess Accessibility in Higher Education Websites: A Comparative Study of Latin America Universities

[...]

Patricia Acosta-Vargas¹, Tania Acosta², Sergio Luján-Mora³•Institutions (3)

Universidad de las Américas Puebla¹, National Technical University², University of Alicante³

19 Jun 2018-IEEE Access

TL;DR: The results show that the universities’ websites have frequent problems related to the lack of alternative image text, and indicate that it is necessary to strengthen Web accessibility policies in each country and apply better directives in this area to make Websites more inclusive.

...read moreread less

Abstract: The Web has revolutionized our daily lives, becoming a prime source of information, knowledge, inquiry, and provision of services in various areas. It is possible to obtain information easily from any institution through the Internet; in fact, the first impression of an organization an individual perceives is almost always based on its official website. Services related to education are increasing worldwide; therefore, it is important that users, regardless of their disabilities, be able to access these websites in an effective manner. However, the homepages of universities in Latin America still do not meet web accessibility criteria. This paper describes the problems of web accessibility identified in 348 main university websites in Latin America according to their rankings on Webometrics. The results show that the universities’ websites have frequent problems related to the lack of alternative image text. It was found that the university websites included in the present study violate Web accessibility requirements based on the Web Content Accessibility Guidelines 2.0. The many problems identified concerning Website accessibility indicate that it is necessary to strengthen Web accessibility policies in each country and apply better directives in this area to make Websites more inclusive.

...read moreread less

56 citations

Journal Article•DOI•

The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents

[...]

D Gunawan¹, C A Sembiring¹, Mohammad Andri Budiman¹•Institutions (1)

University of North Sumatra¹

01 Mar 2018

TL;DR: This is a preliminary research that uses cosine similarity to implement text relevance in order to find topic specific document using Porter Stemming algorithm.

...read moreread less

Abstract: Rapidly increasing number of web pages or documents leads to topic specific filtering in order to find web pages or documents efficiently. This is a preliminary research that uses cosine similarity to implement text relevance in order to find topic specific document. This research is divided into three parts. The first part is text-preprocessing. In this part, the punctuation in a document will be removed, then convert the document to lower case, implement stop word removal and then extracting the root word by using Porter Stemming algorithm. The second part is keywords weighting. Keyword weighting will be used by the next part, the text relevance calculation. Text relevance calculation will result the value between 0 and 1. The closer value to 1, then both documents are more related, vice versa.

...read moreread less

55 citations

Journal Article•DOI•

Ranking Analysis for Online Customer Reviews of Products Using Opinion Mining with Clustering

[...]

S. K. Lakshmanaprabu¹, K. Shankar², Deepak Gupta³, Ashish Khanna³, Joel J. P. C. Rodrigues⁴, Joel J. P. C. Rodrigues⁵, Plácido Rogério Pinheiro⁶, Victor Hugo C. de Albuquerque⁶ - Show less +4 more•Institutions (6)

B. S. Abdur Rahman University¹, Kalasalingam University², Maharaja Agrasen Institute of Technology³, Inatel⁴, Saint Petersburg State University of Information Technologies, Mechanics and Optics⁵, University of Fortaleza⁶

06 Sep 2018-Complexity

TL;DR: The Dragonfly Algorithm (DA)—recognizes ideal features of the items in sites, and an advanced ideal feature-based positioning procedure will be directed to discover, at long last, which web-based business webpage is best and easy to understand.

...read moreread less

Abstract: Sites for web-based shopping are winding up increasingly famous these days. Organizations are anxious to think about their client purchasing conduct to build their item deal. Internet shopping is a method for powerful exchange among cash and merchandise which is finished by end clients without investing a huge energy spam. The goal of this paper is to dissect the high-recommendation web-based business sites with the help of a collection strategy and a swarm-based improvement system. At first, the client surveys of the items from web-based business locales with a few features were gathered and, afterward, a fuzzy c-means (FCM) grouping strategy to group the features for a less demanding procedure was utilized. Also, the novelty of this work—the Dragonfly Algorithm (DA)—recognizes ideal features of the items in sites, and an advanced ideal feature-based positioning procedure will be directed to discover, at long last, which web-based business webpage is best and easy to understand. From the execution, the outcomes demonstrate the greatest exactness rate, that is, 94.56% compared with existing methods.

...read moreread less

Journal Article•DOI•

SmilesDrawer: Parsing and Drawing SMILES-Encoded Molecular Structures Using Client-Side JavaScript

[...]

Daniel Probst¹, Jean-Louis Reymond¹•Institutions (1)

University of Bern¹

11 Jan 2018-Journal of Chemical Information and Modeling

TL;DR: SmilesDrawer can draw structurally and stereochemically complex structures such as maitotoxin and C60 without using templates, yet has an exceptionally small computational footprint and low memory usage without the requirement for loading images or any other form of client-server communication.

...read moreread less

Abstract: Here we present SmilesDrawer, a dependency-free JavaScript component capable of both parsing and drawing SMILES-encoded molecular structures client-side, developed to be easily integrated into web projects and to display organic molecules in large numbers and fast succession. SmilesDrawer can draw structurally and stereochemically complex structures such as maitotoxin and C60 without using templates, yet has an exceptionally small computational footprint and low memory usage without the requirement for loading images or any other form of client–server communication, making it easy to integrate even in secure (intranet, firewalled) or offline applications. These features allow the rendering of thousands of molecular structure drawings on a single web page within seconds on a wide range of hardware supporting modern browsers. The source code as well as the most recent build of SmilesDrawer is available on Github (http://doc.gdb.tools/smilesDrawer/). Both yarn and npm packages are also available.

...read moreread less

Proceedings Article•DOI•

Gender-Inclusive Design: Sense of Belonging and Bias in Web Interfaces

[...]

Danaë Metaxa-Kakavouli¹, Kelly Wang², James A. Landay¹, Jeffrey T. Hancock¹•Institutions (2)

Stanford University¹, Brown University²

21 Apr 2018

TL;DR: It is confirmed that young women exposed to the masculine page are negatively affected, reporting significantly less ambient belonging, interest in the course and in studying computer science broadly, and the need for inclusive user interface design for the web is highlighted.

...read moreread less

Abstract: We interact with dozens of web interfaces on a daily basis, making inclusive web design practices more important than ever. This paper investigates the impacts of web interface design on ambient belonging, or the sense of belonging to a community or culture. Our experiment deployed two content-identical webpages for an introductory computer science course, differing only in aesthetic features such that one was perceived as masculine while the other was gender-neutral. Our results confirm that young women exposed to the masculine page are negatively affected, reporting significantly less ambient belonging, interest in the course and in studying computer science broadly. They also experience significantly more concern about others' perception of their gender relative to young women exposed to the neutral page, while no similar effect is seen in young men. These results suggest that gender biases can be triggered by web design, highlighting the need for inclusive user interface design for the web.

...read moreread less

Proceedings Article•DOI•

An Efficient Bandit Algorithm for Realtime Multivariate Optimization.

[...]

Daniel Hill¹, Houssam Nassif¹, Yi Liu¹, Anand Iyer¹, S. V. N. Vishwanathan² - Show less +1 more•Institutions (2)

Amazon.com¹, University of California, Santa Cruz²

22 Oct 2018-arXiv: Learning

TL;DR: This work formulates an approach where the possible interactions between different components of the page are modeled explicitly and applies bandit methodology to explore the layout space efficiently and use hill-climbing to select optimal content in realtime.

...read moreread less

Abstract: Optimization is commonly employed to determine the content of web pages, such as to maximize conversions on landing pages or click-through rates on search engine result pages. Often the layout of these pages can be decoupled into several separate decisions. For example, the composition of a landing page may involve deciding which image to show, which wording to use, what color background to display, etc. Such optimization is a combinatorial problem over an exponentially large decision space. Randomized experiments do not scale well to this setting, and therefore, in practice, one is typically limited to optimizing a single aspect of a web page at a time. This represents a missed opportunity in both the speed of experimentation and the exploitation of possible interactions between layout decisions. Here we focus on multivariate optimization of interactive web pages. We formulate an approach where the possible interactions between different components of the page are modeled explicitly. We apply bandit methodology to explore the layout space efficiently and use hill-climbing to select optimal content in realtime. Our algorithm also extends to contextualization and personalization of layout selection. Simulation results show the suitability of our approach to large decision spaces with strong interactions between content. We further apply our algorithm to optimize a message that promotes adoption of an Amazon service. After only a single week of online optimization, we saw a 21% conversion increase compared to the median layout. Our technique is currently being deployed to optimize content across several locations at Amazon.com.

...read moreread less

Proceedings Article•DOI•

Detecting Autism Based on Eye-Tracking Data from Web Searching Tasks

[...]

Victoria Yaneva¹, Le An Ha¹, Sukru Eraslan², Yeliz Yesilada², Ruslan Mitkov¹ - Show less +1 more•Institutions (2)

University of Wolverhampton¹, Middle East Technical University Northern Cyprus Campus²

23 Apr 2018

TL;DR: Preliminary results show that the differences in the way people with autism process web content could be used for the future development of serious games for autism screening and the effects of the type of the task performed are explored.

...read moreread less

Abstract: The ASD diagnosis requires a long, elaborate, and expensive procedure, which is subjective and is currently restricted to behavioural, historical, and parent-report information. In this paper, we present an alternative way for detecting the condition based on the atypical visual-attention patterns of people with autism. We collect gaze data from two different kinds of tasks related to processing of information from web pages: Browsing and Searching. The gaze data is then used to train a machine learning classifier whose aim is to distinguish between participants with autism and a control group of participants without autism. In addition, we explore the effects of the type of the task performed, different approaches to defining the areas of interest, gender, visual complexity of the web pages and whether or not an area of interest contained the correct answer to a searching task. Our best-performing classifier achieved 0.75 classification accuracy for a combination of selected web pages using all gaze features. These preliminary results show that the differences in the way people with autism process web content could be used for the future development of serious games for autism screening. The gaze data, R code, visual stimuli and task descriptions are made freely available for replication purposes.

...read moreread less

Proceedings Article•DOI•

Web-based Attacks to Discover and Control Local IoT Devices

[...]

Gunes Acar¹, Danny Yuxing Huang¹, Frank Li², Arvind Narayanan¹, Nick Feamster¹ - Show less +1 more•Institutions (2)

Princeton University¹, University of California, Berkeley²

07 Aug 2018

TL;DR: This paper presents two web-based attacks against local IoT devices that any malicious web page or third-party script can perform, even when the devices are behind NATs.

...read moreread less

Abstract: In this paper, we present two web-based attacks against local IoT devices that any malicious web page or third-party script can perform, even when the devices are behind NATs. In our attack scenario, a victim visits the attacker's website, which contains a malicious script that communicates with IoT devices on the local network that have open HTTP servers. We show how the malicious script can circumvent the same-origin policy by exploiting error messages on the HTML5 MediaError interface or by carrying out DNS rebinding attacks. We demonstrate that the attacker can gather sensitive information from the devices (e.g., unique device identifiers and precise geolocation), track and profile the owners to serve ads, or control the devices by playing arbitrary videos and rebooting. We propose potential countermeasures to our attacks that users, browsers, DNS providers, and IoT vendors can implement.

...read moreread less

Proceedings Article•DOI•

HTTP/2 Prioritization and its Impact on Web Performance

[...]

Maarten Wijnants¹, Robin Marx¹, Peter Quax¹, Wim Lamotte¹•Institutions (1)

University of Hasselt¹

10 Apr 2018

TL;DR: An extensive survey of modern User Agent implementations is detailed, with the conclusion that the major vendors all approach HTTP/2 prioritization in widely different ways, from naive (Safari, IE, Edge) to complex (Chrome, Firefox).

...read moreread less

Abstract: Web performance is a hot topic, as many studies have shown a strong correlation between slow webpages and loss of revenue due to user dissatisfaction. Front and center in Page Load Time (PLT) optimization is the order in which resources are downloaded and processed. The new HTTP/2 specification includes dedicated resource prioritization provisions, to be used in tandem with resource multiplexing over a single, well-filled TCP connection. However, little is yet known about its application by browsers and its impact on page load performance. This article details an extensive survey of modern User Agent implementations, with the conclusion that the major vendors all approach HTTP/2 prioritization in widely different ways, from naive (Safari, IE, Edge) to complex (Chrome, Firefox). We investigate the performance effect of these discrepancies with a full-factorial experimental evaluation involving eight prioritization algorithms, two off-the-shelf User Agents, 40 realistic webpages, and five heterogeneous (emulated) network conditions. We find that in general the complex approaches yield the best results, while naive schemes can lead to over 25% slower median visual load times. Also, prioritization is found to matter most for heavy-weight pages. Finally, it is ascertained that achieving PLT optimizations via generic server-side HTTP/2 re-prioritization schemes is a non-trivial task and that their performance is influenced by the implementation intricacies of individual browsers.

...read moreread less

Journal Article•DOI•

On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl

[...]

Sebastian Schelter¹, Jérôme Kunegis²•Institutions (2)

Technical University of Berlin¹, University of Koblenz and Landau²

26 Jan 2018

TL;DR: It is confirmed that trackers are widespread, and that a small number of trackers dominates the web (Google, Facebook and Twitter), and that Google still operates services on Chinese websites, despite its proclaimed retreat from the Chinese market.

...read moreread less

Abstract: We perform a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5~billion web pages of the CommonCrawl 2012 corpus, and aggregate those to a dataset containing more than 140 million third-party embeddings in over 41 million domains. To the best of our knowledge, this constitutes the largest empirical web tracking dataset collected so far, and exceeds related studies by more than an order of magnitude in the number of domains and web pages analyzed. Due to the enormous size of the dataset, we are able to perform a large-scale study of online tracking, on three levels: (1) On a global level, we give a precise figure for the extent of tracking, give insights into the structural properties of the `online tracking sphere' and analyse which trackers (and subsequently, which companies) are used by how many websites. (2) On a country-specific level, we analyse which trackers are used by websites in different countries, and identify the countries in which websites choose significantly different trackers than in the rest of the world. (3) We answer the question whether the content of websites influences the choice of trackers they use, leveraging more than ninety thousand categorized domains. In particular, we analyse whether highly privacy-critical websites about health and addiction make different choices of trackers than other websites. Based on the performed analyses, we confirm that trackers are widespread (as expected), and that a small number of trackers dominates the web (Google, Facebook and Twitter). In particular, the three tracking domains with the highest PageRank are all owned by Google. The only exception to this pattern are a few countries such as China and Russia. Our results suggest that this dominance is strongly associated with country-specific political factors such as freedom of the press. Furthermore, our data confirms that Google still operates services on Chinese websites, despite its proclaimed retreat from the Chinese market. We also confirm that websites with highly privacy-critical content are less likely to contain trackers (60\% vs 90\% for other websites), even though the majority of them still do contain trackers.

...read moreread less

Proceedings Article•DOI•

A Multi-tab Website Fingerprinting Attack

[...]

Yixiao Xu¹, Tao Wang², Qi Li¹, Qingyuan Gong³, Yang Chen³, Yong Jiang¹ - Show less +2 more•Institutions (3)

Tsinghua University¹, Hong Kong University of Science and Technology², Fudan University³

03 Dec 2018

TL;DR: A multi-tab website fingerprinting attack that can accurately classify multi- tab web pages if they are requested and sequentially loaded over a short period of time is proposed and achieves a significantly higher true positive rate using a restricted chunk of packets.

...read moreread less

Abstract: In a Website Fingerprinting (WF) attack, a local, passive eavesdropper utilizes network flow information to identify which web pages a user is browsing. Previous researchers have extensively demonstrated the feasibility and effectiveness of WF, but only under the strong Single Page Assumption: the network flow extracted by the adversary always belongs to a single page. In other words, the WF classifier will never be asked to classify a network flow corresponding to more than one page, or part of a page. The Single Page Assumption is unrealistic because people often browse with multiple tabs. When this happens, the network flow induced by multiple tabs will overlap, and current WF attacks fail to classify correctly. Our work demonstrates the feasibility of WF with the relaxed Single Page Assumption: we can attack a client who visits more than one pages simultaneously. We propose a multi-tab website fingerprinting attack that can accurately classify multi-tab web pages if they are requested and sequentially loaded over a short period of time. In particular, we develop a new BalanceCascade-XGBoost scheme for an attacker to identify the start point of the second page such that the attacker can accurately classify and identify these multi-tab pages. By developing a new classifier, we only use a small chunk of packets, i.e., packets between the first page's start time to the second page's start time, to fingerprint website. Our experiments demonstrate that in the multi-tab scenario, WF attacks are still practically effective. We have an average TPR of 92.58% on SSH, and we can also averagely identify the page with a TPR of 64.94% on Tor. Specially, compared with previous WF classifiers, our attack achieves a significantly higher true positive rate using a restricted chunk of packets.

...read moreread less

Posted Content•

CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

[...]

Colin Lockard¹, Xin Luna Dong², Arash Einolghozati³, Prashant Shiralkar²•Institutions (3)

University of Washington¹, Amazon.com², Facebook³

12 Apr 2018-arXiv: Artificial Intelligence

TL;DR: In this paper, a method for automatic extraction from semi-structured websites based on distant supervision is presented. But this method is not suitable for settings with complex schemas and information-rich websites.

...read moreread less

Abstract: The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically-generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a web page and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.

...read moreread less

Proceedings Article•DOI•

BrowseWithMe: An Online Clothes Shopping Assistant for People with Visual Impairments

[...]

Abigale Stangl¹, Esha Kothari², Suyog Dutt Jain², Tom Yeh¹, Kristen Grauman², Danna Gurari² - Show less +2 more•Institutions (2)

University of Colorado Boulder¹, University of Texas at Austin²

08 Oct 2018

TL;DR: Experiments demonstrate BrowseWithMe can make online clothes shopping more accessible and produce accurate image descriptions.

...read moreread less

Abstract: Our interviews with people who have visual impairments show clothes shopping is an important activity in their lives. Unfortunately, clothes shopping web sites remain largely inaccessible. We propose design recommendations to address online accessibility issues reported by visually impaired study participants and an implementation, which we call BrowseWithMe, to address these issues. BrowseWithMe employs artificial intelligence to automatically convert a product web page into a structured representation that enables a user to interactively ask the BrowseWithMe system what the user wants to learn about a product (e.g., What is the price? Can I see a magnified image of the pants?). This enables people to be active solicitors of the specific information they are seeking rather than passive listeners of unparsed information. Experiments demonstrate BrowseWithMe can make online clothes shopping more accessible and produce accurate image descriptions.

...read moreread less

Proceedings Article•DOI•

Speed Index: Relating the Industrial Standard for User Perceived Web Performance to web QoE

[...]

Tobias Hobfeld¹, Florian Metzger¹, Dario Rossi²•Institutions (2)

University of Würzburg¹, Télécom ParisTech²

01 May 2018

TL;DR: The analysis shows that ATF-based metrics are more appropriate than pure PLT as input to Web QoE models, and develops an understanding of the SI based on a theoretical analysis and analyzes the interdependency between SI and MOS values from an existing public dataset.

...read moreread less

Abstract: In 2012, Google introduced the Speed Index (SI) metric to quantify the speed of the Web page visual completeness for the actually displayed above-the-fold (ATF) portion of a Web page In Web browsing a page might appear to the user to be already fully rendered, even though further content may still be retrieved, resulting in the Page Load Time (PLT) This happens due to the browser progressively rendering all objects, part of which can also be located below the browser window's current viewport The SI metric (and variants) thereof have since established themselves as a de facto standard in Web page and browser testing While SI is a step in the direction of including the user experience into Web metrics, the actual meaning of the metric and especially its relationship between Speed Index and Web QoE is however far from being clear The contributions of this paper are thus to first develop an understanding of the SI based on a theoretical analysis and second, to analyze the interdependency between SI and MOS values from an existing public dataset Specifically, our analysis is based on two well established models that map the user waiting time to a user ACR-rating of the QoE The analysis show that ATF-based metrics are more appropriate than pure PLT as input to Web QoE models

...read moreread less

Proceedings Article•

Vesper: Measuring time-to-interactivity for web pages

[...]

Ravi Netravali¹, Vikram Nathan¹, James Mickens², Hari Balakrishnan¹•Institutions (2)

Massachusetts Institute of Technology¹, Harvard University²

01 Jan 2018

TL;DR: This work argues that, for pages that care about user interaction, load times should be defined with respect to interactivity: a page is “loaded” when above-the-fold content is visible, and the associated JavaScript event handling state is functional.

...read moreread less

Abstract: Everyone agrees that web pages should load more quickly. However, a good definition for “page load time” is elusive. We argue that, for pages that care about user interaction, load times should be defined with respect to interactivity: a page is “loaded” when above-the-fold content is visible, and the associated JavaScript event handling state is functional. We define a new load time metric, called Ready Index, which explicitly captures our proposed notion of load time. Defining the metric is straightforward, but actually measuring it is not, since web developers do not explicitly annotate the JavaScript state and the DOM elements which support interactivity. To solve this problem, we introduce Vesper, a tool that rewrites a page’s JavaScript and HTML to automatically discover the page’s interactive state. Armed with Vesper, we compare Ready Index to prior load time metrics like Speed Index; across a variety of network conditions, prior metrics underestimate or overestimate the true load time for a page by 24%–64%. We introduce a tool that optimizes a page for Ready Index, decreasing the median time to page interactivity by 29%–32%.

...read moreread less

Book Chapter•DOI•

Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl

[...]

Janek Bevendorff¹, Benno Stein¹, Matthias Hagen¹, Martin Potthast¹•Institutions (1)

Bauhaus University, Weimar¹

26 Mar 2018

TL;DR: Elastic ChatNoir’s main purpose is to serve as a baseline for reproducible IR experiments and user studies for the coming years, empowering research at a scale not attainable to many labs beforehand, and to provide a platform for experimenting with new approaches to web search.

...read moreread less

Abstract: Elastic ChatNoir (Search:www.chatnoir.eu Code:www.github.com/chatnoir-eu) is an Elasticsearch-based search engine offering a freely accessible search interface for the two ClueWeb corpora and the Common Crawl, together about 3 billion web pages. Running across 130 nodes, Elastic ChatNoir features subsecond response times comparable to commercial search engines. Unlike most commercial search engines, it also offers a powerful API that is available free of charge to IR researchers. Elastic ChatNoir’s main purpose is to serve as a baseline for reproducible IR experiments and user studies for the coming years, empowering research at a scale not attainable to many labs beforehand, and to provide a platform for experimenting with new approaches to web search.

...read moreread less

Proceedings Article•DOI•

Automated repair of mobile friendly problems in web pages

[...]

Sonai Mahajan¹, Negarsadat Abolhassani¹, Phil McMinn², William G. J. Halfond¹•Institutions (2)

University of Southern California¹, University of Sheffield²

27 May 2018

TL;DR: A novel automated approach for repairing mobile friendly problems in web pages and in a user study, participants preferred the repaired versions of the subjects and considered the repaired pages to be more readable than the originals.

...read moreread less

Abstract: Mobile devices have become a primary means of accessing the Internet. Unfortunately, many websites are not designed to be mobile friendly. This results in problems such as unreadable text, cluttered navigation, and content overflowing a device's viewport; all of which can lead to a frustrating and poor user experience. Existing techniques are limited in helping developers repair these mobile friendly problems. To address this limitation of prior work, we designed a novel automated approach for repairing mobile friendly problems in web pages. Our empirical evaluation showed that our approach was able to successfully resolve mobile friendly problems in 95% of the evaluation subjects. In a user study, participants preferred our repaired versions of the subjects and also considered the repaired pages to be more readable than the originals.

...read moreread less

Web Crawler Architecture.

[...]

Marc Najork

01 Jan 2018

TL;DR: Web crawler as mentioned in this paper is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the Web pages identified by these hyperlinks.

...read moreread less

Abstract: Definition A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. Moreover, they are used in many other applications that process large numbers of web pages, such as web data mining, comparison shopping engines, and so on. Despite their conceptual simplicity, implementing high-performance web crawlers poses major engineering challenges due to the scale of the web. In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers. Their two main data structures – the “frontier” set of yet-to-be-crawled URLs and the set of discovered URLs – typically do not fit into main memory, so efficient disk-based representations need to be used. Finally, the need to be “polite” to content providers and not to overload any particular web server, and a desire to prioritize the crawl towards high-quality pages and to maintain corpus freshness impose additional engineering challenges.

...read moreread less

Posted Content•

TabVec: Table Vectors for Classification of Web Tables.

[...]

Majid Ghasemi-Gol, Pedro Szekely

17 Feb 2018-arXiv: Information Retrieval

TL;DR: TabVec as mentioned in this paper is an unsupervised method to embed tables into a vector space to support classification of tables into categories (entity, relational, matrix, list, and non-data) with minimal user intervention.

...read moreread less

Abstract: There are hundreds of millions of tables in Web pages that contain useful information for many applications. Leveraging data within these tables is difficult because of the wide variety of structures, formats and data encoded in these tables. TabVec is an unsupervised method to embed tables into a vector space to support classification of tables into categories (entity, relational, matrix, list, and non-data) with minimal user intervention. TabVec deploys syntax and semantics of table cells, and embeds the structure of tables in a table vector space. This enables superior classification of tables even in the absence of domain annotations. Our evaluations in four real world domains show that TabVec improves classification accuracy by more than 20% compared to three state of the art systems, and that those systems require significant in domain training to achieve good results.

...read moreread less

Journal Article•DOI•

Keyword query based focused Web crawler

[...]

Manish Kumar¹, Ankit Bindal¹, Robin Gautam¹, Rajesh Bhatia¹•Institutions (1)

PEC University of Technology¹

01 Jan 2018-Procedia Computer Science

TL;DR: A query based crawler where a set of keywords relevant to the topic of interest of the user is used to shoot queries on search interface to give the most relevant information based on the keywords in a particular domain without actually crawling through many irrelevant links in between them.

...read moreread less

Journal Article•DOI•

Evaluation of Web content accessibility in an Israeli institution of higher education

[...]

Hila Laufer Nir¹, Arie Rimmerman¹•Institutions (1)

University of Haifa¹

12 Apr 2018-Universal Access in The Information Society

TL;DR: The aim of this research is to explore the current implementation of Web accessibility in the Israeli higher education context, during a period of evolving legal changes in this regard.

...read moreread less

Abstract: Nowadays, the Web constitutes an integral part of higher education and offers an unprecedented level of access to information and services. The increasing number of students with disabilities in higher education emphasizes the need of universities and colleges to make the necessary adjustments to ensure their Web content accessibility. Despite the development of technical standards and accessibility legislation, studies around the world have consistently shown that Web content accessibility remains a concern in higher education. Mandatory Web accessibility in Israel is at an early stage. The scope of the legal requirements applicable to higher education is not entirely resolved. The aim of this research is to explore the current implementation of Web accessibility in the Israeli higher education context, during a period of evolving legal changes in this regard. An automated evaluation tool was used to measure the adherence of the sample Web pages to the technical standards. Results show that all examined Web pages presented accessibility barriers and were non-compliant with the most basic conformance level. “Contrast” and “missing alternative text” errors were the most frequent problems identified in the evaluation. The library’s Web pages exhibit relatively better level of accessibility compared to the other examined Web pages of the university. The research highlights the need for clear and enforceable legislation to encourage academic Web accessibility. Additionally, technical training and awareness raising could be key elements in improving accessibility.

...read moreread less

Proceedings Article•DOI•

Verifying that web pages have accessible layout

[...]

Pavel Panchekha¹, Adam T. Geller¹, Michael D. Ernst¹, Zachary Tatlock¹, Shoaib Kamil² - Show less +1 more•Institutions (2)

University of Washington¹, Adobe Systems²

11 Jun 2018

TL;DR: VizAssert introduces visual logic to precisely specify accessibility properties, formalizes a large fragment of the browser rendering algorithm using novel finitization reductions, and provides a sound, automated tool for verifying assertions in visual logic.

...read moreread less

Abstract: Usability and accessibility guidelines aim to make graphical user interfaces accessible to all users, by, say, requiring that text is sufficiently large, interactive controls are visible, and heading size corresponds to importance. These guidelines must hold on the infinitely many possible renderings of a web page generated by differing screen sizes, fonts, and other user preferences. Today, these guidelines are tested by manual inspection of a few renderings, because 1) the guidelines are not expressed in a formal language, 2) the semantics of browser rendering are not well understood, and 3) no tools exist to check all possible renderings of a web page. VizAssert solves these problems. First, it introduces visual logic to precisely specify accessibility properties. Second, it formalizes a large fragment of the browser rendering algorithm using novel finitization reductions. Third, it provides a sound, automated tool for verifying assertions in visual logic. We encoded 14 assertions drawn from best-practice accessibility and mobile-usability guidelines in visual logic. VizAssert checked them on on 62 professionally designed web pages. It found 64 distinct errors in the web pages, while reporting only 13 false positive warnings.

...read moreread less

Collapse