
Showing papers on "Web page" published in 2005


Journal ArticleDOI
TL;DR: Experiments on UCI data sets and application to the Web page classification task indicate that tri-training can effectively exploit unlabeled data to enhance the learning performance.
Abstract: In many practical data mining applications, such as Web page classification, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms such as co-training have attracted much attention. In this paper, a new co-training style semi-supervised learning algorithm, named tri-training, is proposed. This algorithm generates three classifiers from the original labeled example set. These classifiers are then refined using unlabeled examples in the tri-training process. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. Since tri-training neither requires the instance space to be described with sufficient and redundant views nor does it put any constraints on the supervised learning algorithm, its applicability is broader than that of previous co-training style algorithms. Experiments on UCI data sets and application to the Web page classification task indicate that tri-training can effectively exploit unlabeled data to enhance the learning performance.
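The core labeling rule described above (an unlabeled example is added to one classifier's training set when the other two classifiers agree on its label) can be sketched in a few lines. The following Python sketch is illustrative only: scikit-learn's DecisionTreeClassifier stands in for the base learner, and the error-rate conditions of the full tri-training algorithm are omitted.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tri_train(X_labeled, y_labeled, X_unlabeled, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    clfs, boots = [], []
    for _ in range(3):
        # Train three classifiers on bootstrap samples of the labeled data.
        idx = rng.integers(0, len(X_labeled), len(X_labeled))
        Xb, yb = X_labeled[idx], y_labeled[idx]
        boots.append((Xb, yb))
        clfs.append(DecisionTreeClassifier().fit(Xb, yb))
    for _ in range(rounds):
        preds = [clf.predict(X_unlabeled) for clf in clfs]
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            agree = preds[j] == preds[k]          # the other two classifiers agree
            if not agree.any():
                continue
            X_i = np.vstack([boots[i][0], X_unlabeled[agree]])
            y_i = np.concatenate([boots[i][1], preds[j][agree]])
            clfs[i] = DecisionTreeClassifier().fit(X_i, y_i)
    return clfs

def predict_majority(clfs, X):
    votes = np.stack([clf.predict(X) for clf in clfs]).astype(int)
    # Final prediction is the majority vote of the three refined classifiers.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X = np.random.default_rng(1).normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
clfs = tri_train(X[:20], y[:20], X[20:])
print(predict_majority(clfs, X[:5]))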

1,067 citations


Journal ArticleDOI
TL;DR: While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a cross-fertilization of ideas among these different communities.
Abstract: Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other semantic information). Examples of homogeneous networks include single mode social networks, such as people connected by friendship links, or the WWW, a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and contacts, or in bibliographic domains describing publications, authors, and venues. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a cross-fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this article, we review some of the common emerging themes.

1,067 citations


Proceedings ArticleDOI
15 Aug 2005
TL;DR: This research suggests that rich representations of the user and the corpus are important for personalization, but that it is possible to approximate these representations and provide efficient client-side algorithms for personalizing search.
Abstract: We formulate and study search algorithms that consider a user's prior interactions with a wide variety of content to personalize that user's current Web search. Rather than relying on the unrealistic assumption that people will precisely specify their intent when searching, we pursue techniques that leverage implicit information about the user's interests. This information is used to re-rank Web search results within a relevance feedback framework. We explore rich models of user interests, built from both search-related information, such as previously issued queries and previously visited Web pages, and other information about the user such as documents and email the user has read and created. Our research suggests that rich representations of the user and the corpus are important for personalization, but that it is possible to approximate these representations and provide efficient client-side algorithms for personalizing search. We show that such personalization algorithms can significantly improve on current Web search.
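A minimal sketch of the kind of re-ranking described above, under simplifying assumptions: a user profile is built from text the user has previously seen (pages, documents, email), and each result's engine score is interpolated with its similarity to that profile. The term vectors and the weighting are illustrative stand-ins, not the paper's models.

import math
from collections import Counter

def term_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def personalize(results, user_texts, alpha=0.7):
    # Profile aggregates terms from previously visited pages, documents and email.
    profile = term_vector(" ".join(user_texts))
    rescored = [(alpha * score + (1 - alpha) * cosine(profile, term_vector(snippet)), url)
                for url, snippet, score in results]
    return [url for _, url in sorted(rescored, reverse=True)]

results = [("a.com", "java compiler download", 0.9),
           ("b.com", "coffee beans java island", 0.8)]
print(personalize(results, ["programming languages compiler type systems"]))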

928 citations


Patent
06 Apr 2005
TL;DR: A geographical web browser, as discussed by the authors, allows a user to navigate a network application such as the World Wide Web by physically navigating in geographical coordinates, for example via a mobile unit such as a dashboard computer.
Abstract: A geographical web browser allows a user to navigate a network application such as the World Wide Web by physically navigating in geographical coordinates. For example, a geographical web browser is implemented in a mobile unit such as a dashboard computer. The mobile unit includes one or more transducers such as antennas and is operative to receive locally broadcast signals or to operate a global positioning system (GPS) receiver. As the mobile unit navigates into different physical localities, different web pages are displayed by the geographical web browser. For example, a user desiring to buy a house can set the web browser to a real estate web page. Instead of clicking on a hyperlink to access web pages of properties in an area, the user drives into a first area and automatically receives web pages relating to homes in that area. When the mobile unit crosses town and enters a second area, a new set of web pages is downloaded relating to properties in the second area. The geographical web browser, methods, apparatus and systems disclosed herein enable improved road-navigation and traffic management, advertisement, and related services.

763 citations


Proceedings ArticleDOI
10 May 2005
TL;DR: Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.
Abstract: This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting data from. The process is time consuming due to the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method to perform the task automatically. It consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. This approach enables very accurate alignment of multiple data records. Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.
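The alignment step above builds on tree matching between tag trees. The sketch below shows the classic simple tree matching score, the kind of primitive such a partial-alignment step can be built on; the (tag, children) node representation and the tiny example records are assumptions, not the authors' implementation.

def simple_tree_match(a, b):
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # Dynamic program over the two child sequences (maximum order-preserving matching).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + simple_tree_match(kids_a[i - 1], kids_b[j - 1]))
    return 1 + dp[m][n]

# Two data records with slightly different structure; the score counts matched nodes.
rec1 = ("tr", [("td", []), ("td", [("b", [])])])
rec2 = ("tr", [("td", []), ("td", [])])
print(simple_tree_match(rec1, rec2))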

572 citations


Journal ArticleDOI
01 Feb 2005
TL;DR: A circuit analysis is introduced that allows one to understand the distribution of the page score, the way different Web communities interact with each other, the role of dangling pages (pages with no outlinks), and the secrets for promotion of Web pages.
Abstract: Although the interest of a Web page is strictly related to its content and to the subjective readers' cultural background, a measure of the page authority can be provided that only depends on the topological structure of the Web. PageRank is a noticeable way to attach a score to Web pages on the basis of the Web connectivity. In this article, we look inside PageRank to disclose its fundamental properties concerning stability, complexity of computational scheme, and critical role of parameters involved in the computation. Moreover, we introduce a circuit analysis that allows us to understand the distribution of the page score, the way different Web communities interact with each other, the role of dangling pages (pages with no outlinks), and the secrets for promotion of Web pages.
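For reference, a minimal numpy sketch of the PageRank iteration the article analyzes, with dangling pages (no outlinks) redistributing their score uniformly. The damping factor 0.85 and the tiny example graph are illustrative choices, not the article's experimental setup.

import numpy as np

def pagerank(adj, d=0.85, tol=1e-10):
    n = len(adj)
    out_deg = adj.sum(axis=1)
    r = np.full(n, 1.0 / n)
    while True:
        dangling = r[out_deg == 0].sum() / n      # mass of pages with no outlinks
        r_new = np.zeros(n)
        for i in range(n):
            if out_deg[i]:
                r_new += r[i] * adj[i] / out_deg[i]
        r_new = (1 - d) / n + d * (r_new + dangling)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)          # page 2 is dangling
print(pagerank(adj))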

507 citations


Journal ArticleDOI
Pavel Berkhin
TL;DR: The theoretical foundations of the PageRank formulation are examined, along with the acceleration of PageRank computing, the effects of particular aspects of web graph structure on the optimal organization of computations, and PageRank stability.
Abstract: This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underlie PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank.

479 citations


Journal ArticleDOI
TL;DR: This presentation uses OWL to represent the mutual relationships of scientific concepts and their ancillary space, time, and environmental descriptors, with application to locating NASA Earth science data.

447 citations


Proceedings ArticleDOI
10 May 2005
TL;DR: Experimental evaluations using a real-world data set collected from an MSN search engine show that CubeSVD achieves encouraging search results in comparison with some standard methods.
Abstract: As competition in the Web search market increases, there is a high demand for personalized Web search to conduct retrieval incorporating Web users' information needs. This paper focuses on utilizing clickthrough data to improve Web search. Since millions of searches are conducted every day, a search engine accumulates a large volume of clickthrough data, which records who submits queries and which pages he/she clicks on. The clickthrough data is highly sparse and contains different types of objects (user, query and Web page), and the relationships among these objects are also very complicated. By performing analysis on these data, we attempt to discover Web users' interests and the patterns by which users locate information. In this paper, a novel approach CubeSVD is proposed to improve Web search. The clickthrough data is represented by a 3-order tensor, on which we perform 3-mode analysis using the higher-order singular value decomposition technique to automatically capture the latent factors that govern the relations among these multi-type objects: users, queries and Web pages. A tensor reconstructed based on the CubeSVD analysis reflects both the observed interactions among these objects and the implicit associations among them. Therefore, Web search activities can be carried out based on CubeSVD analysis. Experimental evaluations using a real-world data set collected from an MSN search engine show that CubeSVD achieves encouraging search results in comparison with some standard methods.
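A minimal numpy sketch of the higher-order SVD step behind this kind of 3-mode analysis, applied to a toy (user x query x page) clickthrough tensor. The truncation ranks and the toy data are illustrative, and the paper's smoothing and reweighting details are omitted.

import numpy as np

def hosvd_reconstruct(A, ranks):
    Us = []
    for mode, k in enumerate(ranks):
        # Mode-n unfolding: move `mode` to the front, flatten the rest, take truncated SVD.
        unfolded = np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        Us.append(U[:, :k])
    U1, U2, U3 = Us
    core = np.einsum('ijk,ia,jb,kc->abc', A, U1, U2, U3)
    # The reconstructed tensor exposes implicit user-query-page associations.
    return np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

A = np.zeros((3, 4, 5))                        # 3 users, 4 queries, 5 pages
A[0, 1, 2] = A[1, 1, 2] = A[2, 3, 4] = 1.0     # observed clicks
A_hat = hosvd_reconstruct(A, ranks=(2, 2, 2))
print(A_hat[0, 3])                             # estimated affinities of user 0 under query 3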

386 citations


Journal ArticleDOI
TL;DR: This paper is part of a research project undertaken at Edith Cowan, Wollongong and Sienna Universities, to build an Internet Focused Crawler that uses a "Quality" criterion in determining returns to user queries.
Abstract: Introduction--The Big Picture Over the past decade, the Internet--or World Wide Web (Technically the Internet is a huge collection of networked computers using TCP/IP protocol to exchange data. The World-wide Web (WWW) is in essence only part of this network of computers, however its visible status has meant that conceptually at least, it is often used interchangeably with "Internet" to describe the same thing.)--has established itself as the key infrastructure for information administration, exchange, and publication (Alexander & Tate, 1999), and Internet Search Engines are the most commonly used tool to retrieve that information (Wang, 2001). The deficiency of enforceable standards, however, has resulted in frequent information quality problems (Eppler & Muenzenmayer, 2002). This paper is part of a research project undertaken at Edith Cowan, Wollongong and Sienna Universities, to build an Internet Focused Crawler that uses a "Quality" criterion in determining returns to user queries. Such a task requires that the conceptual notions of quality be ultimately quantified into Search Engine algorithms that interact with Webpage technologies, eliminating documents that do not meet specifically determined standards of quality. The focus of this paper, as part of the wider research, is on the concepts of Quality in Information and Information Systems, specifically as it pertains to Information and Information Retrieval on the Internet. As with much of the research into Information Quality (IQ) in Information Systems, the term is interchangeable with Data Quality (DQ). What Is Information Quality? Data and Information Quality is commonly thought of as a multi-dimensional concept (Klein, 2001) with varying attributed characteristics depending on an author's philosophical viewpoint. Most commonly, the term "Data Quality" is described as data that is "Fit-for-use" (Wang & Strong, 1996), which implies that it is relative, as data considered appropriate for one use may not possess sufficient attributes for another use (Tayi & Ballou, 1998). IQ as a series of Dimensions Table 1 summarizes 12 widely accepted IQ Frameworks collated from the last decade of IS research. While varied in their approach and application, the frameworks share a number of characteristics regarding their classifications of the dimensions of quality. An analysis of Table 1 reveals the common elements between the different IQ Frameworks. These include such traditional dimensions as accuracy, consistency, timeliness, completeness, accessibility, objectiveness and relevancy. Table 2 provides a summary of the most common dimensions and the frequency with which they are included in the above IQ Frameworks. Each dimension also includes a short definition. IQ in the context of its use In order to accurately define and measure the concept of Information quality, it is not enough to identify the common elements of IQ Frameworks as individual entities in their own right. In fact, Information Quality needs to be assessed within the context of its generation (Shanks & Corbitt, 1999) and intended use (Katerattanakul & Siau, 1999). This is because the attributes of data quality can vary depending on the context in which the data is to be used (Shankar & Watts, 2003).
Defining what Information Quality is within the context of the World Wide Web and its Search Engines, then, will depend greatly on whether dimensions are being identified for the producers of information, the storage and maintenance systems used for information, or for the searchers and users of information. The currently accepted view of assessing IQ involves understanding it from the user's point of view. Strong and Wang (1997) suggest that quality of data cannot be assessed independent of the people who use data. Applying this commonly to the World Wide Web has its own set of problems. …

376 citations


Patent
20 May 2005
TL;DR: A tool is presented that enables a user to easily and automatically create a photo gallery of thumbnail images on a Web page: the user selects a group of original images, and the tool automatically produces a corresponding group of thumbnail images on the Web page, with hyperlinks to the corresponding original images.
Abstract: A tool that enables a user to easily and automatically create a photo gallery of thumbnail images on a Web page. A user selects a group of original images, and the tool automatically produces a corresponding group of thumbnail images on the Web page, with hyperlinks to the corresponding original images. Four predefined templates are included, each defining a different format for the thumbnail images including a vertically oriented gallery, a horizontally oriented gallery, a slide show gallery, and a montage gallery. Captions and descriptive text can also be entered and displayed for the thumbnail images in most of the style galleries. An edit function enables a user to add or delete images to existing galleries and to automatically modify the appearance of a photo gallery by selecting and applying a different template.
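A minimal sketch of the kind of thumbnail-gallery generation this entry describes, using Pillow: each selected image gets a thumbnail that links back to the original. The paths, thumbnail size and HTML layout are illustrative assumptions, not the patented tool itself.

import html
from pathlib import Path
from PIL import Image

def build_gallery(image_paths, out_dir="gallery", size=(120, 120)):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    items = []
    for src in map(Path, image_paths):
        thumb_path = out / f"thumb_{src.name}"
        with Image.open(src) as im:
            im.thumbnail(size)                 # scale down in place, keeping aspect ratio
            im.save(thumb_path)
        items.append(f'<a href="{html.escape(str(src))}">'
                     f'<img src="{html.escape(str(thumb_path))}" alt="{html.escape(src.stem)}"></a>')
    # Write a simple gallery page of thumbnails hyperlinked to the originals.
    (out / "index.html").write_text("<html><body>\n" + "\n".join(items) + "\n</body></html>")

# build_gallery(["photos/beach.jpg", "photos/garden.png"])   # hypothetical input files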

Posted Content
TL;DR: Use the web interface to obtain online either the HP-filtered trend or the HP-filtered deviations from the trend.
Abstract: Use the web interface to obtain online either the HP-filtered trend or the HP-filtered deviations from the trend.
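The Hodrick-Prescott filter such an interface presumably computes is standard: the trend tau solves (I + lambda * D'D) tau = y, where D takes second differences of the series. A minimal numpy sketch follows; lambda = 1600 is the conventional quarterly-data choice and the example series is synthetic.

import numpy as np

def hp_filter(y, lam=1600.0):
    y = np.asarray(y, dtype=float)
    T = len(y)
    D = np.zeros((T - 2, T))
    for t in range(T - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]           # second-difference operator
    trend = np.linalg.solve(np.eye(T) + lam * D.T @ D, y)
    return trend, y - trend                         # trend and deviations (cycle)

y = 0.02 * np.arange(40) + 0.05 * np.sin(np.arange(40))
trend, cycle = hp_filter(y)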

Proceedings ArticleDOI
10 May 2005
TL;DR: A technique is presented for automatically producing wrappers that can be used to extract search result records from dynamically generated result pages returned by search engines, and experimental results indicate that this technique can achieve very high extraction accuracy.
Abstract: When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to and/or snippet of a retrieved Web page. In addition, such a result page often also contains information irrelevant to the query, such as information related to the hosting site of the search engine and advertisements. In this paper, we present a technique for automatically producing wrappers that can be used to extract search result records from dynamically generated result pages returned by search engines. Automatic search result record extraction is very important for many applications that need to interact with search engines such as automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features on the result page as displayed on a browser and the HTML tag structures of the HTML source file of the result page. Experimental results indicate that this technique can achieve very high extraction accuracy.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: An additional criterion for web page ranking is introduced, namely the distance between a user profile defined using ODP topics and the sets of ODP topics covered by each URL returned in regular web search, and the boundaries of biasing PageRank on subtopics of the ODP are investigated.
Abstract: The Open Directory Project is clearly one of the largest collaborative efforts to manually annotate web pages. This effort involves over 65,000 editors and resulted in metadata specifying topic and importance for more than 4 million web pages. Still, given that this number is just about 0.05 percent of the Web pages indexed by Google, is this effort enough to make a difference? In this paper we discuss how these metadata can be exploited to achieve high quality personalized web search. First, we address this by introducing an additional criterion for web page ranking, namely the distance between a user profile defined using ODP topics and the sets of ODP topics covered by each URL returned in regular web search. We empirically show that this enhancement yields better results than current web search using Google. Then, in the second part of the paper, we investigate the boundaries of biasing PageRank on subtopics of the ODP in order to automatically extend these metadata to the whole web.
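A minimal sketch of the kind of topic distance described above: ODP topics are treated as paths in the directory hierarchy, and a result is scored by the minimum tree distance between its topics and the user-profile topics, combined with the original rank. The distance and the combining weight are illustrative assumptions, not the paper's exact formula.

def topic_distance(a, b):
    a, b = a.split('/'), b.split('/')
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)   # edges up to the common ancestor, then down

def personalized_score(result_topics, profile_topics, base_rank, alpha=0.5):
    d = min(topic_distance(r, p) for r in result_topics for p in profile_topics)
    return alpha * base_rank + (1 - alpha) * d     # lower is better

profile = ["Top/Computers/Programming/Languages/Python"]
print(personalized_score(["Top/Computers/Programming"], profile, base_rank=3))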

Patent
27 Jul 2005
TL;DR: In this paper, a system and methodology to assist users with data access activities and that includes such activities as routine web browsing and/or data access applications is presented. But, the system is limited to one-button access to user's desired web or data source information/destinations in order to mitigate efforts in retrieving and viewing such information.
Abstract: The present invention relates to a system and methodology to assist users with data access activities and that includes such activities as routine web browsing and/or data access applications. A coalesced display or montage of aggregated information is provided that is focused from a plurality of sources to achieve substantially one-button access to user's desired web or data source information/destinations in order to mitigate efforts in retrieving and viewing such information. Past web or other type data access patterns can be mined to predict future browsing sites or desired access locations. A system is provided that builds personalized web portals for associated users based on models mined from past data access patterns. The portals can provide links to web resources as well as embed content from distal (remote) pages or sites producing a montage of web or other type data content. Automated topic classification is employed to create multiple topic-centric views that can be invoked by a user.

Journal ArticleDOI
TL;DR: This article works within the hubs and authorities framework defined by Kleinberg and proposes new families of algorithms, and provides an axiomatic characterization of the INDEGREE heuristic which ranks each node according to the number of incoming links.
Abstract: The explosive growth and the widespread accessibility of the Web has led to a surge of research activity in the area of information retrieval on the World Wide Web. The seminal papers of Kleinberg [1998, 1999] and Brin and Page [1998] introduced Link Analysis Ranking, where hyperlink structures are used to determine the relative authority of a Web page and produce improved algorithms for the ranking of Web search results. In this article we work within the hubs and authorities framework defined by Kleinberg and we propose new families of algorithms. Two of the algorithms we propose use a Bayesian approach, as opposed to the usual algebraic and graph theoretic approaches. We also introduce a theoretical framework for the study of Link Analysis Ranking algorithms. The framework allows for the definition of specific properties of Link Analysis Ranking algorithms, as well as for comparing different algorithms. We study the properties of the algorithms that we define, and we provide an axiomatic characterization of the INDEGREE heuristic which ranks each node according to the number of incoming links. We conclude the article with an extensive experimental evaluation. We study the quality of the algorithms, and we examine how different structures in the graphs affect their performance.
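For concreteness, a minimal sketch of Kleinberg's hubs-and-authorities iteration and of the INDEGREE heuristic the article characterizes axiomatically. The tiny adjacency matrix is illustrative.

import numpy as np

def hits(adj, iters=100):
    n = len(adj)
    hubs, auths = np.ones(n), np.ones(n)
    for _ in range(iters):
        auths = adj.T @ hubs            # a good authority is pointed to by good hubs
        hubs = adj @ auths              # a good hub points to good authorities
        auths /= np.linalg.norm(auths)
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

def indegree(adj):
    return adj.sum(axis=0)              # rank each node by its number of incoming links

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
print(hits(adj), indegree(adj))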

Proceedings ArticleDOI
10 May 2005
TL;DR: This paper presents two unsupervised frameworks for solving this problem: one based on link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering (A/CDC)---an application of a recently introduced multi-way distributional clustering method.
Abstract: Say you are looking for information about a particular person. A search engine returns many pages for that person's name but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? This paper presents two unsupervised frameworks for solving this problem: one based on link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering (A/CDC)---an application of a recently introduced multi-way distributional clustering method. To evaluate our methods, we collected and hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on 12 personal names appearing together in someone's email folder. On this dataset our methods outperform traditional agglomerative clustering by more than 20%, achieving over 80% F-measure.

Journal ArticleDOI
TL;DR: Full personalization is achieved by a novel algorithm that precomputes a compact database; using this database, it can serve online responses to arbitrary user-selected personalization, and the authors prove that for a fixed error probability the size of the database is linear in the number of web pages.
Abstract: Personalized PageRank expresses link-based page quality around user-selected pages in a similar way as PageRank expresses quality over the entire web. Existing personalized PageRank algorithms can, however, serve online queries only for a restricted choice of pages. In this paper we achieve full personalization by a novel algorithm that precomputes a compact database; using this database, it can serve online responses to arbitrary user-selected personalization. The algorithm uses simulated random walks; we prove that for a fixed error probability the size of our database is linear in the number of web pages. We justify our estimation approach by asymptotic worst-case lower bounds: we show that on some sets of graphs, exact personalized PageRank values can only be obtained from a database of size quadratic in the number of vertices. Furthermore, we evaluate the precision of approximation experimentally on the Stanford WebBase graph.
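A minimal sketch of estimating personalized PageRank with simulated random walks, in the spirit of the precomputed-fingerprint idea: each walk starts at the personalization page, continues with probability c at every step, and the distribution of walk endpoints approximates the personalized PageRank vector. The walk count, c, the handling of dangling nodes and the toy graph are illustrative simplifications.

import random
from collections import Counter

def approx_ppr(out_links, source, c=0.85, walks=10000, seed=0):
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(walks):
        v = source
        while rng.random() < c and out_links.get(v):
            v = rng.choice(out_links[v])      # follow a random outlink
        ends[v] += 1                          # record the walk's endpoint ("fingerprint")
    return {node: count / walks for node, count in ends.items()}

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(approx_ppr(out_links, "a"))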

Journal ArticleDOI
TL;DR: A look at how developers are going back to the future by building Web applications using Ajax (Asynchronous JavaScript and XML), a set of technologies mostly developed in the 1990s.
Abstract: This article looks at how developers are going back to the future by building Web applications using Ajax (Asynchronous JavaScript and XML), a set of technologies mostly developed in the 1990s. A key advantage of Ajax applications is that they look and act more like desktop applications. Proponents argue that Ajax applications perform better than traditional Web programs. As an example, an Ajax application can add or retrieve new data for the page it is working with, and the page will update immediately without reloading.

Proceedings ArticleDOI
31 Oct 2005
TL;DR: This paper proposes to use a hybrid index structure, which integrates inverted files and R*-trees, to handle both textual and location-aware queries, and designs and implements a complete location-based web search engine.
Abstract: There is more and more commercial and research interest in location-based web search, i.e. finding web content whose topic is related to a particular place or region. In this type of search, location information should be indexed as well as text information. However, the index of conventional text search engine is set-oriented, while location information is two-dimensional and in Euclidean space. This brings new research problems on how to efficiently represent the location attributes of web pages and how to combine two types of indexes. In this paper, we propose to use a hybrid index structure, which integrates inverted files and R*-trees, to handle both textual and location aware queries. Three different combining schemes are studied: (1) inverted file and R*-tree double index, (2) first inverted file then R*-tree, (3) first R*-tree then inverted file. To validate the performance of proposed index structures, we design and implement a complete location-based web search engine which mainly consists of four parts: (1) an extractor which detects geographical scopes of web pages and represents geographical scopes as multiple MBRs based on geographical coordinates; (2) an indexer which builds hybrid index structures to integrate text and location information; (3) a ranker which ranks results by geographical relevance as well as non-geographical relevance; (4) an interface which is friendly for users to input location-based search queries and to obtain geographical and textual relevant results. Experiments on large real-world web dataset show that both the second and the third structures are superior in query time and the second is slightly better than the third. Additionally, indexes based on R*-trees are proven to be more efficient than indexes based on grid structures.
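A minimal sketch of the "first inverted file, then spatial filter" scheme: the inverted index narrows candidates by keyword, then pages whose geographical scope (MBRs) intersects the query region are kept. A brute-force bounding-box test stands in here for the R*-tree, and the data structures are illustrative.

def intersects(mbr, region):
    (x1, y1, x2, y2), (qx1, qy1, qx2, qy2) = mbr, region
    return x1 <= qx2 and qx1 <= x2 and y1 <= qy2 and qy1 <= y2

def location_search(inverted, page_mbrs, term, region):
    candidates = inverted.get(term, set())        # step 1: textual filter via inverted file
    return [p for p in candidates                 # step 2: spatial filter on the pages' MBRs
            if any(intersects(m, region) for m in page_mbrs.get(p, []))]

inverted = {"pizza": {"p1", "p2"}}
page_mbrs = {"p1": [(0, 0, 1, 1)], "p2": [(5, 5, 6, 6)]}
print(location_search(inverted, page_mbrs, "pizza", (0.5, 0.5, 2, 2)))   # -> ['p1']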

Proceedings ArticleDOI
27 Nov 2005
TL;DR: A new methodology is proposed and tested that uses multilinear algebra to elicit more information from a higher-order representation of the hyperlink graph, automatically identifying topics in the collection along with the associated authoritative Web pages.
Abstract: Linear algebra is a powerful and proven tool in Web search. Techniques, such as the PageRank algorithm of Brin and Page and the HITS algorithm of Kleinberg, score Web pages based on the principal eigenvector (or singular vector) of a particular non-negative matrix that captures the hyperlink structure of the Web graph. We propose and test a new methodology that uses multilinear algebra to elicit more information from a higher-order representation of the hyperlink graph. We start by labeling the edges in our graph with the anchor text of the hyperlinks so that the associated linear algebra representation is a sparse, three-way tensor. The first two dimensions of the tensor represent the Web pages while the third dimension adds the anchor text. We then use the rank-1 factors of a multilinear PARAFAC tensor decomposition, which are akin to singular vectors of the SVD, to automatically identify topics in the collection along with the associated authoritative Web pages.
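A minimal sketch of a PARAFAC (CP) decomposition by alternating least squares on a toy (page x page x anchor-term) tensor. The rank, iteration count and random initialization are illustrative, and the paper's exact fitting procedure may differ.

import numpy as np

def parafac(X, rank=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[0], rank))
    V = rng.standard_normal((X.shape[1], rank))
    W = rng.standard_normal((X.shape[2], rank))
    for _ in range(iters):
        # Alternating least-squares updates for each factor matrix.
        U = np.einsum('ijk,jr,kr->ir', X, V, W) @ np.linalg.pinv((V.T @ V) * (W.T @ W))
        V = np.einsum('ijk,ir,kr->jr', X, U, W) @ np.linalg.pinv((U.T @ U) * (W.T @ W))
        W = np.einsum('ijk,ir,jr->kr', X, U, V) @ np.linalg.pinv((U.T @ U) * (V.T @ V))
    # Each rank-1 triple (U[:, r], V[:, r], W[:, r]) groups pages and anchor terms into a topic.
    return U, V, W

X = np.zeros((4, 4, 3))
X[0, 1, 0] = X[0, 2, 0] = X[3, 2, 1] = 1.0   # hyperlinks labeled with anchor-term ids
U, V, W = parafac(X)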

Proceedings ArticleDOI
10 May 2005
TL;DR: A static program analysis that approximates the string output of a program with a context-free grammar is developed that can be used to check various properties of a server-side program and the pages it generates.
Abstract: Server-side programming is one of the key technologies that support today's WWW environment. It makes it possible to generate Web pages dynamically according to a user's request and to customize pages for each user. However, the flexibility obtained by server-side programming makes it much harder to guarantee validity and security of dynamically generated pages. To check statically the properties of Web pages generated dynamically by a server-side program, we develop a static program analysis that approximates the string output of a program with a context-free grammar. The approximation obtained by the analyzer can be used to check various properties of a server-side program and the pages it generates. To demonstrate the effectiveness of the analysis, we have implemented a string analyzer for the server-side scripting language PHP. The analyzer is successfully applied to publicly available PHP programs to detect cross-site scripting vulnerabilities and to validate pages they generate dynamically.

Proceedings Article
01 Jan 2005
TL;DR: This paper performs a large-scale, longitudinal study of the Web, sampling both executables and conventional Web pages for malicious objects, and quantifies the density of spyware, the types of threats, and the most dangerous Web zones in which spyware is likely to be encountered.
Abstract: Malicious spyware poses a significant threat to desktop security and integrity. This paper examines that threat from an Internet perspective. Using a crawler, we performed a large-scale, longitudinal study of the Web, sampling both executables and conventional Web pages for malicious objects. Our results show the extent of spyware content. For example, in a May 2005 crawl of 18 million URLs, we found spyware in 13.4% of the 21,200 executables we identified. At the same time, we found scripted “drive-by download” attacks in 5.9% of the Web pages we processed. Our analysis quantifies the density of spyware, the types of threats, and the most dangerous Web zones in which spyware is likely to be encountered. We also show the frequency with which specific spyware programs were found in the content we crawled. Finally, we measured changes in the density of spyware over time; e.g., our October 2005 crawl saw a substantial reduction in the presence of drive-by download attacks, compared with those we detected in May.

Patent
Shumeet Baluja
29 Jun 2005
TL;DR: In this paper, the authors propose a call-on-select functionality, which allows a user device to automatically dial a telephone number associated with the ad by an advertiser, instead of loading a document (e.g., Web page) for rendering.
Abstract: The serving of one or more ads to a user device considers determined characteristics of a user device, such as whether or not the user device supports telephone calls. At least some ads may include call-on-select functionality. When such an ad is selected (e.g., via a button click), instead of loading a document (e.g., Web page) for rendering, a telephone number associated with the ad by an advertiser can be automatically dialed.

Proceedings ArticleDOI
10 May 2005
TL;DR: Algorithms for detecting link farms automatically are presented by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it, providing a modified web graph to use in ranking page importance.
Abstract: With the increasing importance of search in guiding today's web traffic, more and more effort has been spent to create search engine spam. Since link analysis is one of the most important factors in current commercial search engines' ranking systems, new kinds of spam aiming at links have appeared. Building link farms is one technique that can deteriorate link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it. Links between identified pages are re-weighted, providing a modified web graph to use in ranking page importance. Experimental results show that we can identify most link farm spam pages and the final ranking results are improved for almost all tested queries.
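A minimal sketch of the seed-generation idea described above: a page whose incoming and outgoing link sets overlap heavily is a likely link-farm participant, and the seed set is then expanded to pages that link densely into it. The threshold and the one-step expansion are illustrative simplifications of the paper's algorithm.

def link_farm_seeds(in_links, out_links, threshold=3):
    # Seed pages share many members between their incoming and outgoing link sets.
    return {p for p in out_links
            if len(in_links.get(p, set()) & out_links.get(p, set())) >= threshold}

def expand(seeds, in_links, out_links, min_links_to_seeds=2):
    grown = set(seeds)
    for p in out_links:
        if p not in grown and len(out_links[p] & seeds) >= min_links_to_seeds:
            grown.add(p)        # pages pointing heavily into the seed set join it
    return grown

out_links = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "e": {"a", "b"}}
in_links = {"a": {"b", "c", "d", "e"}, "b": {"a", "c", "d", "e"}}
seeds = link_farm_seeds(in_links, out_links)
print(expand(seeds, in_links, out_links))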

Proceedings ArticleDOI
10 May 2005
TL;DR: Thresher is described, a system that lets non-technical users teach their browsers how to extract semantic web content from HTML documents on the World Wide Web, and which enables a rich semantic interaction with existing web pages, "unwrapping" semantic data buried in the pages' HTML.
Abstract: We describe Thresher, a system that lets non-technical users teach their browsers how to extract semantic web content from HTML documents on the World Wide Web. Users specify examples of semantic content by highlighting them in a web browser and describing their meaning. We then use the tree edit distance between the DOM subtrees of these examples to create a general pattern, or wrapper, for the content, and allow the user to bind RDF classes and predicates to the nodes of these wrappers. By overlaying matches to these patterns on standard documents inside the Haystack semantic web browser, we enable a rich semantic interaction with existing web pages, "unwrapping" semantic data buried in the pages' HTML. By allowing end-users to create, modify, and utilize their own patterns, we hope to speed adoption and use of the Semantic Web and its applications.

01 Jan 2005
TL;DR: Unlike OWL-S atomic processes, this work does not use a “pre-condition”, or equivalently, it assumes that the pre-condition is uniformly true, to enable a more uniform treatment of atomic process executions.
Abstract: Remark 1.1: Unlike OWL-S atomic processes, we do not use a “pre-condition”, or equivalently, we assume that the pre-condition is uniformly true. We do this to enable a more uniform treatment of atomic process executions: when a web service invokes an atomic process in Colombo, the invoking service will transition to a new state whether or not the atomic process “succeeds”. Optionally, the designer of the atomic process can include an output boolean variable ‘flag’, which is set to true if the execution “succeeded” and is set to false if the execution “failed”. These are conveniences that simplify bookkeeping, with no real impact on expressive power.

Patent
13 May 2005
Abstract: Phishing detection, prevention, and notification is described. In an embodiment, a messaging application facilitates communication via a messaging user interface, and receives a communication, such as an email message, from a domain. A phishing detection module detects a phishing attack in the communication by determining that the domain is similar to a known phishing domain, or by detecting suspicious network properties of the domain. In another embodiment, a Web browsing application receives content, such as data for a Web page, from a network-based resource, such as a Web site or domain. The Web browsing application initiates a display of the content, and a phishing detection module detects a phishing attack in the content by determining that a domain of the network-based resource is similar to a known phishing domain, or that an address of the network-based resource from which the content is received has suspicious network properties.
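A minimal sketch of the "domain is similar to a known phishing domain" check, using edit distance between domain names. The blacklist and threshold are illustrative, and a real detector would also weigh the suspicious network properties mentioned above.

def edit_distance(a, b):
    # Rolling-array Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def looks_like_phishing(domain, known_phishing_domains, max_distance=2):
    # Flag domains that are close to, but not identical with, a known phishing domain.
    return any(0 < edit_distance(domain, known) <= max_distance
               for known in known_phishing_domains)

print(looks_like_phishing("examp1e-bank.com", {"example-bank.com"}))   # True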

Proceedings ArticleDOI
31 Oct 2005
TL;DR: This work demonstrates the usefulness of the uniform resource locator (URL) alone in performing web page classification and shows that in certain scenarios, URL-based methods approach the performance of current state-of-the-art full-text and link-based methods.
Abstract: We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is faster than typical web page classification, as the pages do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness on two standardized domains. Our results show that in certain scenarios, URL-based methods approach the performance of current state-of-the-art full-text and link-based methods.
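A minimal sketch of the URL-only idea: split each URL into chunks and train a maximum-entropy model, with scikit-learn's LogisticRegression standing in for the maxent learner. The toy URLs, labels and splitting rule are illustrative and far simpler than the paper's component, sequential and orthographic features.

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def url_chunks(url):
    # Segment the URL into meaningful chunks on common delimiters.
    return [t for t in re.split(r"[/\.\?=&_\-:]+", url.lower()) if t]

urls = ["www.cs.example.edu/~alice/research/papers.html",
        "shop.example.com/cart/checkout?item=42",
        "www.example.edu/dept/faculty/index.html",
        "store.example.com/products/shoes"]
labels = ["academic", "commerce", "academic", "commerce"]

model = make_pipeline(CountVectorizer(analyzer=url_chunks), LogisticRegression())
model.fit(urls, labels)
print(model.predict(["www.example.edu/~bob/teaching"]))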

Proceedings ArticleDOI
23 Oct 2005
TL;DR: Chickenfoot is described, a programming system embedded in the Firefox web browser, which enables end-users to automate, customize, and integrate web applications without examining their source code.
Abstract: On the desktop, an application can expect to control its user interface down to the last pixel, but on the World Wide Web, a content provider has no control over how the client will view the page, once delivered to the browser. This creates an opportunity for end-users who want to automate and customize their web experiences, but the growing complexity of web pages and standards prevents most users from realizing this opportunity. We describe Chickenfoot, a programming system embedded in the Firefox web browser, which enables end-users to automate, customize, and integrate web applications without examining their source code. One way Chickenfoot addresses this goal is a novel technique for identifying page components by keyword pattern matching. We motivate this technique by studying how users name web page components, and present a heuristic keyword matching algorithm that identifies the desired component from the user's name.
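A minimal sketch of heuristic keyword matching of the kind described above: score each page component by how well the user's keywords overlap its visible label text, and return the best match. The component records are an illustrative stand-in for real DOM elements, not Chickenfoot's actual algorithm.

def best_component(keywords, components):
    words = set(keywords.lower().split())
    def score(component):
        label_words = set(component["label"].lower().split())
        # Jaccard overlap between the user's keywords and the component's label.
        return len(words & label_words) / len(words | label_words) if words & label_words else 0.0
    return max(components, key=score)

components = [{"id": "q",     "label": "Search the web"},
              {"id": "lucky", "label": "I'm Feeling Lucky"},
              {"id": "btnG",  "label": "Google Search"}]
print(best_component("search button", components)["id"])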