
Showing papers on "Web page" published in 2004


Book ChapterDOI
Xin Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang
31 Aug 2004
TL;DR: Woogle supports similarity search for web services, such as finding similar web-service operations and finding operations that compose with a given one, and novel techniques to support these types of searches are described.
Abstract: Web services are loosely coupled software components, published, located, and invoked across the web. The growing number of web services available within an organization and on the Web raises a new and challenging search problem: locating desired web services. Traditional keyword search is insufficient in this context: the specific types of queries users require are not captured, the very small text fragments in web services are unsuitable for keyword search, and the underlying structure and semantics of the web services are not exploited. We describe the algorithms underlying the Woogle search engine for web services. Woogle supports similarity search for web services, such as finding similar web-service operations and finding operations that compose with a given one. We describe novel techniques to support these types of searches, and an experimental study on a collection of over 1500 web-service operations that shows the high recall and precision of our algorithms.

828 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: Experimental results show that search systems that adapt to each user's preferences can be achieved by constructing user profiles based on modified collaborative filtering with detailed analysis of user's browsing history in one day.
Abstract: Web search engines help users find useful information on the World Wide Web (WWW). However, when the same query is submitted by different users, typical search engines return the same result regardless of who submitted the query. Generally, each user has different information needs for his/her query. Therefore, the search result should be adapted to users with different information needs. In this paper, we first propose several approaches to adapting search results according to each user's need for relevant information without any user effort, and then verify the effectiveness of our proposed approaches. Experimental results show that search systems that adapt to each user's preferences can be achieved by constructing user profiles based on modified collaborative filtering with detailed analysis of user's browsing history in one day.
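As a rough illustration of the profile-based adaptation described above, the sketch below builds a term-frequency profile from one day's browsing history and re-ranks candidate results against it. The paper's approach additionally uses a modified collaborative filtering step across users, which is not shown here; all page texts, queries, and function names are made-up examples.

```python
# Minimal sketch: build a user profile from browsed pages and re-rank results.
# The collaborative-filtering component of the paper is omitted; data is toy data.
from collections import Counter

def build_profile(browsed_pages):
    """Aggregate term frequencies over the pages a user browsed in one day."""
    profile = Counter()
    for text in browsed_pages:
        profile.update(text.lower().split())
    return profile

def rerank(results, profile):
    """Order search results by how strongly their text overlaps the profile."""
    def score(text):
        return sum(profile[t] for t in text.lower().split())
    return sorted(results, key=score, reverse=True)

history = ["python pandas dataframe tutorial", "python machine learning guide"]
results = ["jaguar car dealership prices", "python jaguar library for queues"]
print(rerank(results, build_profile(history)))  # the Python-related result comes first
```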

782 citations


Proceedings ArticleDOI
25 Jul 2004
TL;DR: Web-a-Where, a system for associating geography with Web pages that locates mentions of places and determines the place each name refers to, is described, along with an implementation of the tagger within the framework of the WebFountain data mining system.
Abstract: We describe Web-a-Where, a system for associating geography with Web pages. Web-a-Where locates mentions of places and determines the place each name refers to. In addition, it assigns to each page a geographic focus --- a locality that the page discusses as a whole. The tagging process is simple and fast, aimed to be applied to large collections of Web pages and to facilitate a variety of location-based applications and data analyses. Geotagging involves arbitrating two types of ambiguities: geo/non-geo and geo/geo. A geo/non-geo ambiguity occurs when a place name also has a non-geographic meaning, such as a person name (e.g., Berlin) or a common word (Turkey). Geo/geo ambiguity arises when distinct places have the same name, as in London, England vs. London, Ontario. An implementation of the tagger within the framework of the WebFountain data mining system is described, and evaluated on several corpora of real Web pages. Precision of up to 82% on individual geotags is achieved. We also evaluate the relative contribution of various heuristics the tagger employs, and evaluate the focus-finding algorithm using a corpus pretagged with localities, showing that as many as 91% of the foci reported are correct up to the country level.
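To make the two ambiguity types concrete, here is a toy Python sketch of a gazetteer-based tagger: ambiguous surface forms require a nearby place trigger (geo/non-geo), and among candidate places the most prominent one wins (geo/geo). The gazetteer, trigger words, and population figures are invented for illustration and are not the heuristics evaluated in the paper.

```python
# Toy gazetteer-based geotagger illustrating geo/non-geo and geo/geo arbitration.
# All entries below are illustrative, not Web-a-Where's actual gazetteer or rules.
GAZETTEER = {
    "london": [("London", "England", 8_900_000), ("London", "Ontario", 400_000)],
    "berlin": [("Berlin", "Germany", 3_600_000)],
    "turkey": [("Turkey", None, 84_000_000)],
}
NON_GEO_SENSES = {"turkey", "berlin"}   # names that also have non-place senses

def geotag(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        candidates = GAZETTEER.get(tok.lower())
        if not candidates or not tok[0].isupper():
            continue
        # geo/non-geo: ambiguous names need a nearby place trigger (e.g. "in", "near")
        if tok.lower() in NON_GEO_SENSES and (
            i == 0 or tokens[i - 1].lower() not in {"in", "near", "from"}
        ):
            continue
        # geo/geo: prefer the most prominent candidate place (largest population here)
        tags.append(max(candidates, key=lambda c: c[2]))
    return tags

print(geotag("We met Berlin in London".split()))
# only ('London', 'England', ...) is tagged; "Berlin" (a surname here) is skipped
```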

603 citations


01 Jan 2004
TL;DR: The Web Services Choreography Description Language (WS-CDL) as mentioned in this paper is an XML-based language that describes peer-to-peer collaborations of parties by defining, from a global viewpoint, their common and complementary observable behavior; where ordered message exchanges result in accomplishing a common business goal.
Abstract: The Web Services Choreography Description Language (WS-CDL) is an XML-based language that describes peer-to-peer collaborations of parties by defining, from a global viewpoint, their common and complementary observable behavior; where ordered message exchanges result in accomplishing a common business goal. The Web Services specifications offer a communication bridge between the heterogeneous computational environments used to develop and host applications. The future of E-Business applications requires the ability to perform long-lived, peer-to-peer collaborations between the participating services, within or across the trusted domains of an organization. The Web Services Choreography specification is targeted for composing interoperable, peer-to-peer collaborations between any type of party regardless of the supporting platform or programming model used by the implementation of the hosting environment.

602 citations


Book ChapterDOI
01 Jan 2004
TL;DR: This chapter introduces web services and explains their role in Microsoft’s vision of the programmable web and removes some of the confusion surrounding technical terms like WSDL, SOAP, and UDDI.
Abstract: Microsoft has promoted ASP.NET’s new web services more than almost any other part of the .NET Framework. But despite these efforts, confusion is still widespread about what a web service is and, more importantly, what it’s meant to accomplish. This chapter introduces web services and explains their role in Microsoft’s vision of the programmable web. Along the way, you’ll learn about the open standards plumbing that allows web services to work and removes some of the confusion surrounding technical terms like WSDL (Web Service Description Language), SOAP, and UDDI (universal description, discovery, and integration).

546 citations


Proceedings ArticleDOI
19 May 2004
TL;DR: The weighted PageRank algorithm (WPR), an extension to the standard PageRank algorithm, is introduced, which takes into account the importance of both the inlinks and the outlinks of the pages and distributes rank scores based on the popularity of the pages.
Abstract: With the rapid growth of the Web, users easily get lost in the rich hyper structure. Providing the relevant information to users to cater to their needs is the primary goal of Website owners. Therefore, finding the content of the Web and retrieving the users' interests and needs from their behavior have become increasingly important. Web mining is used to categorize users and pages by analyzing user behavior, the content of the pages, and the order of the URLs that tend to be accessed. Web structure mining plays an important role in this approach. Two page ranking algorithms, HITS and PageRank, are commonly used in Web structure mining. Both algorithms treat all links equally when distributing rank scores. Several algorithms have been developed to improve the performance of these methods. The weighted PageRank algorithm (WPR), an extension to the standard PageRank algorithm, is introduced. WPR takes into account the importance of both the inlinks and the outlinks of the pages and distributes rank scores based on the popularity of the pages. The results of our simulation studies show that WPR performs better than the conventional PageRank algorithm in terms of returning a larger number of relevant pages to a given query.
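For readers who want to see the weighting idea spelled out, the following is a minimal Python sketch of a WPR-style iteration: the rank a page u receives from an in-neighbor v is scaled by u's share of inlinks and outlinks among v's link targets, rather than being split evenly as in standard PageRank. The toy graph, damping factor, and iteration count are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch of a Weighted PageRank (WPR) style iteration on a toy link graph.
def weighted_pagerank(graph, d=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    in_links = {p: [q for q in pages if p in graph[q]] for p in pages}
    I = {p: len(in_links[p]) for p in pages}          # inlink counts
    O = {p: len(graph[p]) for p in pages}             # outlink counts

    wpr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for u in pages:
            s = 0.0
            for v in in_links[u]:
                targets = graph[v]
                # u's share of the inlinks / outlinks among v's link targets
                w_in = I[u] / max(sum(I[t] for t in targets), 1)
                w_out = O[u] / max(sum(O[t] for t in targets), 1)
                s += wpr[v] * w_in * w_out
            new[u] = (1 - d) + d * s
        wpr = new
    return wpr

if __name__ == "__main__":
    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(weighted_pagerank(toy_web))
```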

535 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present different techniques for intelligently selecting parts of different-order Markov models so that the resulting model has a reduced state complexity while maintaining a high predictive accuracy.
Abstract: The problem of predicting a user's behavior on a Web site has gained importance due to the rapid growth of the World Wide Web and the need to personalize and influence a user's browsing experience. Markov models and their variations have been found to be well suited for addressing this problem. Of the different variations of Markov models, it is generally found that higher-order Markov models display high predictive accuracies on Web sessions that they can predict. However, higher-order models are also extremely complex due to their large number of states, which increases their space and run-time requirements. In this article, we present different techniques for intelligently selecting parts of different order Markov models so that the resulting model has a reduced state complexity, while maintaining a high predictive accuracy.
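A small sketch of the underlying idea, assuming made-up Web sessions: build all k-th order Markov states from the sessions, prune states below a frequency/support threshold to reduce state complexity (the paper also describes confidence- and error-based pruning, not shown), and predict with the longest surviving state. Function names and data are invented for illustration.

```python
# Toy sketch of selectively pruning higher-order Markov states for next-page prediction.
from collections import Counter, defaultdict

def build_models(sessions, max_order=3):
    """Count next-page transitions for every state (tuple of the last k pages)."""
    counts = defaultdict(Counter)
    for s in sessions:
        for k in range(1, max_order + 1):
            for i in range(len(s) - k):
                counts[tuple(s[i:i + k])][s[i + k]] += 1
    return counts

def prune_by_support(counts, min_support=2):
    """Drop states seen fewer than min_support times (reduces state complexity)."""
    return {st: c for st, c in counts.items() if sum(c.values()) >= min_support}

def predict(models, history):
    """Use the longest matching surviving state to predict the next page."""
    for k in range(len(history), 0, -1):
        state = tuple(history[-k:])
        if state in models:
            return models[state].most_common(1)[0][0]
    return None

sessions = [["home", "news", "sports"], ["home", "news", "sports"],
            ["home", "shop", "cart"], ["news", "sports", "scores"]]
models = prune_by_support(build_models(sessions), min_support=2)
print(predict(models, ["home", "news"]))   # -> "sports"
```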

532 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: The authors' findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them; for pages that persist, the rate of content change tends to remain consistent over time.
Abstract: We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change. Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
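As an illustration of the kind of search-centric change measure mentioned above, the sketch below computes a TF.IDF cosine distance between two snapshots of the same page. The tokenization and IDF weighting are simplified assumptions, not the authors' exact scheme.

```python
# Hedged sketch: cosine distance between TF.IDF vectors of two page snapshots.
import math
from collections import Counter

def tfidf_vectors(doc_a, doc_b):
    """Build TF.IDF vectors over the two-snapshot 'corpus' (toy IDF)."""
    tf_a, tf_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    vocab = set(tf_a) | set(tf_b)
    def idf(t):  # document frequency computed over just the two snapshots
        df = (t in tf_a) + (t in tf_b)
        return math.log(2 / df) + 1.0
    va = {t: tf_a[t] * idf(t) for t in vocab}
    vb = {t: tf_b[t] * idf(t) for t in vocab}
    return va, vb

def cosine_distance(va, vb):
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

week1 = "breaking news about the election results and analysis"
week2 = "breaking news about sports scores and match analysis"
print(round(cosine_distance(*tfidf_vectors(week1, week2)), 3))
```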

511 citations


Proceedings Article
01 Jan 2004
TL;DR: A framework for client-side defense is proposed: a browser plug-in that examines web pages and warns the user when requests for data may be part of a spoof attack.
Abstract: Web spoofing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. We discuss some aspects of common attacks and propose a framework for client-side defense: a browser plug-in that examines web pages and warns the user when requests for data may be part of a spoof attack. While the plug-in, SpoofGuard, has been tested using actual sites obtained through government agencies concerned about the problem, we expect that web spoofing and other forms of identity theft will be continuing problems.

487 citations


Journal ArticleDOI
TL;DR: In this article, the authors reviewed the literature on computer response time and users' waiting time for download of Web pages, and assessed Web users' tolerable waiting time in information retrieval.
Abstract: Web users often face a long waiting time for downloading Web pages. Although various technologies and techniques have been implemented to alleviate the situation and to comfort the impatient users, little research has been done to assess what constitutes an acceptable and tolerable waiting time for Web users. This research reviews the literature on computer response time and users' waiting time for download of Web pages, and assesses Web users' tolerable waiting time in information retrieval. It addresses the following questions through an experimental study: What is the effect of feedback on users' tolerable waiting time? How long are users willing to wait for a Web page to be downloaded before abandoning it? The results from this study suggest that the presence of feedback prolongs Web users' tolerable waiting time and the tolerable waiting time for information retrieval is approximately 2 s.

480 citations


Journal ArticleDOI
TL;DR: An ontology of time is being developed for describing the temporal content of Web pages and the temporal properties of Web services, which covers topological properties of instants and intervals, measures of duration, and the meanings of clock and calendar terms.
Abstract: In connection with the DAML project for bringing about the Semantic Web, an ontology of time is being developed for describing the temporal content of Web pages and the temporal properties of Web services. This ontology covers topological properties of instants and intervals, measures of duration, and the meanings of clock and calendar terms.

Patent
22 Jan 2004
TL;DR: In this paper, a mobile deixis device includes a camera to capture an image and a wireless handheld device coupled to the camera and to a wireless network to communicate the image with existing databases to find similar images.
Abstract: A mobile deixis device includes a camera to capture an image and a wireless handheld device, coupled to the camera and to a wireless network, to communicate the image with existing databases to find similar images. The mobile deixis device further includes a processor, coupled to the device, to process found database records related to similar images and a display to view found database records that include web pages including images. With such an arrangement, users can specify a location of interest by simply pointing a camera-equipped cellular phone at the object of interest and, by searching an image database or relevant web resources, can quickly identify good matches from several close ones to find an object of interest.

Proceedings ArticleDOI
Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma, Ji-Rong Wen
10 Oct 2004
TL;DR: A hierarchical clustering method using visual, textual and link analysis is proposed to organize Web image search results into different semantic clusters and facilitate users' browsing.
Abstract: We consider the problem of clustering Web image search results. Generally, the image search results returned by an image search engine contain multiple topics. Organizing the results into different semantic clusters facilitates users' browsing. In this paper, we propose a hierarchical clustering method using visual, textual and link analysis. By using a vision-based page segmentation algorithm, a web page is partitioned into blocks, and the textual and link information of an image can be accurately extracted from the block containing that image. By using block-level link analysis techniques, an image graph can be constructed. We then apply spectral techniques to find a Euclidean embedding of the images which respects the graph structure. Thus for each image, we have three kinds of representations, i.e. visual feature based representation, textual feature based representation and graph based representation. Using spectral clustering techniques, we can cluster the search results into different semantic clusters. An image search example illustrates the potential of these techniques.
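A minimal sketch of the spectral step, assuming a precomputed image-to-image affinity matrix (which in the paper would combine the visual, textual, and block-level link representations): spectral clustering of that matrix yields the semantic groups. The random affinity matrix and the scikit-learn call below are stand-ins for illustration, not the authors' implementation.

```python
# Minimal sketch: spectrally cluster images given a precomputed affinity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n_images = 12
# Toy symmetric affinity matrix standing in for the combined similarity graph.
A = rng.random((n_images, n_images))
affinity = (A + A.T) / 2
np.fill_diagonal(affinity, 1.0)

labels = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)   # cluster id per image, i.e. one semantic cluster per group
```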

Journal ArticleDOI
TL;DR: The article concludes that organizations willing to embrace the “Wiki way” with collaborative, conversational knowledge management systems, may enjoy better than linear knowledge growth while being able to satisfy ad-hoc, distributed knowledge needs.
Abstract: Wikis (from wikiwiki, meaning “fast” in Hawaiian) are a promising new technology that supports “conversational” knowledge creation and sharing. A Wiki is a collaboratively created and iteratively improved set of web pages, together with the software that manages the web pages. Because of their unique way of creating and managing knowledge, Wikis combine the best elements of earlier conversational knowledge management technologies, while avoiding many of their disadvantages. This article introduces Wiki technology, the behavioral and organizational implications of Wiki use, and Wiki applicability as groupware and help system software. The article concludes that organizations willing to embrace the “Wiki way” with collaborative, conversational knowledge management systems, may enjoy better than linear knowledge growth while being able to satisfy ad-hoc, distributed knowledge needs.

Proceedings ArticleDOI
17 May 2004
TL;DR: PANKOW (Pattern-based Annotation through Knowledge on the Web), a method which employs an unsupervised, pattern-based approach to categorize instances with regard to an ontology, is proposed.
Abstract: The success of the Semantic Web depends on the availability of ontologies as well as on the proliferation of web pages annotated with metadata conforming to these ontologies. Thus, a crucial question is where to acquire these metadata from. In this paper we propose PANKOW (Pattern-based Annotation through Knowledge on the Web), a method which employs an unsupervised, pattern-based approach to categorize instances with regard to an ontology. The approach is evaluated against the manual annotations of two human subjects. The approach is implemented in OntoMat, an annotation tool for the Semantic Web, and shows very promising results.
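The following toy sketch illustrates the pattern-based categorization idea: count occurrences of Hearst-style patterns pairing a candidate instance with each ontology concept and pick the best-supported concept. PANKOW obtains such counts from Web search hit counts; here a small in-memory text, a naive pluralization, and invented pattern strings stand in for that.

```python
# Toy pattern-based categorization: evidence counts decide the concept for an instance.
from collections import Counter

PATTERNS = ["{instance} is a {concept}",
            "{concept}s such as {instance}",          # naive pluralization, for illustration
            "{instance} and other {concept}s"]

def categorize(instance, concepts, corpus):
    scores = Counter()
    text = corpus.lower()
    for concept in concepts:
        for p in PATTERNS:
            phrase = p.format(instance=instance, concept=concept).lower()
            scores[concept] += text.count(phrase)
    best, count = scores.most_common(1)[0]
    return best if count > 0 else None

corpus = ("Paris is a city with many museums. Cities such as Paris attract "
          "tourists. Paris and other cities host conferences.")
print(categorize("Paris", ["city", "river", "person"], corpus))  # -> "city"
```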

Proceedings ArticleDOI
01 May 2004
TL;DR: The functionality of MEAD is described, a comprehensive, public domain, open source, multidocument multilingual summarization environment that has been thus far downloaded by more than 500 organizations.
Abstract: This paper describes the functionality of MEAD, a comprehensive, public domain, open source, multidocument multilingual summarization environment that has been thus far downloaded by more than 500 organizations. MEAD has been used in a variety of summarization applications ranging from summarization for mobile devices to Web page summarization within a search engine and to novelty detection.


Proceedings ArticleDOI
22 Mar 2004
TL;DR: The results indicate that the gender of subjects, the viewing order of a web page, and the interaction between page order and site type influence online ocular behavior.
Abstract: The World Wide Web has become a ubiquitous information source and communication channel. With such an extensive user population, it is imperative to understand how web users view different web pages. Based on an eye tracking study of 30 subjects on 22 web pages from 11 popular web sites, this research intends to explore the determinants of ocular behavior on a single web page: whether it is determined by individual differences of the subjects, different types of web sites, the order of web pages being viewed, or the task at hand. The results indicate that the gender of subjects, the viewing order of a web page, and the interaction between page order and site type influence online ocular behavior. Task instruction did not significantly affect web viewing behavior. Scanpath analysis revealed that the complexity of web page design influences the degree of scanpath variation among different subjects on the same web page. The contributions and limitations of this research, and future research directions are discussed.

Patent
19 Oct 2004
TL;DR: In this paper, a system, method and computer program product that combines techniques in the fields of search, data mining, collaborative filtering, user ratings and referral mappings into a system for intelligent web-based help for task or transaction oriented web based systems.
Abstract: A system, method and computer program product that combines techniques in the fields of search, data mining, collaborative filtering, user ratings and referral mappings into a system for intelligent web-based help for task or transaction oriented web based systems. The system makes use of a service oriented architecture based on metadata and web services to locate, categorize and provide relevant context sensitive help, including found help not available when the web based system or application was first developed. As part of the inventive system, there is additionally provided a system for providing an integrated information taxonomy which combines automatically, semi-automatically, and manually generated taxonomies and applies them to help systems. This aspect of the invention is applicable to the fields of online self-help systems for web sites and software applications as well as to customer, supplier and employee help desks.

Proceedings ArticleDOI
17 May 2004
TL;DR: This paper uses a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure, then spatial features and content features are extracted and used to construct a feature vector for each block.
Abstract: Previous work shows that a web page can be partitioned into multiple segments or blocks, and often the importance of those blocks in a page is not equivalent. Also, it has been proven that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. However, no uniform approach and model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view about the importance of blocks in web pages. In this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. We define the block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model to assign importance to different segments in the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
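A hedged sketch of the learning setup, assuming hand-made block features and labels: each block is represented by spatial and content features, and a classifier predicts its importance level. The feature list, labels, and the choice of a random forest are illustrative only; they are not the paper's segmentation output or its learning algorithm.

```python
# Toy block-importance classifier: feature vectors per block -> importance level.
from sklearn.ensemble import RandomForestClassifier

# [x, y, width, height, num_images, num_links, text_length]  (made-up values)
blocks = [
    [0.30, 0.10, 0.60, 0.50, 2, 10, 800],   # large central block
    [0.00, 0.00, 1.00, 0.08, 0,  8,  40],   # top navigation bar
    [0.75, 0.20, 0.25, 0.60, 4, 20, 100],   # advertisement column
    [0.30, 0.65, 0.60, 0.30, 1,  5, 600],   # secondary article block
]
importance = [4, 2, 1, 3]   # higher = more important (toy labels)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(blocks, importance)

new_block = [[0.25, 0.12, 0.65, 0.55, 1, 12, 750]]
print(model.predict(new_block))   # predicted importance level for the new block
```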

Proceedings ArticleDOI
17 Jun 2004
TL;DR: This paper proposes that some spam web pages can be identified through statistical analysis, and examines a variety of properties, including linkage structure, page content, and page evolution, and finds that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.
Abstract: The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
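As a toy illustration of the outlier idea, the sketch below computes z-scores for a few per-page properties and flags pages that deviate strongly from the corpus-wide distribution. The properties, values, and threshold are invented; the paper examines richer properties such as linkage structure, page content, and page evolution.

```python
# Toy outlier detection: flag pages whose property values sit far from the mean.
import statistics

pages = {
    "a.example.com": {"words": 900,   "out_links": 40,   "dash_ratio": 0.01},
    "b.example.com": {"words": 1200,  "out_links": 55,   "dash_ratio": 0.02},
    "c.example.com": {"words": 800,   "out_links": 35,   "dash_ratio": 0.01},
    "spam.example":  {"words": 90000, "out_links": 4000, "dash_ratio": 0.35},
}

def outliers(pages, z_threshold=1.4):
    flagged = set()
    props = next(iter(pages.values())).keys()
    for prop in props:
        values = [p[prop] for p in pages.values()]
        mean, stdev = statistics.mean(values), statistics.pstdev(values)
        if stdev == 0:
            continue
        for url, feats in pages.items():
            if abs(feats[prop] - mean) / stdev > z_threshold:
                flagged.add(url)
    return flagged

print(outliers(pages))   # -> {'spam.example'}
```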


Journal ArticleDOI
TL;DR: The ELODIE archive contains the complete collection of high‐resolution echelle spectra accumulated over the last decade using the ELODIE spectrograph at the Observatoire de Haute‐Provence 1.93 m telescope.
Abstract: The ELODIE archive contains the complete collection of high‐resolution echelle spectra accumulated over the last decade using the ELODIE spectrograph at the Observatoire de Haute‐Provence 1.93 m telescope. This article presents the different data products and the facilities available on the World Wide Web to reprocess these data on‐the‐fly. Users can retrieve the data in FITS format from the archive Web page (http://atlas.obs‐hp.fr/elodie) and apply to them different functions, wavelength resampling and flux calibration in particular.

Journal ArticleDOI
TL;DR: This article applies a latent class modeling approach to segment web shoppers, based on their purchase behavior across several product categories, and then profiles the segments along the twin dimensions of demographics and benefits sought.

Patent
16 Sep 2004
TL;DR: In this article, a basic architecture for managing digital identity information in a network such as the World Wide Web is provided, where a user can organize his or her information into one or more profiles which reflect the nature of different relationships between the user and other entities, and grant or deny each entity access to a given profile.
Abstract: A basic architecture for managing digital identity information in a network such as the World Wide Web is provided. A user of the architecture can organize his or her information into one or more profiles which reflect the nature of different relationships between the user and other entities, and grant or deny each entity access to a given profile. Various enhancements which may be provided through the architecture are also described, including tools for filtering email, controlling access to user web pages, locating other users and making one's own location known, browsing or mailing anonymously, filling in web forms automatically with information already provided once by hand, logging in automatically, securely logging in to multiple sites with a single password and doing so from any machine on the network, and other enhancements.

Proceedings ArticleDOI
17 May 2004
TL;DR: This paper analytically estimates how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results and shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.
Abstract: Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines have on the popularity evolution of Web pages. For example, given that search engines return currently "popular" pages at the top of search results, are we somehow penalizing newly created pages that are not very well known yet? Are popular pages getting even more popular and new pages completely ignored? We first show that this unfortunate trend indeed exists on the Web through an experimental study based on real Web data. We then analytically estimate how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results. Our result shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.
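A toy simulation, not the paper's analytical model, of the dynamic being described: each visit either follows the search engine (choosing among pages in proportion to current popularity) or lands on a random page, and the time until a brand-new page receives its first visit grows sharply as search-driven traffic dominates. The page count, step count, and probabilities below are arbitrary.

```python
# Toy rich-get-richer simulation of new-page discovery under search-driven traffic.
import random

def simulate(n_pages=50, steps=20_000, p_search=0.9, seed=1):
    random.seed(seed)
    visits = [1] * n_pages          # pages 0..n-2 are established
    visits[-1] = 0                  # the newly created page starts unknown
    first_hit = None
    for step in range(steps):
        if random.random() < p_search:
            # search engine: pick pages in proportion to their current popularity
            page = random.choices(range(n_pages), weights=[v + 1e-9 for v in visits])[0]
        else:
            # random surfing: any page, known or not
            page = random.randrange(n_pages)
        visits[page] += 1
        if page == n_pages - 1 and first_hit is None:
            first_hit = step
    return first_hit, visits[-1]

for p in (0.5, 0.9, 0.99):
    print(p, simulate(p_search=p))   # (step of first visit to the new page, its final visits)
```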

Proceedings ArticleDOI
13 Nov 2004
TL;DR: A novel iterative reinforced algorithm is proposed that utilizes user click-through data to improve search performance; it effectively finds "virtual queries" for web pages and overcomes the challenges of noise and incompleteness, sparseness, and the volatility of web pages and queries.
Abstract: The performance of web search engines may often deteriorate due to the diversity and noisy information contained within web pages. User click-through data can be used to introduce more accurate description (metadata) for web pages, and to improve the search performance. However, noise and incompleteness, sparseness, and the volatility of web pages and queries are three major challenges for research work on user click-through log mining. In this paper, we propose a novel iterative reinforced algorithm to utilize the user click-through data to improve search performance. The algorithm fully explores the interrelations between queries and web pages, and effectively finds "virtual queries" for web pages and overcomes the challenges discussed above. Experimental results on a large set of MSN click-through log data show a significant improvement on search performance over the naive query log mining algorithm as well as the baseline search engine.
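A generic sketch of iterative reinforcement over the query-page click graph, assuming a toy click log: query terms are propagated to clicked pages and page terms flow back to co-clicking queries, so each page accumulates "virtual query" terms. The update rule, decay factor, and data are simplifications, not the paper's exact algorithm.

```python
# Toy iterative reinforcement over a query-page click graph to derive "virtual queries".
from collections import defaultdict

clicks = [  # (query, clicked page) pairs from a made-up click-through log
    ("cheap flights", "travel.example/deals"),
    ("airline tickets", "travel.example/deals"),
    ("cheap flights", "flights.example"),
]

def virtual_queries(clicks, iterations=3, decay=0.5):
    query_terms = {q: {t: 1.0 for t in q.split()} for q, _ in clicks}
    page_terms = {}
    for _ in range(iterations):
        page_terms = defaultdict(lambda: defaultdict(float))
        # forward step: query terms flow to the pages clicked for those queries
        for q, p in clicks:
            for t, w in query_terms[q].items():
                page_terms[p][t] += w
        # backward step: accumulated page terms flow back to co-clicking queries
        for q, p in clicks:
            for t, w in page_terms[p].items():
                query_terms[q][t] = max(query_terms[q].get(t, 0.0), decay * w)
    return {p: sorted(ts, key=ts.get, reverse=True)[:3] for p, ts in page_terms.items()}

print(virtual_queries(clicks))   # top "virtual query" terms per page
```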

Journal ArticleDOI
TL;DR: Teaching on the Web involves more than putting together a colorful webpage; by consistently employing principles of effective learning, educators will unlock the full potential of Web-based medical education.
Abstract: OBJECTIVE: Online learning has changed medical education, but many “educational” websites do not employ principles of effective learning. This article will assist readers in developing effective educational websites by integrating principles of active learning with the unique features of the Web. DESIGN: Narrative review. RESULTS: The key steps in developing an effective educational website are: Perform a needs analysis and specify goals and objectives; determine technical resources and needs; evaluate preexisting software and use it if it fully meets your needs; secure commitment from all participants and identify and address potential barriers to implementation; develop content in close coordination with website design (appropriately use multimedia, hyperlinks, and online communication) and follow a timeline; encourage active learning (self-assessment, reflection, self-directed learning, problem-based learning, learner interaction, and feedback); facilitate and plan to encourage use by the learner (make website accessible and user-friendly, provide time for learning, and motivate learners); evaluate learners and course; pilot the website before full implementation; and plan to monitor online communication and maintain the site by resolving technical problems, periodically verifying hyperlinks, and regularly updating content. CONCLUSION: Teaching on the Web involves more than putting together a colorful webpage. By consistently employing principles of effective learning, educators will unlock the full potential of Web-based medical education.

Patent
01 Jul 2004
TL;DR: A web server computer system, as discussed by the authors, includes a virus checker and mechanisms for checking e-mails and their attachments, downloaded files, and web sites for possible viruses, allowing the web server to perform virus checking of different types of information in real time as the information is requested by a web client.
Abstract: A web server computer system includes a virus checker and mechanisms for checking e-mails and their attachments, downloaded files, and web sites for possible viruses. The virus checker allows a web server to perform virus checking of different types of information in real time as the information is requested by a web client. In addition, a web client may also request that the server perform virus checking on a particular drive on the web client. In this case, the web server may receive information from the web client drive, scan the information for viruses, and inform the web client whether any viruses were found. In the alternative, the web server may download a client virus checker to the web client and cause the client virus checker to be run on the web client. The preferred embodiments thus eliminate the need for virus checking software to be installed on each web client.

Journal ArticleDOI
TL;DR: The paper presents a framework, called positive example based learning (PEBL), for Web page classification which eliminates the need for manually collecting negative training examples in preprocessing and applies an algorithm, called mapping-convergence (M-C), to achieve classification accuracy as high as that of a traditional SVM.
Abstract: Web page classification is one of the essential techniques for Web mining because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). In particular, collecting negative training examples requires arduous work and caution to avoid bias. The paper presents a framework, called positive example based learning (PEBL), for Web page classification which eliminates the need for manually collecting negative training examples in preprocessing. The PEBL framework applies an algorithm, called mapping-convergence (M-C), to achieve classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). M-C runs in two stages: the mapping stage and convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of "strong" negative data. Based on the initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) which maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs, and it is almost as accurate as the traditional SVMs.
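A hedged scikit-learn sketch of the two-stage idea, assuming toy bag-of-words vectors: the mapping stage takes unlabeled documents that share no frequent positive-class features as initial strong negatives (a simple stand-in for the paper's weak classifier), and the convergence stage retrains an SVM, each round moving newly predicted negatives into the negative set. Function names and data are invented for illustration.

```python
# Toy sketch of a mapping-convergence style loop for positive-and-unlabeled learning.
import numpy as np
from sklearn.svm import LinearSVC

def mapping_convergence(positive, unlabeled, rounds=5):
    # Mapping stage: features present in the positives; unlabeled docs containing
    # none of them become the initial "strong" negatives (assumes at least one exists).
    pos_features = np.where(positive.sum(axis=0) > 0)[0]
    strength = unlabeled[:, pos_features].sum(axis=1)
    negatives = unlabeled[strength == 0]
    pool = unlabeled[strength > 0]

    # Convergence stage: retrain an SVM each round, growing the negative set with
    # the pool documents the current classifier labels as negative.
    clf = LinearSVC()
    for _ in range(rounds):
        X = np.vstack([positive, negatives])
        y = np.array([1] * len(positive) + [0] * len(negatives))
        clf.fit(X, y)
        if len(pool) == 0:
            break
        pred = clf.predict(pool)
        new_neg, pool = pool[pred == 0], pool[pred == 1]
        if len(new_neg) == 0:
            break
        negatives = np.vstack([negatives, new_neg])
    return clf

# Toy bag-of-words style data: 6 features, positives use only the first three.
positive = np.array([[3, 1, 2, 0, 0, 0], [2, 2, 1, 0, 0, 0], [1, 3, 0, 0, 0, 0]])
unlabeled = np.array([[2, 1, 1, 0, 0, 0], [0, 0, 0, 3, 2, 1],
                      [0, 0, 0, 1, 4, 2], [1, 0, 0, 2, 2, 0]])
model = mapping_convergence(positive, unlabeled)
print(model.predict(unlabeled))   # 1 = predicted member of the positive class
```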