User-aware page classification in a search engine

TL;DR: This paper looks into the use of 46 linguistic features to classify texts according to genres and text types, and employs the same features to train a classifier that decides which possible user need(s) a Web page may satisfy.
Abstract: In this paper we investigate the hypothesis that classification of Web pages according to general user intentions is feasible and useful. As a preliminary study we look into the use of 46 linguistic features to classify texts according to genres and text types; we then employ the same features to train a classifier that decides which possible user need(s) a Web page may satisfy. We also report on experiments for customizing search systems, using the same set of features to train a classifier that helps users discriminate among their specific needs. Finally, we describe some user input that makes us confident of the utility of the approach.
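
As a rough illustration of the kind of closed-list linguistic features the abstract refers to, the sketch below computes the relative frequency of a few word classes in a text. The three English word lists and the names `FEATURE_LISTS` and `extract_features` are hypothetical stand-ins chosen for this example; the paper's actual 46 features are not listed on this page.

```python
import re

# Hypothetical closed lists, for illustration only (not the paper's feature set).
FEATURE_LISTS = {
    "first_person_pronouns": {"i", "we", "me", "us", "my", "our"},
    "modals": {"can", "could", "may", "might", "must", "should"},
    "negation": {"not", "no", "never"},
}


def extract_features(text):
    """Return the relative frequency of each closed-list feature in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        name: sum(token in words for token in tokens) / total
        for name, words in FEATURE_LISTS.items()
    }
```

A vector of such frequencies, one per page, is the sort of input a standard classifier could then be trained on.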

Citations
Proceedings Article
01 Jan 2006
TL;DR: This paper is concerned with the non-trivial relationship between reference to place in natural language (NL) and common GIR assumptions, and the two main ways in which NL texts and GIR meet.
Abstract: Let us define geographical IR (GIR) as the activity whose purpose is to retrieve information in a geographically-aware way. In other words, considering the geographical dimension as special. GIR presupposes two things:
  • the possibility to associate to (possibly retrieve from) the collection geographical information
  • the existence (or the possibility of creation) of semantic repositories that allow geographical reasoning, henceforth called geo-ontologies.
The most common kind of collections for GIR so far are the Web and other document collections, which are mainly textual. This paper is concerned with the non-trivial relationship between reference to place in natural language (NL) and common GIR assumptions. There are two main ways in which NL texts and GIR meet: in the attempt to derive or populate geo-ontologies from text itself, and in the attempt to label Web pages with what is called geo-scopes, deriving these from clues in the pages themselves. We will survey briefly the two, noting in passing that both approaches are bottom-up in the sense that they look at the texts, but the second makes use of a prior information source, a geo-ontology, which is typically top-down (see Geo-Net-PT01 in Table 1).

19 citations

Book Chapter
Luís Costa
21 Sep 2005
TL;DR: This paper describes how the Esfinge question answering system works, presents the results of the official runs in considerable detail, and reports experiments that measure the import of different parts of the system by recording the decrease in performance when the system runs without some of its components/features.
Abstract: Esfinge is a general domain Portuguese question answering system. It tries to take advantage of the steadily growing and constantly updated information freely available on the World Wide Web in its question answering tasks. The system participated last year for the first time in the monolingual QA track. However, the results were compromised by several basic errors, which were corrected shortly after. This year, Esfinge's participation was expected to yield better results and to allow experimentation with a Named Entity Recognition System, as well as a first attempt at a multilingual QA track. This paper describes how the system works, presents the results obtained by the official runs in considerable detail, and reports experiments measuring the import of different parts of the system by recording the decrease in performance when the system is executed without some of its components/features.

10 citations

01 Jan 2005
TL;DR: This paper describes “Yes, user!”, a corpus of webpages classified according to the different types of users' needs they satisfy, together with the process used to build the corpus.
Abstract: This paper describes a corpus of webpages, named “Yes, user!”. These pages were classified in order to satisfy different types of users' needs. We introduce the assumptions on which the corpus is based, show its classification scheme in detail, and describe the process used to build this corpus. We also present the results of a questionnaire inquiring about the general clarity and understanding of our classification and those proposed by other researchers. We describe both the corpus and a metasearch prototype which was built with those classifiers and make it accessible for other researchers to use.

6 citations


Cites methods from "User-aware page classification in a..."

  • ...The first experiments are described in Aires et al. (2004), more advanced ones in Aires et al. (2005) and Aires (forthcoming)....

    [...]

  • ...…is important to emphasize that both corpora were used to train classifiers with a set of shallow parsing features (inspired by Biber’s work (1988)) and lexical general-content words, thus comparing several machine learning techniques, as reported in Aires et al. (2005) and Aires (forthcoming)....

    [...]

Dissertation
21 Sep 2005
TL;DR: This thesis addresses the information overflow users face when dealing with Web search results; its results show that the user needs classification is understood by users and that search can be eased by classifying search results according to user needs.
Abstract: How should one cope with information overflow, when there are too many pages on the Web about almost every subject? This thesis addresses the problem of information overflow users face when dealing with Web search results. To go beyond content, it is proposed to classify pages according to the search goals they serve from a user point of view: downloading a system, learning some subject, and finding news about another are quite different user goals. The hypothesis validated in the present dissertation is that it is both technically feasible and understandable to classify Web pages according to user goal. By using machine learning techniques over linguistically inspired features, automatic classifiers were built to distinguish among user needs. Also, several user studies were conducted to assess the understandability of the concepts at stake and the gain achieved by using the particular classification in the display of the results. In addition, this work also tested personalized binary classifiers about specific subjects, trained on small training corpora supplied by the users themselves. With regard to evaluation, both system evaluation and user-centered evaluation were performed. The results show that (i) the user needs classification is understood by the user, (ii) the use of style markers is a reliable path to be investigated, (iii) training on small Web corpora is able to generate reliable classifiers, and (iv) search can be eased by classifying search results according to user needs.

3 citations


Cites background from "User-aware page classification in a..."

  • ...(Aires et al, 2005c) Aires....

    [...]

  • ...SIGIR, August 2005, Salvador, Brazil, 8 pp. (Aires et al, 2005b) Aires, R.; Santos, D.; Aluísio....

    [...]

  • ...In addition, instructions were also given on each type of need, with examples and counter-examples of texts (Aires et al. 2005b)....

    [...]

  • ...11.3.4 Use of stylistic markers for classifying texts in other languages according to needs. Aires et al (2005a) present an experiment for classifying legal texts in English into texts for laypeople or for experts....

    [...]

  • ...(Aires et al. 2005a) Aires, R.; Aluísio, A.; Santos....

    [...]

01 Jan 2005
TL;DR: This article gives a brief overview of automatic question answering, mentions some of the systems available on the Web and some of the events that seek to measure progress in the field, describes the Portuguese question answering system Esfinge, and ends with considerations on the usefulness of such systems and the state of the art for Portuguese.
Abstract: This article begins by giving a brief overview of the field of automatic question answering. It mentions some of the systems available on the Web and some of the events that seek to measure progress in the field. It then describes Esfinge, an automatic question answering system for Portuguese, and the results it has obtained so far. It ends with some considerations on the usefulness of automatic question answering systems and the state of the art in this field for Portuguese.

2 citations


Cites background from "User-aware page classification in a..."

  • ...This list was created manually, but in future experiments more sophisticated techniques may be used to classify Web pages [Aires et al, 2005]....

    [...]

  • ...It ends with some considerations on the usefulness of automatic question answering systems and the state of the art in this field for Portuguese....

    [...]

References
Proceedings Article
08 Feb 1999
TL;DR: An edited volume on support vector learning, with contributed chapters on theory, implementations, applications, and extensions of the algorithm.
Abstract: Introduction to support vector learning roadmap.
Part 1 Theory:
  • three remarks on the support vector method of function estimation, Vladimir Vapnik
  • generalization performance of support vector machines and other pattern classifiers, Peter Bartlett and John Shawe-Taylor
  • Bayesian voting schemes and large margin classifiers, Nello Cristianini and John Shawe-Taylor
  • support vector machines, reproducing kernel Hilbert spaces, and randomized GACV, Grace Wahba
  • geometry and invariance in kernel based methods, Christopher J.C. Burges
  • on the annealed VC entropy for margin classifiers - a statistical mechanics study, Manfred Opper
  • entropy numbers, operators and support vector kernels, Robert C. Williamson et al.
Part 2 Implementations:
  • solving the quadratic programming problem arising in support vector classification, Linda Kaufman
  • making large-scale support vector machine learning practical, Thorsten Joachims
  • fast training of support vector machines using sequential minimal optimization, John C. Platt
Part 3 Applications:
  • support vector machines for dynamic reconstruction of a chaotic system, Davide Mattera and Simon Haykin
  • using support vector machines for time series prediction, Klaus-Robert Muller et al.
  • pairwise classification and support vector machines, Ulrich Kressel
Part 4 Extensions of the algorithm:
  • reducing the run-time complexity in support vector machines, Edgar E. Osuna and Federico Girosi
  • support vector regression with ANOVA decomposition kernels, Mark O. Stitson et al.
  • support vector density estimation, Jason Weston et al.
  • combining support vector and mathematical programming methods for classification, Bernhard Scholkopf et al.

5,506 citations

01 Jan 1999
TL;DR: SMO breaks the large quadratic programming (QP) problem of SVM training into a series of smallest possible QP problems, which avoids using a time-consuming numerical QP optimization as an inner loop; because its computation time is dominated by SVM evaluation, SMO is fastest for linear SVMs and sparse data sets.

5,350 citations


"User-aware page classification in a..." refers methods in this paper

  • ...SMO implements Platt’s [13] sequential minimal optimisation algorithm for training a support vector classifier using scaled polynomial kernels, transforming the output of SVM into probabilities by applying a standard sigmoid function that is not fitted to the data....

    [...]
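
The excerpt above mentions a polynomial kernel and an unfitted sigmoid applied to the SVM output. A minimal sketch of those two pieces follows. It assumes a plain polynomial kernel `(x · z) ** degree` with no scaling and a `+ bias` sign convention, neither of which is stated in the excerpt. The function names are hypothetical and this is not the implementation the paper used.

```python
import math


def poly_kernel(x, z, degree=2):
    """Plain polynomial kernel (x . z) ** degree; the scaling step is omitted (assumption)."""
    return sum(a * b for a, b in zip(x, z)) ** degree


def svm_output(x, support_vectors, labels, alphas, bias, degree=2):
    """Raw SVM decision value: sum_i alpha_i * y_i * K(sv_i, x) + bias."""
    return sum(
        alpha * y * poly_kernel(sv, x, degree)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + bias


def to_probability(decision_value):
    """Standard logistic sigmoid with fixed parameters, i.e. not fitted to the data."""
    return 1.0 / (1.0 + math.exp(-decision_value))
```

Because the sigmoid here has no fitted slope or offset, a decision value of zero always maps to 0.5, and the resulting numbers are better read as monotone scores than as calibrated probabilities.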

Book
John Platt
08 Feb 1999
TL;DR: In this article, the authors proposed a new algorithm for training Support Vector Machines (SVM) called SMO (Sequential Minimal Optimization), which breaks this large QP problem into a series of smallest possible QP problems.
Abstract: This chapter describes a new algorithm for training Support Vector Machines: Sequential Minimal Optimization, or SMO. Training a Support Vector Machine (SVM) requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because large matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while a standard projected conjugate gradient (PCG) chunking algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. For the MNIST database, SMO is as fast as PCG chunking; while for the UCI Adult database and linear SVMs, SMO can be more than 1000 times faster than the PCG chunking algorithm.

5,019 citations

Book
01 Jan 1988
TL;DR: This study models textual dimensions and relations in speech and writing, covering situations and functions, previous linguistic research on speech and writing, and the application of the model.
Abstract:
Part I. Background Concepts and Issues:
  1. Introduction: textual dimensions and relations
  2. Situations and functions
  3. Previous linguistic research on speech and writing
Part II. Methodology:
  4. Methodological overview of the study
  5. Statistical analysis
Part III. Dimensions and Relations in English:
  6. Textual dimensions in speech and writing
  7. Textual relations in speech and writing
  8. Extending the description: variations within genres
  9. Afterword: applying the model
Appendices.

2,891 citations


"User-aware page classification in a..." refers background in this paper

  • ...For comparison, note that Biber's 481 texts amounted to a corpus with approximately 960,000 words, which is larger in number of words because Web texts tend to be smaller....

    [...]

  • ...These features, which are mainly closed lists, were inspired by those proposed by Biber [10] and Karlgren [8], but checked in grammars and textbooks for Portuguese....

    [...]

  • ...[10] Biber, D.: Variation across speech and writing....

    [...]

  • ...Karlgren concluded that most users used the interface as intended and many searched for documents in the genres the results could be expected to show up in. Biber [10] has studied English text variation using several variables, and found that texts vary along five dimensions....

    [...]

  • ...We have also trained a classification scheme for texts in English using: (i) a corpus with 200 texts extracted from www.findlaw.com; (ii) the algorithms J48, SMO and LMT and (iii) 52 features taken from Biber and Karlgren [1, 2] which are the original features for English that were adapted for Portuguese (Figure 1) plus 3 types of modals, 2 of negation, nominalizations, besides reflexive and possessive pronouns....

    [...]

Journal Article
Andrei Z. Broder
01 Sep 2002
TL;DR: This taxonomy of web searches is explored and how global search engines evolved to deal with web-specific needs is discussed.
Abstract: Classic IR (information retrieval) is inherently predicated on users searching for information, the so-called "information need". But the need behind a web search is often not informational -- it might be navigational (give me the url of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map). We explore this taxonomy of web searches and discuss how global search engines evolved to deal with web-specific needs.

2,094 citations


"User-aware page classification in a..." refers methods in this paper

  • ...[4] Broder, A. "A Taxonomy of Web Search", SIGIR Forum 36 (2), Fall 2002, p.3-10....

    [...]

  • ...Inspired by Broder [4] and previous work in detecting user’s goals in Web search [5], we devised a user need typology from a qualitative analysis of the TodoBr logs....

    [...]