Journal Article

Cybergenre: automatic identification of home pages on the web

01 Dec 2004-Journal of Web Engineering (Rinton Press, Incorporated)-Vol. 3, Iss: 3, pp 236-251
TL;DR: Results indicate that the classifier is able to distinguish home pages from non-home pages and, within the home page genre, to distinguish personal from corporate home pages. Organization home pages, however, were more difficult to distinguish from personal and corporate home pages.
Abstract: The research reported in this paper is part of a larger project on the automatic classification of web pages by their genres. The long term goal is the incorporation of web page genre into the search process to improve the quality of the search results. In this phase, a neural net classifier was trained to distinguish home pages from non-home pages and to classify those home pages as personal home page, corporate home page or organization home page. In order to evaluate the importance of the functionality attribute of cybergenre in such classification, the web pages were characterized by the cybergenre attributes of 〈content, form, functionality〉 and the resulting classifications compared to classifications in which the web pages were characterized by the genre attributes of 〈content, form〉. Results indicate that the classifier is able to distinguish home pages from non-home pages and within the home page genre it is able to distinguish personal from corporate home pages. Organization home pages, however, were more difficult to distinguish from personal and corporate home pages. A significant improvement was found in identifying personal and corporate home pages when the functionality attribute was included.


Citations
Book
01 Jan 1996

1,170 citations

Proceedings ArticleDOI
03 Jan 2005
TL;DR: In this article, a neural network classifier was trained to distinguish home pages from non-home pages and to classify those home pages as personal home page, corporate home page or organization home page.
Abstract: The research reported in this paper is the first phase of a larger project on the automatic classification of Web pages by their genres. The long term goal is the incorporation of web page genre into the search process to improve the quality of the search results. In this phase, a neural net classifier was trained to distinguish home pages from non-home pages and to classify those home pages as personal home page, corporate home page or organization home page. Results indicate that the classifier is able to distinguish home pages from non-home pages and within the home page genre it is able to distinguish personal from corporate home pages. Organization home pages, however, were more difficult to distinguish from personal and corporate home pages.

112 citations

Proceedings ArticleDOI
03 Jan 2007
TL;DR: The aim of this paper is to show that Web pages need a zero-to-multi-genre classification scheme, i.e. a scheme that allows zero genre or multi-genre classification, in addition to the traditional single-genre classification.
Abstract: When dealing with genres of Web pages, there are two important aspects to be taken into account. On the one hand, the Web is fluid, unstable and fast-paced. On the other hand, genres on the Web are instantiated in Web pages, which are a complex type of document, more composite and unpredictable than paper documents. These two aspects are interwoven and often result in classification hurdles. In this paper, the author suggests analyzing these classification problems in terms of two broad textual phenomena: genre hybridism and individualization. The identification of these two phenomena helps pinpoint the range of flexibility that an automatic classification system should have. More precisely, genre hybridism accounts for multi-genre variation within the individual Web page, while individualization refers to absence of any recognized genre in a Web page. In a few words, the aim of this paper is to show that Web pages need a zero-to-multi-genre classification scheme, i.e. a scheme that allows zero genre or multi-genre classification, in addition to the traditional single-genre classification.
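One way to picture the zero-to-multi-genre idea described in this abstract is a per-genre threshold applied to classifier scores. The sketch below is only an illustration under stated assumptions, not the author's method: the function names, the genre labels, the scores, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List


def assign_genres(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every genre whose score reaches the threshold, best first.

    An empty list stands for a page with no recognized genre
    (individualization); two or more labels stand for a hybrid page.
    The threshold value is an assumption made for illustration.
    """
    selected = [genre for genre, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda genre: scores[genre], reverse=True)


def assign_single_genre(scores: Dict[str, float]) -> str:
    """Traditional single-label scheme: always pick the top-scoring genre."""
    return max(scores, key=lambda genre: scores[genre])


# Hypothetical per-genre scores for two pages.
hybrid_page = {"home page": 0.8, "blog": 0.7, "faq": 0.1}
individualized_page = {"home page": 0.2, "blog": 0.3, "faq": 0.1}

print(assign_genres(hybrid_page))                # two labels
print(assign_genres(individualized_page))        # no label
print(assign_single_genre(individualized_page))  # forced to choose one
```

With these made-up scores, the thresholded scheme gives the hybrid page two labels and the individualized page none, whereas the single-label scheme is forced to pick a genre even when no score is convincing.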

84 citations


Cites background from "Cybergenre: automatic identificatio..."

  • ...This is clear both from automatic approaches (e.g. Shepherd et al., 2004 [34]; Meyer zu Eissen and Stein, 2004 [24]; Lim et al., 2005 [17]), and from qualitative analyses (e.g. Shepherd and Watters, 1998 [36]; Askehave and Nielsen, 2005 [1])....

    [...]

  • ...This assumption is clear in their practical experiments (Shepherd et al., 2004 [34] and Kennedy and Shepherd, 2005 [20]), where they carry out single-label classification of HOME PAGES....

    [...]

Dissertation
01 Jun 2006
TL;DR: The thesis investigates personality and gender differences in online personal diaries, or blogs, and finds that women write more in their blogs than men and that authors considered high in Agreeableness pay more attention to differences between their extra-linguistic context and that of their audience.
Abstract: This thesis describes a linguistic investigation of individual differences in online personal diaries, or ‘blogs.’ There is substantial evidence of gender differences in language (Lakoff, 1975), and to a lesser extent linguistic projection of personality (Pennebaker & King, 1999). Recent work has investigated these latter differences in the area of computer-mediated communication (CMC), specifically e-mail (Gill, 2004). This thesis employs a number of analytic techniques, both top-down (dictionary-based) and bottom-up (data-driven), in order to explore personality and gender differences in the language of blogs. A corpus was constructed by asking authors to submit a month of text and complete a sociobiographic questionnaire. The corpus consists of over 400,000 words and five-factor personality data (Buchanan, 2001) for 71 subjects. The thesis begins by framing blogs in the context of other genres, both CMC and traditional, in order to show both the distinctiveness and representativeness of the genre. Top-down content analysis techniques are then employed to investigate the relationship between personality and linguistic features. A number of features correlate with each trait, but upon regression, very little variance is explained. Bottom-up techniques are more successful. The corpus was stratified into high, low and neutral personality groups to identify distinctive collocations for each. Returning to the raw personality scores, it becomes clear that even a small amount of n-gram context helps account for much more variance in personality. A measure of contextuality (Heylighen & Dewaele, 2002) shows that authors considered high in Agreeableness pay more attention to differences between their extra-linguistic context and that of their audience. Attention turns to gender, where similar methods are applied to investigate gender differences in language. Many previous findings are confirmed in the blog corpus. In addition, women are found to write more in their blogs than men. More generally, using the British National Corpus, it is shown that women are more contextual, except in the least contextual of genres (academic writing) where there is no difference. The study concludes by confirming that both gender and personality are projected by language in blogs; furthermore, approaches which take the context of language features into account can be used to detect more variation than those which do not.

82 citations

Journal ArticleDOI
TL;DR: The proposed genre tree kernel method provides significantly better detection capabilities than state-of-the-art anti-phishing methods and underscores the importance of considering intention/purpose as a critical dimension for automated credibility assessment.
Abstract: Phishing websites continue to successfully exploit user vulnerabilities in household and enterprise settings. Existing anti-phishing tools lack the accuracy and generalizability needed to protect Internet users and organizations from the myriad of attacks encountered daily. Consequently, users often disregard these tools' warnings. In this study, using a design science approach, we propose a novel method for detecting phishing websites. By adopting a genre theoretic perspective, the proposed genre tree kernel method utilizes fraud cues that are associated with differences in purpose between legitimate and phishing websites, manifested through genre composition and design structure, resulting in enhanced anti-phishing capabilities. To evaluate the genre tree kernel method, a series of experiments were conducted on a testbed encompassing thousands of legitimate and phishing websites. The results revealed that the proposed method provided significantly better detection capabilities than state-of-the-art anti-phishing methods. An additional experiment demonstrated the effectiveness of the genre tree kernel technique in user settings; users utilizing the method were able to better identify and avoid phishing websites, and were consequently less likely to transact with them. Given the extensive monetary and social ramifications associated with phishing, the results have important implications for future anti-phishing strategies. More broadly, the results underscore the importance of considering intention/purpose as a critical dimension for automated credibility assessment: focusing not only on the "what" but rather on operationalizing the "why" into salient detection cues.

70 citations

References
Journal ArticleDOI
TL;DR: In this article, the authors propose the notion of genres of organizational communication as a concept useful for studying communication as embedded in social process rather than as the result of isolated rational actions.
Abstract: Drawing on rhetorical theory and structuration, this article proposes genres of organizational communication as a concept useful for studying communication as embedded in social process rather than as the result of isolated rational actions. Genres (e.g. the memo, the proposal, and the meeting) are typified communicative actions characterized by similar substance and form and taken in response to recurrent situations. These genres evolve over time in reciprocal interaction between institutionalized practices and individual human actions. They are distinct from communication media, though media may play a role in genre form, and the introduction of new media may occasion genre evolution. After the genre concept is developed, the article shows how it addresses existing limitations in research on media, demonstrates its usefulness in an extended historical example, and draws implications for future research.

1,365 citations


"Cybergenre: automatic identificatio..." refers background in this paper

  • ...As Yates and Orlikowski [18] have shown in their study of the evolution of the business letter of the late 19th century into the electronic mail of today, genres evolve over time in response to institutional changes and social pressures....

    [...]

  • ...observable physical and linguistic features ...,” [18]....

    [...]

Proceedings ArticleDOI
07 Jul 1997
TL;DR: A theory of genres as bundles of facets, which correlate with various surface cues, is proposed, and it is argued that genre detection based on surface cues is as successful as detection based on deeper structural properties.
Abstract: As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.

425 citations

Proceedings ArticleDOI
05 Aug 1994
TL;DR: In this article, a simple method for categorizing texts into pre-determined text genre categories using the statistical standard technique of discriminant analysis is demonstrated with application to the Brown corpus.
Abstract: A simple method for categorizing texts into pre-determined text genre categories using the statistical standard technique of discriminant analysis is demonstrated with application to the Brown corpus. Discriminant analysis makes it possible to use a large number of parameters that may be specific for a certain corpus or information stream, and combine them into a small number of functions, with the parameters weighted on the basis of how useful they are for discriminating text genres. An application to information retrieval is discussed.

324 citations

Journal ArticleDOI
TL;DR: It is suggested that Web-site designers consider the genres that are appropriate for their situation and attempt to reproduce or adapt familiar genres.
Abstract: The World Wide Web is growing quickly and being applied to many new types of communications. As a basis for studying organizational communications, Yates and Orlikowski (1992; Orlikowski & Yates, 1994) proposed using genres. They defined genres as "typified communicative actions characterized by similar substance and form and taken in response to recurrent situations" (Yates & Orlikowski, 1992, p. 299). They further suggested that communications in a new medium would show both reproduction and adaptation of existing communicative genres as well as the emergence of new genres. We studied these phenomena on the World Wide Web by examining 1000 randomly selected Web pages and categorizing the type of genre represented. Although many pages recreated genres familiar from traditional media, we also saw examples of genres being adapted to take advantage of the linking and interactivity of the new medium and novel genres emerging to fit the unique communicative needs of the audience. We suggest that Web-site design

313 citations


"Cybergenre: automatic identificatio..." refers methods in this paper

  • ...One method that has been suggested is to classify web pages by their type of genre and use this information to focus a search more narrowly or to rank search results [8, 13]....

    [...]

Posted Content
TL;DR: This article proposes a theory of genres as bundles of facets, which correlate with various surface cues, and argues that genre detection based on surface cues is as successful as detection based on deeper structural properties.
Abstract: As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.

239 citations