Open Access

A Corpus-Based Approach in the Domain of Academia by Example of the Academic's Personal Homepage

Georg Rehm
TLDR
The authors argue for a systematic analysis of academic Web pages with regard to a special class of digital genres, Web genres. They introduce the notion of Web genre type, which constitutes the basic framework for a certain Web genre, and the notions of compulsory and optional Web genre modules.
Abstract
We argue for a systematic analysis of one particular, well-structured domain (academic Web pages) with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3 000 000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notion of Web genre type, which constitutes the basic framework for a certain Web genre, and the notions of compulsory and optional Web genre modules. These modules act as building blocks that together make up the structure characterised by the Web genre type and, furthermore, operate as modifiers for the default assignment involved. The analysis of a 200-document sample illustrates our notion of Web genre hierarchy, into which Web genre types and modules are embedded. The analysis of four different documents of the Web genre Academic’s Personal Homepage illustrates not only our approach but also our long-term goal of automatically extracting the contents of Web genre modules in order to build structured XML documents from groups of unstructured HTML documents.
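
The relationship between a Web genre type, its compulsory and optional modules, and the structured XML output described in the abstract might be sketched as follows. This is only an illustrative sketch, not the authors' system: the class `WebGenreType`, the function `to_xml`, the XML element names, and the module names (`name`, `contact`, `publications`, `teaching`) are all assumptions introduced here, and the extraction of module contents from HTML is assumed to have happened already.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class WebGenreType:
    """Hypothetical container: a genre type lists its compulsory and optional modules."""
    name: str
    compulsory: tuple
    optional: tuple = ()


def to_xml(genre, extracted):
    """Build one XML record from module contents already extracted from an HTML page.

    `extracted` maps module names to text. A missing compulsory module is
    treated as an error; missing optional modules are simply skipped.
    """
    missing = [m for m in genre.compulsory if m not in extracted]
    if missing:
        raise ValueError(f"missing compulsory modules: {missing}")
    root = ET.Element("document", {"genre": genre.name})
    for module in genre.compulsory + genre.optional:
        if module in extracted:
            kind = "compulsory" if module in genre.compulsory else "optional"
            child = ET.SubElement(root, "module", {"name": module, "kind": kind})
            child.text = extracted[module]
    return ET.tostring(root, encoding="unicode")


# Illustrative module names only; they are not taken from the paper.
homepage = WebGenreType(
    name="academics-personal-homepage",
    compulsory=("name", "contact"),
    optional=("publications", "teaching"),
)

record = to_xml(
    homepage,
    {"name": "Jane Doe", "contact": "jane@example.org", "teaching": "Syntax I"},
)
print(record)
```

Treating a missing compulsory module as an error is one possible design choice; a real classifier might instead lower its confidence that the page belongs to the genre.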

Citations
Journal ArticleDOI

Motivations for academic web site interlinking: evidence for the Web as a novel source of information on informal scholarly communication

TL;DR: It is concluded that academic web link metrics will be dominated by a range of informal types of scholarly communication, which provides an exciting new window through which to investigate a facet of a previously obscured type of communication activity.
Journal ArticleDOI

Conceptualizing documentation on the web: an evaluation of different heuristic-based models for counting links between university web sites

TL;DR: It was discovered that the domain and directory models were able to successfully reduce the impact of anomalous linking behavior between pairs of Web sites, with the latter being the method of choice.
Book ChapterDOI

Disentangling from babylonian confusion – unsupervised language identification

TL;DR: Evaluation on 7-lingual and bilingual corpora shows that the quality of classification is comparable to that of supervised approaches and that the method works almost error-free from 100 sentences per language onwards.
Proceedings ArticleDOI

Automatic Identification of Home Pages on the Web

TL;DR: In this article, a neural network classifier was trained to distinguish home pages from non-home pages and to classify those home pages as personal home page, corporate home page or organization home page.
References
Journal Article

Extensible Markup Language (XML).

TL;DR: XML is an extremely simple dialect of SGML, completely described in this document, whose goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.
Book

Genre Analysis: English in Academic and Research Settings

TL;DR: The author provides a survey of approaches to various genres of language and considers these in relation to communication and task-based language learning, with examples of different genres and how they can be made accessible through genre analysis.
Journal ArticleDOI

Genre as social action

TL;DR: In this paper, a conception of genre based on conventionalized social motives which are found in recurrent situation-types is proposed, and the thesis is that genre must be conceived in terms of rhetorical action rather than substance or form.