A Gray code based ordering for documents on shelves: Classification for browsing and retrieval

Abstract
A document classifier places documents together in a linear arrangement for browsing or high-speed access by human or computerized information retrieval systems. Requirements for document classification and browsing systems are developed from similarity measures, distance measures, and the notion of subject aboutness. A requirement that documents be arranged in decreasing order of similarity as the distance from a given document increases can often not be met. Based on these requirements, information-theoretic considerations, and the Gray code, a classification system is proposed that can classify documents without human intervention. It provides a theoretical justification for individual classification numbers going from broad to narrow topics when moving from left to right in the classification number. A general measure of classifier performance is developed, and used to evaluate experimental results comparing the distance between subject headings assigned to documents given classifications from the proposed system and the Library of Congress Classification (LCC) system. Browsing in libraries, hypertext, and databases is usually considered to be the domain of subject searches. The proposed system can incorporate both classification by subject and by other forms of bibliographic information, allowing for the generalization of browsing to include all features of an information carrying unit. © 1992 John Wiley & Sons, Inc.

A Gray Code Based Ordering
for Documents on Shelves:
Classification for Browsing and Retrieval
Journal of the American Society for Information
Science
43(4) 1992, 312–322.
Robert M. Losee
University of North Carolina
Chapel Hill, NC 27599-3360 U.S.A.
losee@ils.unc.edu
June 28, 1998

1 Introduction
Mentioning document classification may bring to a layperson’s mind the Dewey
Decimal system, while to the library professional, it usually provokes thoughts
of the problems associated with bringing together similar materials, the costs of
cataloging, or the difficulty of justifying one classification system over another.
Svenonius (1981) has suggested that “the main use for classification, at least in the
United States, has been to facilitate browsing.” While classification has other roles
in the library, from the mundane to playing “a direct role in the creation of orig-
inal knowledge” [6], classification as a tool to support browsing activities in both
libraries and database systems will be the focus of this research and discussion.
Although numerous library classification systems have been developed, most have
been developed from philosophical and taxonomical considerations [12, 28]; few
have been based on more scientific criteria. A set of precise requirements for a
document classification system is provided here, as well as a classification system
consistent with these requirements. This classification system and an associated
measure of classification performance have been developed based on concepts used
by information professionals who use document classification systems.
Many different quantitative methods are available for determining similarities
between documents and for optimal ordering for documents. Methods similar to
the work here include studies of coding techniques that place similar documents
near each other, given a query [10], and hashing techniques that maintain order
[13]. An information theoretic method is developed here which we believe is easier
to interpret and consistent with the needs of document classification systems. Other
similarity techniques and measures could be used and need to be explored, e.g., the
expected mutual information measure. The method developed here for evaluating
a classification system is based on information naturalness concerns. Performance
measures using statistical correlation could be satisfactorily used, measuring, for
example, the correlation between the ordering provided by the classification system
and the ordering provided by a perfectly ordered collection of documents.
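A correlation-based measure of this kind can be sketched as a Spearman rank correlation between two orderings of the same documents. This is only an illustration under stated assumptions: the function name `spearman_rho`, the sample orderings, and the choice of Spearman's coefficient are not taken from the paper, and how a "perfectly ordered" reference collection would be obtained is left open.

```python
def spearman_rho(shelf_order, reference_order):
    """Spearman rank correlation between two orderings of the same items.

    Assumes every item appears exactly once in each ordering (no ties).
    """
    if sorted(shelf_order) != sorted(reference_order):
        raise ValueError("both orderings must contain the same items")
    n = len(shelf_order)
    if n < 2:
        raise ValueError("need at least two items")
    # Position of each document in the reference ordering.
    reference_rank = {doc: rank for rank, doc in enumerate(reference_order)}
    # Sum of squared rank differences between the two orderings.
    d_squared = sum(
        (rank - reference_rank[doc]) ** 2 for rank, doc in enumerate(shelf_order)
    )
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical orderings: the classifier swaps two neighbouring documents.
reference = ["a", "b", "c", "d", "e"]
shelf = ["a", "c", "b", "d", "e"]
print(spearman_rho(shelf, reference))
```

A value near 1 would indicate that the classifier's shelf order closely follows the reference order, while a value near -1 would indicate a reversed order.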
Aristotle suggested that a science has as its base a series of predicates that, in
effect, define the science. For example, the science of physics uses fundamental
predicates such as position, velocity, and inertia on which it builds its theories.
If a science of document classification is to be developed, it will probably have
at its base predicates like the shelf-distance between documents and the subject-
similarity of documents.
Developing a document classification system results in a rather unique set of
problems. The primary function of a document or library classification system is to
assign a value to each information carrying unit so that when the items are sorted
by this value, like information carrying units are grouped together. Co-location of
documents allows users to benefit from browsing through nearby similar materials
once they have located a single potentially relevant item, increasing the precision of
the browsing. To provide for browsing capabilities, a library classification system
should meet several requirements. Broad requirements for a classification system
have been suggested by Wynar and Taylor [28]. We believe that a classification
system supporting browsing should
• assign classification values objectively (objectivity requirement),
• provide a single classification system capable of classifying all possible
  documents (inclusion requirement),
• provide a linear structure (linearity requirement),
• assign values to documents so that when one moves away from any document
  in either direction on a conceptual shelf or in a database, the documents
  become increasingly dissimilar (increasing distance-dissimilarity
  requirement).
A classification system meeting these requirements will fulfill the needs of a li-
brarian or database manager wishing to place like items together for browsing. In a
library, for example, similar documents may be consecutively retrieved by retriev-
ing books as one moves down a shelf.
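The increasing distance-dissimilarity requirement can be checked mechanically for a small shelf. The sketch below is an illustration, not the paper's procedure: it assumes each document is represented by a set of subject headings and that dissimilarity is the number of headings held by exactly one of the two documents. The names `heading_dissimilarity` and `meets_requirement` and the sample headings are hypothetical.

```python
from itertools import permutations


def heading_dissimilarity(a, b):
    """Assumed measure: number of headings held by exactly one document."""
    return len(a ^ b)


def meets_requirement(shelf, dissimilarity=heading_dissimilarity):
    """True if, from every document, dissimilarity never decreases when
    moving outward along the shelf in either direction."""
    for i, anchor in enumerate(shelf):
        # Dissimilarities to the anchor, ordered from nearest to farthest.
        left = [dissimilarity(anchor, doc) for doc in reversed(shelf[:i])]
        right = [dissimilarity(anchor, doc) for doc in shelf[i + 1:]]
        for side in (left, right):
            if any(later < earlier for earlier, later in zip(side, side[1:])):
                return False
    return True


# A nested set of headings can be shelved so that the requirement holds.
nested = [{"art"}, {"art", "italy"}, {"art", "italy", "fresco"}]
print(meets_requirement(nested))

# For these four documents no shelf order satisfies the requirement.
four = [set(), {"art"}, {"italy"}, {"art", "italy"}]
print(any(meets_requirement(order) for order in permutations(four)))
```

The second example illustrates the point made in the abstract that the requirement often cannot be met: with two independent headings, every possible arrangement violates it for at least one document.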
Other characteristics of a classification system, while desirable, are not
mandatory. A classification system might
• be easily (quickly) searched,
• be easy for librarians to use when classifying documents,
• allow for classification by computer,
• be consistent with an existing, popular system,
• provide explanatory power, or
• be readily adaptable to incorporate changes in the materials classified.
These requirements and desirable characteristics, when combined with the pred-
icates of a science of classification, will be used as the basis for the proposed
classification system. A guide to the automatic classification literature is provided
by [18].
A document or book has a number of features which may characterize the topic
of the document. Subject indicating features may include library subject headings,
while other features may indirectly indicate the document’s subject, such as words
that occur in a document’s title, the language, date and place of publication, and
characteristics of the author that might provide information about the subject of the
author’s work.
Features are assumed here to be binary. A feature is thus present or absent, $1$ or
$0$, depending on the degree to which the document is about the feature. Below, the
expression “the probability of a feature” refers to the probability that the feature
has the value $1$. Features are assumed to be treatable as statistically independent.
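As a minimal sketch of this representation, the following turns a few documents into binary feature vectors and estimates the probability of each feature. The sample documents, the names `to_vector` and `feature_probabilities`, and the use of relative frequency as the probability estimate are assumptions made for illustration.

```python
# Hypothetical documents, each described by the set of features present in it.
documents = {
    "doc1": {"history", "europe"},
    "doc2": {"history"},
    "doc3": {"europe", "art"},
    "doc4": {"history", "art"},
}

# Fix one feature order so that every document gets a vector of the same shape.
features = sorted(set().union(*documents.values()))


def to_vector(present, features):
    """Binary feature vector: 1 if the feature is present, 0 if it is absent."""
    return tuple(int(name in present) for name in features)


def feature_probabilities(vectors):
    """Assumed estimate: fraction of documents in which each feature is 1."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


vectors = {doc_id: to_vector(present, features) for doc_id, present in documents.items()}
probabilities = dict(zip(features, feature_probabilities(list(vectors.values()))))
print(vectors["doc1"])
print(probabilities)
```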
The primary purpose of a classification system is to enable information searchers
to browse through documents. Once an initial document has been located on a
shelf in a library or in a window in a hypertext system, searchers often choose to
examine related materials [2, 5, 8, 22, 23]. Classification provides this clustering
of similar materials [9, 17]. Classification and browsing systems typically group
items by subject similarity, but clustering procedures may also take into consid-
eration bibliographic features not normally thought of as subject related, such as
type of binding. Although the classification system discussed below is capable of
incorporating bibliographic features used in known item searches and is not limited
to conventional subject clustering, further research will be necessary to determine
how useful classification by other than subject-features would be to library patrons
and database searchers.
Below, a Gray code based classification system is used to group a set of docu-
ments together. The classification procedure groups documents based on the doc-
uments’ subject-bearing features. A measure is proposed which can be used to
evaluate the performance of a classification system. Experiments suggest that the
Gray code based classification method places documents closer together than does
the Library of Congress Classification system.
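The general idea of ordering documents by a Gray code can be sketched with the standard binary reflected Gray code, in which consecutive codewords differ in exactly one bit. This is a hedged illustration only: it treats each document's binary feature vector as a codeword and sorts documents by that codeword's position in the Gray sequence. The function name `gray_rank`, the sample documents, and the arbitrary left-to-right feature order are assumptions, and the sketch does not reproduce the paper's full procedure.

```python
def gray_rank(bits):
    """Position of a codeword in the binary reflected Gray code sequence.

    `bits` is a sequence of 0/1 values, leftmost (most significant) bit first.
    """
    rank = 0
    running = 0
    for bit in bits:
        running ^= bit                 # prefix XOR converts Gray code to plain binary
        rank = (rank << 1) | running
    return rank


def hamming(a, b):
    """Number of feature positions in which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical documents with three binary features each.
documents = {
    "A": (0, 1, 1),
    "B": (1, 0, 0),
    "C": (0, 0, 1),
    "D": (1, 1, 0),
    "E": (0, 1, 0),
}

shelf = sorted(documents, key=lambda doc_id: gray_rank(documents[doc_id]))
print(shelf)

# Dissimilarity between each pair of neighbours on the resulting shelf.
neighbour_gaps = [
    hamming(documents[left], documents[right]) for left, right in zip(shelf, shelf[1:])
]
print(neighbour_gaps)
```

Sorting by plain binary value instead would place `(0, 1, 1)` next to `(1, 0, 0)`, which differ in every feature; the Gray ordering avoids jumps of that kind between consecutive codewords.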
2 Shelf-Distance
For notational simplification, a classification system is understood as an ordering
of a series of documents or text fragments on a single conceptual shelf holding
documents. A document or book is denoted as $d_i$, with the subscript indicating the
position of the document on the shelf. The documents are ordered
$$d_0, d_1, \ldots, d_{i-1}, d_i, d_{i+1}, \ldots$$
where the subscript indicates the position of the document relative to the leftmost
document on the shelf, with position $i$ representing an arbitrarily chosen position
under consideration. $d_0$ is a hypothetical document with no subject content of any
sort at the left end of the shelf.
Distance measures in most common geometric spaces, such as a conceptual
document shelf, must meet several criteria, including:
$$\delta(d_i, d_i) = 0,$$
$$\delta(d_i, d_j) = \delta(d_j, d_i),$$
$$\delta(d_i, d_k) \le \delta(d_i, d_j) + \delta(d_j, d_k),$$
where the distance between $d_i$ and $d_j$ is denoted as $\delta(d_i, d_j)$ for all $i$ and $j$.
It is always the case that $\delta(d_i, d_j) < \delta(d_i, d_k)$ for all $i$, $j$, and $k$ when
$i < j < k$ or when $i > j > k$. Unless indicated otherwise, it is assumed that
the distance function, as well as other functions, are only defined over the set of
documents to the right of $d_i$ and over the set of documents to the left of $d_i$ but
not over the set of all documents taken together. Therefore, the distance between
$d_j$ and $d_k$ is not defined if $d_i$ is between $d_j$ and $d_k$.
3 Distance and Dissimilarity
The dissimilarity function, or distance in a conceptual subject space between two
documents $d_i$ and $d_j$, is denoted as $\sigma(d_i, d_j)$. The subject of a document, what
it is about, is considered to be determined by all subject related aspects or features
of the document. The subject-dissimilarity may be computed as a function of the
degree of difference in feature values between two documents.
Because the dissimilarity function may be understood as a distance measure in
a conceptual space, the following holds:
$$\sigma(d_i, d_i) = 0,$$
$$\sigma(d_i, d_j) = \sigma(d_j, d_i),$$
$$\sigma(d_i, d_k) \le \sigma(d_i, d_j) + \sigma(d_j, d_k).$$
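One simple dissimilarity function of this kind is the Hamming distance between binary feature vectors, that is, the number of features on which two documents differ. The sketch below assumes that choice (the paper may weight features differently) and checks by brute force that it behaves as a distance measure over all three-feature vectors. The name `hamming` and the three-feature universe are illustrative assumptions.

```python
from itertools import product


def hamming(a, b):
    """Assumed dissimilarity: number of features whose values differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Every possible document over three binary features.
vectors = list(product((0, 1), repeat=3))

# A document has zero dissimilarity to itself.
identity = all(hamming(a, a) == 0 for a in vectors)

# Dissimilarity does not depend on the order of the two documents.
symmetry = all(hamming(a, b) == hamming(b, a) for a in vectors for b in vectors)

# Going through a third document never shortens the conceptual distance.
triangle = all(
    hamming(a, c) <= hamming(a, b) + hamming(b, c)
    for a in vectors
    for b in vectors
    for c in vectors
)

print(identity, symmetry, triangle)
```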
A classification system may attempt to maximize or minimize factors combin-
ing the distances between documents both in physical space (shelf-distance) and
subject similarity or dissimilarity between documents. Numerous techniques for
combining distance and similarity measures are used in mathematical clustering
procedures [1].
A classification function consistent with the requirements for a document clas-
sification system, and in particular, the increasing distance-dissimilarity require-
ment, mandates that items be placed in weakly ascending order by the value of a
subject-dissimilarity measure as one moves out in either direction from any given
document. Weakly ascending order implies that the value for a selected feature