
Showing papers in "The Computer Journal in 1998"


Journal ArticleDOI
TL;DR: The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model, and the EM result provides a measure of uncertainty about the associated classification of each data point.
Abstract: We consider the problem of determining the structure of clustered data, without prior knowledge of the number of clusters or any other information about their composition. Data are represented by a mixture model in which each component corresponds to a different cluster. Models with varying geometric properties are obtained through Gaussian components with different parametrizations and cross-cluster constraints. Noise and outliers can be modelled by adding a Poisson process component. Partitions are determined by the expectation-maximization (EM) algorithm for maximum likelihood, with initial values from agglomerative hierarchical clustering. Models are compared using an approximation to the Bayes factor based on the Bayesian information criterion (BIC); unlike significance tests, this allows comparison of more than two models at the same time, and removes the restriction that the models compared be nested. The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model. Moreover, the EM result provides a measure of uncertainty about the associated classification of each data point. Examples are given, showing that this approach can give performance that is much better than standard procedures, which often fail to identify groups that are either overlapping or of varying sizes and shapes.
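
As a rough illustration of the model-based approach described above, the following minimal sketch uses scikit-learn's GaussianMixture rather than the authors' own software; the synthetic data, the candidate cluster counts and the covariance families tried are invented for the example, and scikit-learn's bic() is defined so that lower values indicate a better model.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian clusters plus a few uniform "noise" points.
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(100, 2)),
    rng.normal([4, 4], 1.0, size=(100, 2)),
    rng.uniform(-5, 9, size=(10, 2)),
])

best = None
for k in range(1, 7):                              # candidate numbers of clusters
    for cov in ("spherical", "diag", "full"):      # different geometric models
        gm = GaussianMixture(n_components=k, covariance_type=cov,
                             random_state=0).fit(X)
        bic = gm.bic(X)                            # lower BIC is better in scikit-learn
        if best is None or bic < best[0]:
            best = (bic, k, cov, gm)

bic, k, cov, gm = best
print(f"chosen model: {k} clusters, '{cov}' covariances, BIC = {bic:.1f}")
post = gm.predict_proba(X)                         # per-point classification uncertainty
print("least certain point index:", int(post.max(axis=1).argmin()))

The model comparison here plays the role of the Bayes factor approximation in the paper: the number of clusters and the cluster geometry are selected together by the criterion rather than fixed in advance.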

2,576 citations


Journal ArticleDOI
TL;DR: The adaptive classifier combination method introduced here performed the best on this seven-class Yahoo news groups problem, achieving approximately 83% accuracy, which is comparable to the performance of other similar studies.
Abstract: The exponential growth of the internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology) individually and in combination. We studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Our experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on our data sets. Combinations of multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three different combination approaches, our adaptive classifier combination method introduced here performed the best. The best classification accuracy that we are able to achieve on this seven-class problem is approximately 83%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.
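
The combination idea can be sketched with scikit-learn on a tiny made-up corpus (the Yahoo data, the adaptive combination method and the reported 83% figure are not reproduced here); the example only shows naive Bayes, nearest neighbour and a decision tree combined by simple majority voting.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

# Tiny invented corpus standing in for the Yahoo news-group documents.
docs = ["stocks fall as markets react", "team wins the championship game",
        "new vaccine trial shows promise", "parliament passes budget bill",
        "striker scores twice in final", "central bank raises interest rates"]
labels = ["business", "sports", "health", "politics", "sports", "business"]

clf = make_pipeline(
    CountVectorizer(),
    VotingClassifier([                      # simple (majority) voting combination
        ("nb", MultinomialNB()),
        ("knn", KNeighborsClassifier(n_neighbors=1)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ], voting="hard"),
)
clf.fit(docs, labels)
print(clf.predict(["the striker wins the game"]))   # likely ['sports']

Dynamic classifier selection and adaptive combination replace the fixed vote with a per-document choice of classifier, which is where the paper's improvements come from.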

367 citations


Journal ArticleDOI
TL;DR: A variant of the quadtree structure is adapted to solve the problem of indexing dynamic attributes based on the key idea of using a linear function of time for each dynamic attribute that allows us to predict its value in the future.
Abstract: Dynamic attributes are attributes that change continuously over time making it impractical to issue explicit updates for every change. In this paper, we adapt a variant of the quadtree structure to solve the problem of indexing dynamic attributes. The approach is based on the key idea of using a linear function of time for each dynamic attribute that allows us to predict its value in the future. We contribute an algorithm for regenerating the quadtree-based index periodically that minimizes CPU and disk access cost. We also provide an experimental study of performance focusing on query processing and index update overheads.
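
A minimal sketch of the underlying idea, with invented class and parameter names: each dynamic attribute is stored as a linear function of time, the index is built over the positions predicted for some instant, and the whole structure is simply regenerated at each rebuild period (the paper's quadtree variant and its cost model are not reproduced).

from dataclasses import dataclass

@dataclass
class MovingPoint:
    # A dynamic attribute modelled as a linear function of time: x(t) = x0 + vx*t.
    x0: float
    y0: float
    vx: float
    vy: float
    def at(self, t):
        return (self.x0 + self.vx * t, self.y0 + self.vy * t)

class QuadNode:
    # Minimal point quadtree over [0, 1)^2 holding at most CAP points per leaf.
    CAP = 4
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size
        self.points, self.children = [], None
    def insert(self, p, pos):
        if self.children is None:
            self.points.append((p, pos))
            if len(self.points) > self.CAP and self.size > 1e-3:
                self._split()
            return
        self._child(pos).insert(p, pos)
    def _split(self):
        h = self.size / 2
        self.children = [QuadNode(self.x + dx * h, self.y + dy * h, h)
                         for dy in (0, 1) for dx in (0, 1)]
        pts, self.points = self.points, []
        for p, pos in pts:
            self._child(pos).insert(p, pos)
    def _child(self, pos):
        h = self.size / 2
        return self.children[(pos[0] >= self.x + h) + 2 * (pos[1] >= self.y + h)]

def rebuild(points, t):
    # Periodic regeneration: index every object at its predicted position at time t.
    root = QuadNode(0.0, 0.0, 1.0)
    for p in points:
        root.insert(p, p.at(t))
    return root

pts = [MovingPoint(0.1, 0.2, 0.01, 0.0), MovingPoint(0.8, 0.7, -0.02, 0.01)]
index = rebuild(pts, t=5.0)   # rebuilt again at the next regeneration instant
print("predicted positions at t = 5:", [p.at(5.0) for p in pts])

Between rebuilds, queries evaluate the stored linear functions, so no per-update disk writes are needed.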

218 citations


Journal ArticleDOI
TL;DR: Genetic algorithms have been used successfully to generate software test data automatically to give 100% branch coverage in up to two orders of magnitude fewer tests than random testing.
Abstract: Genetic algorithms have been used successfully to generate software test data automatically; all branches were covered with substantially fewer generated tests than simple random testing. We generated test sets which executed all branches in a variety of programs including a quadratic equation solver, remainder, linear and binary search procedures, and a triangle classifier comprising a system of five procedures. We regard the generation of test sets as a search through the input domain for appropriate inputs. The genetic algorithms generated test data to give 100% branch coverage in up to two orders of magnitude fewer tests than random testing. Whilst some of this benefit is offset by increased computation effort, the adequacy of the test data is improved by the genetic algorithm's ability to generate test sets which are at or close to the input subdomain boundaries. Genetic algorithms may be used for fault-based testing where faults associated with mistakes in branch predicates are revealed. The software has been deliberately seeded with faults in the branch predicates (i.e. mutation testing), and our system successfully killed 97% of the mutants.
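
A minimal sketch of the search idea (not the authors' tool): a genetic algorithm evolves integer inputs for a toy triangle classifier, with fitness given by a simple branch-distance measure for the equilateral branch; the population size, mutation rate and input ranges are invented.

import random
random.seed(1)

def triangle_type(a, b, c):
    if a + b <= c or b + c <= a or a + c <= b:
        return "not a triangle"
    if a == b == c:
        return "equilateral"            # the branch the search tries to cover
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

def branch_distance(ind):
    # Distance to satisfying the predicate a == b == c (0 means branch taken).
    a, b, c = ind
    return abs(a - b) + abs(b - c)

def evolve(pop_size=50, gens=200, lo=1, hi=100):
    pop = [[random.randint(lo, hi) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=branch_distance)
        if branch_distance(pop[0]) == 0:
            return pop[0]                            # branch covered
        parents = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            cut = random.randint(1, 2)
            child = p1[:cut] + p2[cut:]              # one-point crossover
            if random.random() < 0.3:                # mutation
                child[random.randrange(3)] = random.randint(lo, hi)
            children.append(child)
        pop = parents + children
    return min(pop, key=branch_distance)

best = evolve()
print(best, "->", triangle_type(*best))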

116 citations


Journal ArticleDOI
TL;DR: This paper makes the case for a spotting computation scheme which gives rise to a new classification methodology for processing real world data by surveying algorithms developed under the Real World Computing program and related work in Japan.
Abstract: This paper makes the case for a spotting computation scheme, which gives rise to a new classification methodology for processing real world data, by surveying algorithms developed under the Real World Computing (RWC) program and related work in Japan. A spotting function has a segmentation-free characteristic: it gracefully ignores most real world input data which do not belong to the task domain. Some members of the family of spotting methods have been developed under the RWC program. This paper shows how these spotting methods rise to the challenge of the case made for them. The common computational structure amongst spotting methods suggests an architecture for spotting computation.

96 citations


Journal ArticleDOI
TL;DR: The actions of Trojan horses and viruses in real computer systems are considered and a minimal framework for an adequate formal understanding of the phenomena is suggested.
Abstract: It is not possible to view a computer operating in the real world, including the possibility of Trojan horse programs and computer viruses, as simply a finite realisation of a Turing machine. We consider the actions of Trojan horses and viruses in real computer systems and suggest a minimal framework for an adequate formal understanding of the phenomena. Some conventional approaches, including biological metaphors, are shown to be inadequate; some suggestions are made towards constructing virally-resistant systems.

87 citations


Journal ArticleDOI
TL;DR: This work extends MML classification to domains where the ‘things’ have a known spatial arrangement and it may be expected that the classes of neighbouring things are correlated, and combines the Snob algorithm with a simple dynamic programming algorithm.
Abstract: Intrinsic classification, or unsupervised learning of a classification, was the earliest application of what is now termed minimum message length (MML) or minimum description length (MDL) inference. The MML algorithm ‘Snob’ and its relatives have been used successfully in many domains. These algorithms treat the ‘things’ to be classified as independent random selections from an unknown population whose class structure, if any, is to be estimated. This work extends MML classification to domains where the ‘things’ have a known spatial arrangement and it may be expected that the classes of neighbouring things are correlated. Two cases are considered. In the first, the things are arranged in a sequence and the correlation between the classes of successive things is modelled by a first-order Markov process. An algorithm for this case is constructed by combining the Snob algorithm with a simple dynamic programming algorithm. The method has been applied to the classification of protein secondary structure. In the second case, the things are arranged on a two-dimensional (2D) square grid, like the pixels of an image. Correlation is modelled by a prior over patterns of class assignments whose log probability depends on the number of adjacent mismatched pixel pairs. The algorithm uses Gibbs sampling from the pattern posterior and a thermodynamic relation to calculate message length.
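
For the sequential case, the combination of per-thing class likelihoods with a first-order Markov model over neighbouring classes can be illustrated by a small Viterbi-style dynamic program. This is only in the spirit of the paper's algorithm: it is not Snob, and no message lengths are computed; the toy probabilities are invented.

import numpy as np

def map_classes(log_like, log_trans, log_init):
    """Most probable class sequence (Viterbi) given per-item class
    log-likelihoods and a first-order Markov model over classes."""
    n, k = log_like.shape
    score = log_init + log_like[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans        # cand[i, j]: previous class i -> class j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_like[t]
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 classes, 6 items; classes of neighbouring items tend to agree.
log_like = np.log(np.array([[.9, .1], [.8, .2], [.4, .6],
                            [.2, .8], [.3, .7], [.1, .9]]))
log_trans = np.log(np.array([[.8, .2], [.2, .8]]))
log_init = np.log(np.array([.5, .5]))
print(map_classes(log_like, log_trans, log_init))   # [0, 0, 1, 1, 1, 1]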

58 citations


Journal ArticleDOI
TL;DR: A new approach to database systems architecture is intended to take advantage of solid-state memory in combination with data compression to provide substantial performance improvements and is capable of greater cost/effectiveness than conventional approaches.
Abstract: Future database applications will require significant improvements in performance beyond the capabilities of conventional disk based systems. This paper describes a new approach to database systems architecture, which is intended to take advantage of solid-state memory in combination with data compression to provide substantial performance improvements. The compressed data representation is tailored to the data manipulation operations requirements. The architecture has been implemented and measurements of performance are compared to those obtained using other high-performance database systems. The results indicate from one to five orders of magnitude speed-up in retrieval, equivalent or slightly faster performance during insertion (and compression) of data, while achieving approximately one order of magnitude compression in data volume. The resultant architecture is thus capable of greater cost/effectiveness than conventional approaches.

46 citations


Journal ArticleDOI
TL;DR: This paper uses a Markov chain to describe the behavior of the mobile user and analyzes the best time when forwarding and resetting should be performed in order to optimize the service rate of the PCS network.
Abstract: This paper presents a methodology for evaluating the performance of forwarding strategies for location management in a personal communication services (PCS) mobile network. A forwarding strategy in the PCS network can be implemented by two mechanisms: a forwarding operation which follows a chain of databases to locate a mobile user and a resetting operation which updates the databases in the chain so that the current location of a mobile user can be known directly without having to follow a chain of databases. In this paper, we consider the PCS network as a server whose function is to provide services to the mobile user for ‘updating the location of the user as the user moves across a database boundary’ and ‘locating the mobile user’. We use a Markov chain to describe the behavior of the mobile user and analyze the best time when forwarding and resetting should be performed in order to optimize the service rate of the PCS network. We demonstrate the applicability of our approach with hexagonal and mesh coverage models for the PCS network and provide a physical interpretation of the result.
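
The forwarding/resetting trade-off can be illustrated by a small simulation with an invented threshold policy (the paper instead analyses the optimal forwarding and resetting times with a Markov model of user movement): longer forwarding chains make call delivery slower, while resetting more often costs extra database updates.

import random
random.seed(0)

def simulate(moves=10000, calls_per_move=0.5, reset_after=4):
    """Count database hops for locating a user under a forwarding strategy.
    After `reset_after` forwarding pointers, a resetting operation collapses
    the chain back to length 0 (an invented threshold policy for illustration)."""
    chain, lookup_hops, resets = 0, 0, 0
    for _ in range(moves):
        chain += 1                        # user crosses a database boundary: add a pointer
        if chain > reset_after:
            chain, resets = 0, resets + 1 # resetting: databases in the chain are updated
        if random.random() < calls_per_move:
            lookup_hops += chain          # a call must follow the whole forwarding chain
    return lookup_hops, resets

for threshold in (1, 2, 4, 8):
    hops, resets = simulate(reset_after=threshold)
    print(f"reset after {threshold} pointers: {hops} lookup hops, {resets} resets")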

45 citations


Journal ArticleDOI
TL;DR: The ‘software crisis’ is discussed as a social and cultural phenomenon, arguing that it can be viewed as (one more) manifestation of postmodernism.
Abstract: We discuss the ‘software crisis’ as a social and cultural phenomenon, arguing that it can be viewed as (one more) manifestation of postmodernism. We illustrate our argument with a range of examples taken from software engineering, demonstrating software engineering’s roots in (and

44 citations


Journal ArticleDOI
TL;DR: The algorithmic design of a worldwide location service for distributed objects is described, based on a worldwide distributed search tree in which addresses are stored at different levels, depending on the migration pattern of the object.
Abstract: We describe the algorithmic design of a worldwide location service for distributed objects. A distributed object can reside at multiple locations at the same time, and offers a set of addresses to allow client processes to contact it. Objects may be highly mobile like, for example, software agents or Web applets. The proposed location service supports regular updates of an object's set of contact addresses, as well as efficient look-up operations. Our design is based on a worldwide distributed search tree in which addresses are stored at different levels, depending on the migration pattern of the object. By exploiting an object's relative stability with respect to a region, combined with the use of pointer caches, look-up operations can be made highly efficient.

Journal ArticleDOI
TL;DR: Peter Wegner’s definition of computability differs markedly from the classical term as established by Church, Kleene, Markov, Post, Turing et al., and it is shown that Church's thesis still holds.
Abstract: Peter Wegner’s definition of computability differs markedly from the classical term as established by Church, Kleene, Markov, Post, Turing et al. Wegner identifies interaction as the main feature of today’s systems which is lacking in the classical treatment of computability. We compare the different approaches and argue whether or not Wegner’s criticism is appropriate. Taking into account the major arguments from the literature, we show that Church’s thesis still holds.

Journal ArticleDOI
TL;DR: This work shows that methods used to derive a checking experiment from a nondeterministic finite state machine can be extended if it is known that the implementation is equivalent to some (unknown) deterministic finite state machine.
Abstract: A number of authors have looked at the problem of deriving a checking experiment from a nondeterministic finite state machine that models the required behaviour of a system. We show that these methods can be extended if it is known that the implementation is equivalent to some (unknown) deterministic finite state machine. When testing a deterministic implementation, the test output provides information about the implementation under test and can thus guide future testing. The use of an adaptive test process is thus proposed.

Journal ArticleDOI
TL;DR: This paper reviews measures of similarity and dissimilarity between pairs of chemical molecules and the use of such measures for processing chemical databases, focusing upon measures that are based on fragment bit-string occurrence data.
Abstract: This paper reviews measures of similarity and dissimilarity between pairs of chemical molecules and the use of such measures for processing chemical databases. The applications discussed include similarity searching, database clustering and diversity analysis, focusing upon measures that are based on fragment bit-string occurrence data. The paper then discusses recent work on the calculation of similarity by aligning molecular fields and on the selection of structurally diverse subsets of chemical databases.
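
One of the fragment bit-string measures covered by such reviews is the Tanimoto (Jaccard) coefficient; a minimal sketch with toy 8-bit fingerprints (real fragment screens are far longer) is:

def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto (Jaccard) coefficient of two fragment bit-strings, held here as ints."""
    both = bin(fp_a & fp_b).count("1")     # fragments present in both molecules
    either = bin(fp_a | fp_b).count("1")   # fragments present in at least one molecule
    return both / either if either else 1.0

mol_a = 0b10110100                         # toy fingerprints standing in for
mol_b = 0b10100110                         # real fragment occurrence bit-strings
print(f"Tanimoto similarity: {tanimoto(mol_a, mol_b):.2f}")   # 3 common / 5 set -> 0.60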

Journal ArticleDOI
TL;DR: It is shown that for suitably large n and suitable values of p, a randomly chosen graph G ∈ G_{n,p} that admits an optimal k-IRS has, with high probability, k = Ω(n^(1 - 6/ln(np) - ln(np)/ln n)).
Abstract: Several methods exist for routing messages in a network without using complete routing tables (compact routing). In k-interval routing schemes (k-IRS), links carry up to k intervals each. A message is routed over a certain link if its destination belongs to one of the intervals of the link. We present some results for the necessary value of k in order to achieve shortest-path routing. Even though low values of k suffice for very structured networks, we show that for 'general graphs' interval routing cannot significantly reduce the space requirements for shortest-path routing. In particular we show that for suitably large n, there are suitable values of p such that for randomly chosen graphs G ∈ G_{n,p} the following holds, with high probability: if G admits an optimal k-IRS, then k = Ω(n^(1 - 6/ln(np) - ln(np)/ln n)). The result is obtained by means of a novel matrix representation for the shortest paths in a network.
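
To make the notion of a k-IRS concrete, the following sketch (invented toy graph and labelling, not the paper's construction) computes, for one fixed labelling 0..n-1, how many cyclic label intervals each link needs when every destination is assigned to one shortest-path link.

from collections import deque

def bfs_dist(adj, src):
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def intervals_needed(adj, n):
    """Assign each destination at each node to one shortest-path link, then count
    the cyclic label intervals (over labels 0..n-1) each link has to carry."""
    worst = 0
    for u in adj:
        du = bfs_dist(adj, u)
        dist_nb = {v: bfs_dist(adj, v) for v in adj[u]}
        per_link = {v: [] for v in adj[u]}
        for w in adj:
            if w == u:
                continue
            v = min(x for x in adj[u] if dist_nb[x][w] + 1 == du[w])
            per_link[v].append(w)
        for dests in per_link.values():
            dests.sort()
            runs = sum(1 for i in range(len(dests))
                       if (dests[i] - dests[i - 1]) % n != 1)
            worst = max(worst, runs)
    return worst

# A 6-cycle with a chord; with this labelling only a few intervals per link are needed.
adj = {0: [1, 3, 5], 1: [0, 2], 2: [1, 3], 3: [0, 2, 4], 4: [3, 5], 5: [0, 4]}
print("intervals per link needed for shortest-path routing:", intervals_needed(adj, 6))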

Journal ArticleDOI
TL;DR: This paper constructs a weakest precondition semantics from a relational semantics proposed by the Z standards panel and additionally establishes an isomorphism between weakest preconditions and relations.
Abstract: The lack of a method for developing programs from Z specifications is a widely recognized difficulty. In response to this problem, different approaches to the integration of Z with a refinement calculus have been proposed. These programming techniques are promising, but as far as we know, have not been formalized. Since they are based on refinement calculi formalized in terms of weakest preconditions, the definition of a weakest precondition semantics for Z is a significant contribution to the solution of this problem. In this paper, we actually construct a weakest precondition semantics from a relational semantics proposed by the Z standards panel. The construction provides reassurance as to the adequacy of the resulting semantics definition and additionally establishes an isomorphism between weakest preconditions and relations. Compositional formulations for the weakest precondition of some schema calculus expressions are provided.
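
The flavour of a weakest precondition semantics can be shown with a toy predicate-transformer sketch in which predicates are Python functions over a state dictionary; it illustrates wp for assignment, sequencing and conditionals only, and is unrelated to the Z schema calculus constructions of the paper.

def wp_assign(var, expr):
    # wp(var := expr, Q) = Q with var replaced by the value of expr
    return lambda Q: (lambda s: Q({**s, var: expr(s)}))

def wp_seq(wp1, wp2):
    # wp(S1; S2, Q) = wp(S1, wp(S2, Q))
    return lambda Q: wp1(wp2(Q))

def wp_if(cond, wp_then, wp_else):
    # wp(if b then S1 else S2, Q) holds in s iff the taken branch establishes Q
    return lambda Q: (lambda s: wp_then(Q)(s) if cond(s) else wp_else(Q)(s))

# Program: y := x; if y < 0 then y := -y else y := y   (i.e. y := |x|)
prog = wp_seq(wp_assign("y", lambda s: s["x"]),
              wp_if(lambda s: s["y"] < 0,
                    wp_assign("y", lambda s: -s["y"]),
                    wp_assign("y", lambda s: s["y"])))

def post(s):                      # postcondition Q: y >= 0
    return s["y"] >= 0

pre = prog(post)                  # weakest precondition of the program w.r.t. Q
print(pre({"x": -7}), pre({"x": 3}))   # True True: Q is established from any initial x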

Journal ArticleDOI
TL;DR: Implementing the DRM method within LSI++ not only provides downdating functionality, but is less time consuming than recomputing the SVD when removing a term, document or both.
Abstract: Due to the growth of large data collections, information retrieval or database searching is of vital importance. Lexical matching techniques may retrieve irrelevant or inaccurate results because of synonyms and polysemous words, so effective concept-based techniques are needed. One such technique is latent semantic indexing (LSI) which uses a vector-space approach by identifying documents whose content is related to the user's query in order of similarity. LSI uses the singular value decomposition (SVD) of term-by-document matrix to encode the terms and documents in a vector-space model. Existing methods for removing terms or documents from the term-document space are either time consuming or do not sufficiently change the term-document relationships. This paper presents a new method for downdating, downdating the reduced model (or DRM) method, and discusses its implementation into the LSI++ software environment. The DRM method can be used to assess the effect that a term or document has on the clustering of relevant information in a collection and for the incorporation of user feedback in the existing LSI model. Implementing the DRM method within LSI++ not only provides downdating functionality, but is less time consuming than recomputing the SVD when removing a term, document or both. The DRM method is a viable algorithm for dynamic information modeling and retrieval.
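
For readers unfamiliar with LSI itself, here is a small numpy sketch of the reduced model that the DRM method maintains (an invented toy term-by-document matrix, a rank-2 SVD and a cosine-ranked query); the downdating algorithm itself is not reproduced.

import numpy as np

terms = ["car", "engine", "road", "stock", "market", "price"]
docs = ["d1", "d2", "d3", "d4"]
# Term-by-document matrix (rows = terms, columns = documents).
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 2, 1],
              [0, 0, 1, 2],
              [0, 1, 1, 2]], dtype=float)

k = 2                                      # rank of the reduced LSI model
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T     # A ~ Uk @ diag(sk) @ Vk.T
doc_vecs = Vk * sk                         # documents in the reduced space

def query(words):
    q = np.array([1.0 if t in words else 0.0 for t in terms])
    qk = q @ Uk                            # project the query into the LSI space
    sims = doc_vecs @ qk / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qk))
    return sorted(zip(docs, sims), key=lambda p: -p[1])

print(query({"car", "road"}))              # documents ranked by conceptual similarity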

Journal ArticleDOI
TL;DR: The conclusion is that simple links, whether embedded or separate, generic links, and some adaptive links all give hypertext systems the power of finite state automata.
Abstract: In this paper, we study how linking mechanisms contribute to the expressiveness of hypertext systems. For this purpose, we formalize hypertext systems as abstract machines. As the primary benefit of hypertext systems is to be able to read documents non-linearly, their expressiveness is defined in terms of the ability to follow links. Then, we classify hypertext systems according to the power of the underlying automaton. The model allows us to compare embedded versus separate links and simple versus generic links. Then, we investigate history mechanisms, adaptive hypertexts and functional links. Our conclusion is that simple links, whether embedded or separate, generic links and some adaptive links all give hypertext systems the power of finite state automata. The history mechanism confers to them the power of pushdown automata, whereas the general functional links give them Turing completeness.

Journal ArticleDOI
TL;DR: A definition of sequence similarity based on the shape of sequences is introduced to handle sequence matching with linear scaling in both amplitude and time dimensions and a fast sequence searching algorithm based on extendable hashing is proposed.
Abstract: In real life, data collected day by day often appear in sequences and this type of data is called sequence data. The technique of searching for similar patterns among sequence data is very important in many applications. We first point out that there are some deficiencies in the existing definitions of sequence similarity. We then introduce a definition of sequence similarity based on the shape of sequences. The definition is also extended to handle sequence matching with linear scaling in both amplitude and time dimensions. A fast sequence searching algorithm based on extendable hashing is also proposed. The algorithm can match linearly scaled sequences and guarantee that no qualified data subsequence is falsely rejected. Several experiments are performed on real data (stock price movement) and synthetic data to measure the performance of the algorithm in different aspects.
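
A naive sketch of shape-based matching under amplitude scaling only (the paper's time-dimension scaling and extendable-hashing index are not reproduced): each window is normalized to zero mean and unit norm so that only its shape matters, and the closest window to the query pattern is found by a linear scan; the data below are invented.

import numpy as np

def normalize(w):
    # Remove offset and amplitude so only the shape of the subsequence remains.
    w = np.asarray(w, dtype=float)
    w = w - w.mean()
    norm = np.linalg.norm(w)
    return w / norm if norm else w

def best_match(series, pattern):
    """Naive scan: return the start index of the window whose shape is closest
    to the query pattern (linear amplitude scaling handled by normalization)."""
    m = len(pattern)
    p = normalize(pattern)
    dists = [np.linalg.norm(normalize(series[i:i + m]) - p)
             for i in range(len(series) - m + 1)]
    return int(np.argmin(dists)), float(min(dists))

prices = [10, 11, 13, 12, 11, 20, 26, 32, 29, 27, 15, 14]
pattern = [1, 4, 7, 5.5, 4.5]              # "rise then partial fall", any amplitude
print(best_match(prices, pattern))          # matches the scaled bump starting at index 5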

Journal ArticleDOI
TL;DR: The proximity relations inherent in triangulations of geometric data can be exploited in the implementation of nearest-neighbour search procedures, relevant to applications such as terrain analysis, cartography and robotics, in which triangulation may be used to model the spatial data.
Abstract: The proximity relations inherent in triangulations of geometric data can be exploited in the implementation of nearest-neighbour search procedures. This is relevant to applications such as terrain analysis, cartography and robotics, in which triangulations may be used to model the spatial data. Here we describe neighbourhood search procedures within constrained Delaunay triangulations of the vertices of linear objects, for the queries of nearest object to an object and the nearest object to an arbitrary point. The procedures search locally from object edges, or from a query point, to build triangulated regions that extend from the source edge or point by a distance at least equal to that to its nearest neighbouring feature. Several geographical datasets have been used to evaluate the procedures experimentally. Average numbers of edge‐edge distance calculations to find the nearest line feature edge disjoint to another line feature edge ranged between 15 and 39 for the different datasets examined, while the average numbers of point‐edge distance calculations to determine the nearest edge to an arbitrary point ranged between 7 and 35.
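
A simplified, vertex-based analogue of such a neighbourhood search can be sketched with scipy's Delaunay triangulation: starting from an arbitrary vertex, repeatedly step to a triangulation neighbour that is closer to the query point; in a Delaunay triangulation a vertex with no closer neighbour is the nearest vertex overall. The constrained triangulations, linear objects and edge-to-edge distances of the paper are not handled here.

import numpy as np
from scipy.spatial import Delaunay

def nearest_vertex(tri, pts, q, start=0):
    """Walk the Delaunay triangulation from `start`, always moving to a
    neighbouring vertex closer to q, until no neighbour is closer."""
    indptr, indices = tri.vertex_neighbor_vertices
    cur = start
    while True:
        nbrs = indices[indptr[cur]:indptr[cur + 1]]
        best = min(nbrs, key=lambda v: np.linalg.norm(pts[v] - q))
        if np.linalg.norm(pts[best] - q) >= np.linalg.norm(pts[cur] - q):
            return cur
        cur = best

rng = np.random.default_rng(0)
pts = rng.random((200, 2))
tri = Delaunay(pts)
q = np.array([0.5, 0.5])
v = nearest_vertex(tri, pts, q)
brute = int(np.argmin(np.linalg.norm(pts - q, axis=1)))
print(v, brute, v == brute)   # the local walk agrees with brute-force search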

Journal ArticleDOI
TL;DR: A generalization of the test sequencing problem, originally defined for symmetrical tests, to cover asymmetrical tests is presented; the same heuristic employed in the traditional solution of the problem can also be employed for the generalized case.
Abstract: In this paper we present the generalization of the test sequencing problem, originally defined for symmetrical tests, that also covers asymmetrical tests. We prove that the same heuristic that has been employed in the traditional solution of the problem (e.g. the AO* algorithm with a heuristic based on Huffman coding) can also be employed for the generalized case. Examples are given to illustrate the approach.

Journal ArticleDOI
TL;DR: In this article, the authors examine the nature and significance of various potential attacks, and survey the defence options available, concluding that IT owners need to think of the threat in more global terms, and to give a new focus and priority to their defence.
Abstract: Large-scale commercial, industrial and financial operations are becoming ever more interdependent, and ever more dependent on IT. At the same time, the rapidly growing interconnectivity of IT systems, and the convergence of their technology towards industry-standard hardware and software components and sub-systems, renders these IT systems increasingly vulnerable to malicious attack. This paper is aimed particularly at readers concerned with major systems employed in medium to large commercial or industrial enterprises. It examines the nature and significance of the various potential attacks, and surveys the defence options available. It concludes that IT owners need to think of the threat in more global terms, and to give a new focus and priority to their defence. Prompt action can ensure a major improvement in IT resilience at a modest marginal cost, both in terms of finance and in terms of normal IT operation.

Journal ArticleDOI
TL;DR: In this paper an algorithmic transformation from a trace-based specification of a concurrent system to a Petri Net model is described, and causal dependencies between behaviours of the system components are introduced in the net model through the definition of external assumptions.
Abstract: CSP and Petri Nets are powerful formalisms for the specification and the analysis of concurrent systems. We present an approach to their integration to take advantage of both formalisms. In particular the GSPN class is used to address dependability and real-time aspects. In this paper an algorithmic transformation from a trace-based specification of a concurrent system to a Petri Net model is described. Causal dependencies between behaviours of the system components are introduced in the net model through the definition of external assumptions. The steps of the integration are illustrated by applying them to an unmanned transportation problem.

Journal ArticleDOI
Boris Mirkin
TL;DR: Approximation structuring clustering appears to be not only a mathematical device to support, specify and extend many clustering techniques, but also a framework for mathematical analysis of interrelations among the techniques and their relations to other concepts and problems in data analysis, statistics, machine learning, data compression and decompression and the design and use of multiresolution hierarchies.
Abstract: Approximation structuring clustering is an extension of what is usually called 'square-error clustering' onto various cluster structures and data formats. It appears to be not only a mathematical device to support, specify and extend many clustering techniques, but also a framework for mathematical analysis of interrelations among the techniques and their relations to other concepts and problems in data analysis, statistics, machine learning, data compression and decompression and the design and use of multiresolution hierarchies. Based on the results found, a number of methods for solving data processing problems are described.

Journal ArticleDOI
TL;DR: In this article, the authors study decentralized probabilistic job dispatching and load balancing strategies which optimize the performance of heterogeneous M/G/1 computer systems, and derive closed form solutions for optimal dispatching probabilities which minimize the average job response time when all nodes have an identical coefficient of variation for job execution times.
Abstract: In this paper, we study decentralized probabilistic job dispatching and load balancing strategies which optimize the performance of heterogeneous multiple computer systems. We present a model to study a heterogeneous multiple computer system with a decentralized stochastic job dispatching mechanism, where nodes are treated as M/G/1 servers. We discuss a way to implement a virtual centralized job dispatcher using a distributed control mechanism. We derive closed form solutions for optimal job dispatching probabilities which minimize the average job response time, when all nodes have an identical coefficient of variation for job execution times. We also generalize the results to the case where nodes have different coefficients of variation for job execution times.
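
The optimization can be illustrated numerically (rather than via the paper's closed form) using the Pollaczek-Khinchine formula for the M/G/1 mean waiting time and a constrained optimizer; the arrival rate, service times and coefficient of variation below are invented.

import numpy as np
from scipy.optimize import minimize

lam = 3.0                       # total job arrival rate
ES = np.array([0.2, 0.4, 0.5])  # mean service times of the three heterogeneous nodes
C2 = 1.0                        # squared coefficient of variation (same for all nodes)

def mean_response(p):
    lam_i = lam * p
    rho = lam_i * ES
    if np.any(rho >= 1):                     # unstable assignment: penalize heavily
        return 1e9
    wait = lam_i * (1 + C2) * ES**2 / (2 * (1 - rho))   # Pollaczek-Khinchine waiting time
    return float(np.sum(p * (ES + wait)))    # average response time over all jobs

p0 = np.ones(3) / 3
res = minimize(mean_response, p0,
               bounds=[(0, 1)] * 3,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
print("dispatching probabilities:", np.round(res.x, 3))
print("mean response time:", round(mean_response(res.x), 3))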

Journal ArticleDOI
TL;DR: The statement that the interval routing algorithm cannot be optimal in networks with arbitrary topology is correct, but the lower bound on the longest routing path that was derived is not; this paper gives the counterproof and the corrected bound.
Abstract: Interval routing is a space-efficient routing method for computer networks. The method is said to be optimal if it can generate optimal routing paths for any source-destination node pair. A path is optimal if it is a shortest path between the two nodes involved. A seminal result in the area, however, has pointed out that 'the interval routing algorithm cannot be optimal in networks with arbitrary topology'. The statement is correct but the lower bound on the longest routing path that was derived is not. We give the counterproof in this paper and the corrected bound.

Journal ArticleDOI
TL;DR: A review is given for the data analysis task of representing a symmetric proximity matrix by a sum of matrices each having the restrictive anti-Robinson (AR) form, with an emphasis on the inclusion of an optimal monotonic transformation of the given proximity matrix.
Abstract: A review is given for the data analysis task of representing a symmetric proximity matrix, defined for some object set, by a sum of matrices each having the restrictive anti-Robinson (AR) form. An emphasis is placed on the inclusion of an optimal monotonic transformation of the given proximity matrix and what each AR component of an additive decomposition might be depicting by imposing further restrictions to obtain approximating matrices that are strongly AR, or that provide unidimensional scales or ultrametrics. Three published data sets are used to illustrate the process of constructing the initial decomposition and then giving a substantive interpretation subsequently for each of the terms in the fitted sum. An extension to circular anti-Robinson (CAR) matrices is also discussed briefly and illustrated, along with further restrictions to circular unidimensional scales and circular strongly AR forms.

Journal ArticleDOI
TL;DR: The distributed asynchronous atomic action scheme for Ada 95 developed here makes use of many unique Ada 95 features including protected objects, asynchronous transfer of control and the distributed systems annex.
Abstract: This paper discusses the development of a distributed asynchronous atomic action scheme for Ada 95. The scheme makes use of many unique Ada 95 features including protected objects, asynchronous transfer of control and the distributed systems annex. We present the packages which implement the local and global action support and illustrate their use in a (partial) implementation of the FZI production cell problem. We also discuss a number of variations of the model and how these might be included. Finally, we discuss how the distribution model used in Ada 95 has influenced our design.

Journal ArticleDOI
TL;DR: In this paper, codes that detect a single substitution error or a single transposition error are studied and it is shown that such codes of length n over an alphabet of q characters have at most q^(n-1) codewords if q > 3 and at most [2^n/3] codewords if q = 2.
Abstract: Substitution errors, where individual characters are altered, and transposition errors, where two consecutive characters are interchanged, are commonly caused by human operators. In this paper, codes that detect a single substitution error or a single transposition error are studied. In particular, it is shown that such codes of length n over an alphabet of q characters have at most q^(n-1) codewords if q > 3 and at most [2^n/3] codewords if q = 2. Codes which have that many codewords are called optimal codes. We present optimal codes for all values of n and q. Simple encoding techniques for these codes are also described.
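
The detection property is easy to verify by brute force for small parameters. The sketch below checks a standard weighted-checksum code over q = 5, n = 3 (codewords whose checksum a + 2b + 3c is 0 mod 5), which attains q^(n-1) = 25 codewords; this illustrates the property being studied, not the paper's constructions for general n and q.

from itertools import product

def corrupted(word, q):
    """All words obtained from `word` by one substitution or one adjacent transposition."""
    w = list(word)
    for i in range(len(w)):                       # single substitution errors
        for c in range(q):
            if c != w[i]:
                yield tuple(w[:i] + [c] + w[i+1:])
    for i in range(len(w) - 1):                   # single transposition errors
        if w[i] != w[i + 1]:
            yield tuple(w[:i] + [w[i+1], w[i]] + w[i+2:])

def detects_all(code, q):
    """True if no single substitution or transposition turns one codeword into another."""
    codeset = set(code)
    return all(c not in codeset for word in code for c in corrupted(word, q))

q, n = 5, 3
code = [w for w in product(range(q), repeat=n)
        if sum((i + 1) * x for i, x in enumerate(w)) % q == 0]
print(len(code), "codewords; detects all single errors:", detects_all(code, q))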

Journal ArticleDOI
TL;DR: Combining the fault-tolerant procedure and the optimal broadcasting algorithm, fault-tolerant broadcasting is achieved on the arrangement graph.
Abstract: This paper proposes a distributed fault-tolerant algorithm for one-to-all broadcasting in the one-port communication model on the arrangement graph. Exploiting the hierarchical properties of the arrangement graph to constitute different-sized broadcasting trees for different-sized subgraphs, we propose a distributed algorithm with optimal time complexity and without message redundancy for one-to-all broadcasting in the one-port communication model for the fault-free arrangement graph. According to the property that there is a family of k(n - k) node-disjoint paths between any two nodes, we develop a fast fault-tolerant procedure capable of sending a message from a node to its adjacent nodes on the (n, k)-arrangement graph with less than k(n - k) faulty edges. Combining the fault-tolerant procedure and the optimal broadcasting algorithm, fault-tolerant broadcasting is achieved on the arrangement graph. It is shown that a message can be broadcast to all the other (n!/(n - k)!) - 1 processors in O(k lg n) steps if no faults exist on the (n, k)-arrangement graph, and in O(k^2 lg n + k lg^2 n) steps if the number of faulty edges is less than k(n - k).
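
To make the structure concrete, the following sketch builds a small (n, k)-arrangement graph with itertools and measures how many rounds a fault-free broadcast needs when every informed node informs all of its neighbours in one round. This is an all-port simplification: the paper's one-port model and its fault-tolerant procedure are more restrictive and are not reproduced here.

from itertools import permutations
from collections import deque

def arrangement_graph(n, k):
    """(n, k)-arrangement graph: vertices are arrangements of k symbols out of n;
    two vertices are adjacent iff they differ in exactly one position."""
    verts = list(permutations(range(n), k))
    adj = {v: [] for v in verts}
    for v in verts:
        for i in range(k):
            for s in range(n):
                if s not in v:
                    adj[v].append(v[:i] + (s,) + v[i+1:])
    return adj

def broadcast_levels(adj, src):
    """BFS levels: rounds needed if every informed node informs all neighbours per round."""
    level = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                q.append(w)
    return max(level.values())

n, k = 5, 2
adj = arrangement_graph(n, k)
src = next(iter(adj))
print(len(adj), "vertices; all-port broadcast reaches every node in",
      broadcast_levels(adj, src), "rounds")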