Author
Chew Lim Tan
Other affiliations: Information Technology University, Institute for Infocomm Research Singapore, Beijing University of Technology
Bio: Chew Lim Tan is an academic researcher from the National University of Singapore. The author has contributed to research in topics including image segmentation and optical character recognition, has an h-index of 60, and has co-authored 399 publications receiving 13,795 citations. Previous affiliations of Chew Lim Tan include Information Technology University and the Institute for Infocomm Research Singapore.
Papers published on a yearly basis
Papers
TL;DR: This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy.
Abstract: Discrete values have important roles in data mining and knowledge discovery. They are about intervals of numbers which are more concise to represent and specify, easier to use and comprehend as they are closer to a knowledge-level representation than continuous values. Many studies show induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable, and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms found in the literature require discrete features. All these prompt researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. There are numerous discretization methods available in the literature. It is time for us to examine these seemingly different methods for discretization and find out how different they really are, what the key components of a discretization process are, and how we can improve the current level of research for new development as well as the use of existing methods. This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy. Contributions of this paper are an abstract description summarizing existing discretization methods, a hierarchical framework to categorize the existing methods and pave the way for further development, concise discussions of representative discretization methods, extensive experiments and their analysis, and some guidelines as to how to choose a discretization method under various circumstances. We also identify some issues yet to be solved and future research for discretization.
981 citations
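To make the distinction concrete, here is a minimal Python sketch of two of the simplest unsupervised discretization schemes such a survey covers, equal-width and equal-frequency binning; the function names and the toy data are ours, not the paper's.

```python
import numpy as np

def equal_width_bins(values, k):
    # Cut the observed range into k intervals of equal width.
    edges = np.linspace(values.min(), values.max(), k + 1)
    return np.digitize(values, edges[1:-1])   # bin index in 0..k-1

def equal_frequency_bins(values, k):
    # Choose cut points so each interval holds roughly the same number of values.
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.digitize(values, edges[1:-1])

ages = np.array([18, 22, 25, 31, 40, 44, 52, 61, 70])
print(equal_width_bins(ages, 3))      # [0 0 0 0 1 1 1 2 2]
print(equal_frequency_bins(ages, 3))  # [0 0 0 1 1 1 2 2 2]
```

The two schemes already illustrate the survey's trade-off: equal-width is faster and simpler, while equal-frequency adapts to skewed distributions at slightly more cost.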
TL;DR: This study investigates several widely-used unsupervised and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms and proposes a new simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task.
Abstract: In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e. words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely-used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new simple supervised term weighting method, i.e. tf.rf, to improve the terms' discriminating power for the text categorization task. From the controlled experimental results, the supervised term weighting methods have mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, performs consistently better than the other term weighting methods, while the other supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popularly used tf.idf method does not show uniformly good performance across different data sets.
551 citations
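As a rough illustration, the sketch below computes a tf.rf weight using the relevance frequency rf = log2(2 + a / max(1, c)), where a and c count the positive- and negative-category documents containing the term. Treat the exact formula and the variable names as our reading of the paper rather than a verbatim implementation.

```python
import math

def rf(a, c):
    # Relevance frequency: a = positive-category documents containing the term,
    # c = documents in the other categories containing it.
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    # Term weight = raw term frequency scaled by the term's relevance frequency.
    return tf * rf(a, c)

# A term concentrated in the positive category gets a larger weight
# than one spread evenly across categories.
print(tf_rf(tf=3, a=90, c=10))   # ~10.4
print(tf_rf(tf=3, a=50, c=50))   # ~4.75
```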
TL;DR: A robust system based on the concepts of Mutual Direction Symmetry (MDS), Mutual Magnitude Symmetry (MMS) and Gradient Vector Symmetry (GVS) properties to identify text pixel candidates regardless of orientation, including curves, from natural scene images is presented.
Abstract: Text detection in real world images captured in unconstrained environments is an important yet challenging computer vision problem due to a great variety of appearances, cluttered backgrounds, and character orientations. In this paper, we present a robust system based on the concepts of Mutual Direction Symmetry (MDS), Mutual Magnitude Symmetry (MMS) and Gradient Vector Symmetry (GVS) properties to identify text pixel candidates regardless of orientation, including curves (e.g. circles and arc shapes), from natural scene images. The method works based on the fact that the text patterns in both Sobel and Canny edge maps of the input images exhibit similar behavior. For each text pixel candidate, the method explores SIFT features to refine the text pixel candidates, which results in text representatives. Next, an ellipse growing process is introduced based on a nearest neighbor criterion to extract the text components. The text is verified and restored based on text direction and a spatial study of the pixel distribution of components to filter out non-text components. The proposed method is evaluated on three benchmark datasets, namely ICDAR2005 and ICDAR2011 for horizontal text evaluation and MSRA-TD500 for non-horizontal straight text evaluation, and on our own dataset (CUTE80), which consists of 80 images for curved text evaluation, to show its effectiveness and superiority over existing methods.
413 citations
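The symmetry idea can be sketched loosely in Python: for a pair of edge pixels facing each other across a stroke, the gradient vectors should point in roughly opposite directions with comparable magnitudes. This is an illustrative approximation of the MDS/MMS intuition, not the paper's exact formulation; the thresholds and the use of np.gradient in place of Sobel kernels are our simplifications.

```python
import numpy as np

def gradient_maps(gray):
    # Per-pixel gradient direction and magnitude; np.gradient stands in
    # for the Sobel kernels used in the paper.
    gy, gx = np.gradient(gray.astype(float))
    return np.arctan2(gy, gx), np.hypot(gx, gy)

def symmetric_pair(theta, mag, p, q, angle_tol=0.3, mag_tol=0.25):
    # Direction symmetry: gradients at p and q roughly opposed (difference near pi).
    d = abs(abs(theta[p] - theta[q]) - np.pi)
    # Magnitude symmetry: gradient strengths within a relative tolerance.
    m = abs(mag[p] - mag[q]) / max(mag[p], mag[q], 1e-9)
    return d < angle_tol and m < mag_tol
```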
TL;DR: Empirical evidence that a neural network model is applicable to the prediction of foreign exchange rates and several issues on the frequency of sampling, choice of network architecture, forecasting periods, and measures for evaluating the model’s predictive power are reported.
Abstract: This paper reports empirical evidence that a neural network model is applicable to the prediction of foreign exchange rates. Time series data and technical indicators, such as moving average, are fed to neural networks to capture the underlying 'rules' of the movement in currency exchange rates. The exchange rates between the American Dollar and five other major currencies, the Japanese Yen, Deutsche Mark, British Pound, Swiss Franc and Australian Dollar, are forecast by the trained neural networks. The traditional rescaled range analysis is used to test the 'efficiency' of each market before using historical data to train the neural networks. The results presented here show that without the use of extensive market data or knowledge, useful prediction can be made and significant paper profits can be achieved for out-of-sample data with simple technical indicators. A further study on exchange rates between the Swiss Franc and American Dollar is also conducted. However, the experiments show that with an efficient market it is not easy to make profits using technical indicators or time series input neural networks. This article also discusses several issues on the frequency of sampling, choice of network architecture, forecasting periods, and measures for evaluating the model's predictive power. After presenting the experimental results, a discussion on future research concludes the paper.
392 citations
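A toy version of the setup, on synthetic data, might look like the following; the lag count, moving-average window, and network size are arbitrary choices of ours, not the paper's configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
rate = 1.5 + np.cumsum(rng.normal(0, 0.003, 600))   # synthetic exchange-rate series

def make_features(series, lags=5, ma=10):
    # Inputs: the last `lags` rates plus a moving average; target: the next rate.
    X, y = [], []
    for t in range(max(lags, ma), len(series)):
        X.append(list(series[t - lags:t]) + [series[t - ma:t].mean()])
        y.append(series[t])
    return np.array(X), np.array(y)

X, y = make_features(rate)
split = int(0.8 * len(X))                            # chronological train/test split
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X[:split], y[:split])
print("out-of-sample R^2:", model.score(X[split:], y[split:]))
```

Keeping the split chronological, as here, matters: shuffling would leak future rates into training and overstate the out-of-sample fit.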
TL;DR: This paper introduces a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints and significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
378 citations
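The bag-of-keypoints step reduces to nearest-word quantization followed by a histogram. Below is a minimal numpy sketch, with random arrays standing in for the pre-trained vocabulary and the dense SIFT descriptors.

```python
import numpy as np

def bag_of_keypoints(descriptors, vocabulary):
    # Quantize each local descriptor (e.g. dense SIFT, 128-D) to its nearest
    # visual word and return a normalized word histogram for the image patch.
    # Pairwise squared distances: |d|^2 - 2 d.v + |v|^2.
    d2 = (descriptors**2).sum(1)[:, None] - 2 * descriptors @ vocabulary.T \
         + (vocabulary**2).sum(1)[None, :]
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(1)
vocab = rng.normal(size=(50, 128))    # stand-in pre-trained vocabulary, 50 words
desc = rng.normal(size=(200, 128))    # stand-in descriptors from one character
print(bag_of_keypoints(desc, vocab).shape)   # (50,)
```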
Cited by
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at first encounter: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality.
Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …
33,785 citations
Journal Article
28,685 citations
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.

First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.

Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.

Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.

Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
13,246 citations
Posted Content
TL;DR: In this book, the author provides a unified and comprehensive theory of structural time series models, including a detailed treatment of the Kalman filter, for modelling economic and social time series and for addressing the special problems which the treatment of such series poses.
Abstract: In this book, Andrew Harvey sets out to provide a unified and comprehensive theory of structural time series models. Unlike the traditional ARIMA models, structural time series models consist explicitly of unobserved components, such as trends and seasonals, which have a direct interpretation. As a result the model selection methodology associated with structural models is much closer to econometric methodology. The link with econometrics is made even closer by the natural way in which the models can be extended to include explanatory variables and to cope with multivariate time series. From the technical point of view, state space models and the Kalman filter play a key role in the statistical treatment of structural time series models. The book includes a detailed treatment of the Kalman filter. This technique was originally developed in control engineering, but is becoming increasingly important in fields such as economics and operations research. This book is concerned primarily with modelling economic and social time series, and with addressing the special problems which the treatment of such series poses. The properties of the models and the methodological techniques used to select them are illustrated with various applications. These range from the modelling of trends and cycles in US macroeconomic time series to an evaluation of the effects of seat belt legislation in the UK.
4,252 citations
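The simplest structural model of this kind is the local level model, y_t = mu_t + eps_t with mu_t = mu_{t-1} + eta_t, for which the scalar Kalman filter fits in a few lines. A minimal sketch with illustrative variance values; the book's full treatment covers far more general state space forms.

```python
import numpy as np

def kalman_local_level(y, sigma_eps2, sigma_eta2, mu0=0.0, p0=1e7):
    # Kalman filter for: y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t.
    mu, p = mu0, p0                       # diffuse prior on the initial level
    filtered = []
    for obs in y:
        p = p + sigma_eta2                # predict: level follows a random walk
        k = p / (p + sigma_eps2)          # Kalman gain
        mu = mu + k * (obs - mu)          # update: blend prediction and observation
        p = (1 - k) * p
        filtered.append(mu)
    return np.array(filtered)

rng = np.random.default_rng(2)
level = np.cumsum(rng.normal(0, 0.1, 200))    # unobserved trend component
y = level + rng.normal(0, 0.5, 200)           # noisy observations
est = kalman_local_level(y, sigma_eps2=0.25, sigma_eta2=0.01)
```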