Author

Sang-Jo Lee

Bio: Sang-Jo Lee is an academic researcher from Kyungpook National University. The author has contributed to research in topics: Ontology (information science) & Ontology-based data integration. The author has an h-index of 11 and has co-authored 57 publications receiving 470 citations.


Papers
Journal ArticleDOI
TL;DR: The advantage and effectiveness of the proposed criteria are demonstrated through four numerical examples that compare the maximum delay bounds with results from recently published papers.

173 citations

Proceedings ArticleDOI
15 Dec 2005
TL;DR: Web pages are classified in real time not with experimental data or a learning process, but through similarity calculations between the terminology information extracted from Web pages and ontology categories, which results in more accurate document classification.
Abstract: The use of ontology to provide a mechanism for machine reasoning has continuously increased during the last few years. This paper suggests an automated method for document classification using an ontology, which expresses the terminology information and vocabulary contained in Web documents as a hierarchical structure. Ontology-based document classification involves determining the document features that represent Web documents most accurately, and classifying documents into the most appropriate categories after analyzing their contents against at least two predefined categories per document feature. In this paper, Web pages are classified in real time not with experimental data or a learning process, but through similarity calculations between the terminology information extracted from Web pages and ontology categories. This results in more accurate document classification, since the meanings and relationships unique to each document are taken into account.
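
As a rough illustration of the similarity-based assignment the abstract describes, the sketch below scores a page's extracted term vector against each ontology category's term vector and picks the best match. The cosine measure, the toy ontology, and all term weights are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: classify a page by cosine similarity between its
# extracted term vector and each ontology category's term vector.
# The category weights and the extraction step are illustrative.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(page_terms: Counter, ontology: dict[str, Counter]) -> str:
    # Pick the category whose terminology best matches the page.
    return max(ontology, key=lambda cat: cosine(page_terms, ontology[cat]))

ontology = {
    "sports":  Counter({"match": 3, "team": 2, "score": 2}),
    "finance": Counter({"stock": 3, "market": 2, "rate": 1}),
}
page = Counter({"team": 4, "score": 1, "coach": 2})
print(classify(page, ontology))  # -> "sports"
```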

40 citations

Journal ArticleDOI
TL;DR: A new indexing formalism is developed that considers not only the terms in a document, but also the concepts, and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document.
Abstract: Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.
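
A minimal sketch of the concept-augmented weighting idea: assuming each index term can be mapped to a concept cluster, a term's weight is its frequency boosted by how strongly its cluster is represented in the document. The term-to-concept mapping, the additive boost, and all names are illustrative, not the paper's exact formulation.

```python
# Sketch of concept-augmented index weighting: a term's weight is its TF
# plus a boost proportional to the document-level frequency of the
# concept cluster it belongs to. Mapping and boost factor are illustrative.
from collections import Counter

CONCEPTS = {"bank": "finance", "loan": "finance", "river": "geography"}

def concept_weights(doc_terms: list[str], boost: float = 0.5) -> dict[str, float]:
    tf = Counter(doc_terms)
    # Concept importance: total frequency of its member terms in the doc.
    concept_tf = Counter(CONCEPTS[t] for t in doc_terms if t in CONCEPTS)
    weights = {}
    for term, freq in tf.items():
        concept = CONCEPTS.get(term)
        weights[term] = freq + boost * concept_tf.get(concept, 0)
    return weights

print(concept_weights(["bank", "loan", "bank", "rate"]))
# -> {'bank': 3.5, 'loan': 2.5, 'rate': 1.0}
```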

40 citations

Journal ArticleDOI
TL;DR: Through experiments on the TREC-2 collection of Wall Street Journal documents, it is shown that the proposed indexing formalism outperforms an indexing method based on term frequency (TF), especially in regard to the highest-ranked documents.

37 citations

Proceedings ArticleDOI
19 Jul 2009
TL;DR: A novel method to translate tags attached to multimedia contents for cross-language retrieval, selecting the optimal translation from possible candidates based on network similarity, even when neither textual contexts nor sophisticated language resources are available.
Abstract: This paper proposes a novel method to translate tags attached to multimedia contents for cross-language retrieval. The main issue in this problem is the sense disambiguation of tags given with few textual contexts. In order to solve this problem, the proposed method represents both a tag and its translation candidates as networks of co-occurring tags, since a network allows a richer expression of context than alternatives such as co-occurrence vectors. The method translates a tag by selecting the optimal one from the possible candidates based on network similarity, even when neither textual contexts nor sophisticated language resources are available. Experiments on the MIR Flickr-2008 test set show that the proposed method achieves 90.44% accuracy in translating tags from English into German, which is significantly higher than the baseline methods of frequency-based translation and co-occurrence-based translation.
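
The following sketch illustrates the network-based candidate selection the abstract describes, using Jaccard overlap between sets of co-occurring tags as a stand-in for the paper's network similarity. The toy networks, and the assumption that both languages' co-occurring tags have already been mapped into a shared space, are illustrative.

```python
# Sketch: choose the translation candidate whose co-occurrence network
# (set of neighboring tags) best overlaps the source tag's network.
# Jaccard overlap stands in for the paper's network similarity measure,
# and the toy networks below are invented for illustration.

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def translate(tag: str, en_net: dict[str, set[str]],
              candidates: dict[str, set[str]]) -> str:
    # candidates maps each possible translation to its own network of
    # co-occurring tags (assumed mapped into a shared vocabulary).
    src = en_net[tag]
    return max(candidates, key=lambda c: jaccard(src, candidates[c]))

en_net = {"bank": {"money", "finance", "atm"}}
candidates = {
    "Bank (Geldinstitut)": {"money", "finance", "credit"},
    "Bank (Sitzbank)": {"park", "wood", "seat"},
}
print(translate("bank", en_net, candidates))  # -> "Bank (Geldinstitut)"
```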

24 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.

First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.

Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.

Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.

Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Journal ArticleDOI
TL;DR: This paper provides a review of the theory and methods of document classification and text mining, focusing mainly on text representation and machine learning techniques.
Abstract: With the increasing availability of electronic documents and the rapid growth of the World Wide Web, automatic categorization of documents has become the key method for organizing information and for knowledge discovery. Proper classification of e-documents, online news, blogs, e-mails and digital libraries needs text mining, machine learning and natural language processing techniques to extract meaningful knowledge. The aim of this paper is to highlight the important techniques and methodologies that are employed in text document classification, while at the same time raising awareness of some of the interesting challenges that remain to be solved, focused mainly on text representation and machine learning techniques. This paper provides a review of the theory and methods of document classification and text mining, focusing on the existing literature.

546 citations

Proceedings ArticleDOI
29 Mar 2010
TL;DR: This paper provides an overview of the various strategies that were devised for automatic visual concept detection using the MIR Flickr collection, and discusses results from various experiments in combining social data and low-level content-based descriptors to improve the accuracy of visual concept classifiers.
Abstract: The MIR Flickr collection consists of 25000 high-quality photographic images from thousands of Flickr users, made available under the Creative Commons license. The database includes all the original user tags and EXIF metadata. Additionally, detailed and accurate annotations are provided for topics corresponding to the most prominent visual concepts in the user tag data. The rich metadata allow for a wide variety of image retrieval benchmarking scenarios. In this paper, we provide an overview of the various strategies that were devised for automatic visual concept detection using the MIR Flickr collection. In particular, we discuss results from various experiments in combining social data and low-level content-based descriptors to improve the accuracy of visual concept classifiers. Additionally, we present retrieval results obtained by relevance feedback methods, demonstrating (i) how their performance can be enhanced using features based on visual concept classifiers, and (ii) how their performance, based on small samples, can be measured relative to their large-sample classifier counterparts. Finally, we identify a number of promising trends and ideas in visual concept detection. To keep the MIR Flickr collection up to date with these developments, we have formulated two new initiatives to extend the original image collection. First, the collection will be extended to one million Creative Commons Flickr images. Second, a number of state-of-the-art content-based descriptors will be made available for the entire collection.

374 citations

Proceedings ArticleDOI
Susan Brewer
01 Sep 1959
TL;DR: The letter and/or sound combinations that make up a human language are limited by the human's ability to pronounce these sounds. Therefore, the standard library search, which as a rule looks for all possible combinations of letters to find a word, is wasteful.
Abstract: The letter and/or sound combinations that make up a human language are limited by the human's ability to pronounce these sounds. Therefore, the standard library search, which as a rule looks for all possible combinations of letters to find a word, is wasteful. Certain letters simply cannot be followed by certain other letters, and a search for them is senseless. Following this same line of reasoning, letters very frequently occur in the combinations that are germane to the particular language. The growing amount of alphanumeric information presently being stored on magnetic tape presents increasingly difficult problems in both the number of tape reels used and the time necessary to search this mass of information in order to extract pertinent literature. At the present time most of this literature on tape utilizes the standard IBM 6-bit code to express alphanumeric symbols. It is entirely feasible to record standard English literature on tape, be it professional abstracts or novels, using only approximately two-thirds of the binary bits required to represent the same piece of written material in the conventional code. This can be accomplished by setting up, in a 9-bit code, the 400-odd letter combinations occurring most frequently. A 9-bit representation allows the programmer to set up as many as 512 symbols, thus leaving sufficient leeway to assign symbols to the most frequently used words, mathematical symbols, and professional expressions that are expected to be encountered in the literature to be recorded. In addition, these relatively short 9-bit symbols can be assigned to all key words that it may be necessary to look for later, thereby accelerating any future library search.
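
A small sketch of the bit accounting behind the abstract's claim: frequent letter groups each get one 9-bit codeword (512 slots available), while the baseline spends 6 bits per character. The tiny dictionary and simplified cost model below are illustrative; the paper proposes roughly 400 frequent combinations plus codewords for frequent words and symbols.

```python
# Sketch of the 9-bit dictionary coding idea: frequent letter groups get
# one 9-bit codeword each, while the baseline 6-bit code spends 6 bits
# per character. The dictionary here is a toy stand-in for the paper's
# ~400 most frequent combinations.

DICTIONARY = ["the ", "tion", "and ", "ing ", "of "]  # codeword = index

def encode_cost_bits(text: str) -> tuple[int, int]:
    baseline = 6 * len(text)          # 6-bit code: one symbol per char
    i, packed = 0, 0
    while i < len(text):
        for entry in DICTIONARY:
            if text.startswith(entry, i):
                packed += 9           # one 9-bit codeword for the group
                i += len(entry)
                break
        else:
            packed += 9               # a lone character also costs 9 bits
            i += 1
    return baseline, packed

base, packed = encode_cost_bits("the nation and the region")
print(base, packed)  # 150 117: packed wins when frequent groups dominate
```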

298 citations

Journal ArticleDOI
TL;DR: A collaborative filtering (CF) based recommendation methodology built on both implicit ratings and less ambitious ordinal scales is proposed, and a specific consensus model typically used in multi-criteria decision-making (MCDM) is employed to generate an ordinal scale-based customer profile.

222 citations