scispace - formally typeset
Search or ask a question
Author

Tae-Kyun Kim

Other affiliations: KAIST, University of Cambridge, NEC  ...read more
Bio: Tae-Kyun Kim is an academic researcher from Imperial College London. The author has contributed to research in topics: Pose & Facial recognition system. The author has an hindex of 51, co-authored 295 publications receiving 9522 citations. Previous affiliations of Tae-Kyun Kim include KAIST & University of Cambridge.


Papers
More filters
Journal ArticleDOI
TL;DR: A novel discriminative learning method over sets is proposed for set classification that maximizes the canonical correlations of within-class sets and minimizes thecanon correlations of between- class sets.
Abstract: We address the problem of comparing sets of images for object recognition, where the sets may represent variations in an object's appearance due to changing camera pose and lighting conditions. canonical correlations (also known as principal or canonical angles), which can be thought of as the angles between two d-dimensional subspaces, have recently attracted attention for image set matching. Canonical correlations offer many benefits in accuracy, efficiency, and robustness compared to the two main classical methods: parametric distribution-based and nonparametric sample-based matching of sets. Here, this is first demonstrated experimentally for reasonably sized data sets using existing methods exploiting canonical correlations. Motivated by their proven effectiveness, a novel discriminative learning method over sets is proposed for set classification. Specifically, inspired by classical linear discriminant analysis (LDA), we develop a linear discriminant function that maximizes the canonical correlations of within-class sets and minimizes the canonical correlations of between-class sets. Image sets transformed by the discriminant function are then compared by the canonical correlations. Classical orthogonal subspace method (OSM) is also investigated for the similar purpose and compared with the proposed method. The proposed method is evaluated on various object recognition problems using face image sets with arbitrary motion captured under different illuminations and image sets of 500 general objects taken at different views. The method is also applied to object category recognition using ETH-80 database. The proposed method is shown to outperform the state-of-the-art methods in terms of accuracy and efficiency

626 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: The Latent Regression Forest is presented, a novel framework for real-time, 3D hand pose estimation from a single depth image and shows that the LRF out-performs state-of-the-art methods in both accuracy and efficiency.
Abstract: In this paper we present the Latent Regression Forest (LRF), a novel framework for real-time, 3D hand pose estimation from a single depth image. In contrast to prior forest-based methods, which take dense pixels as input, classify them independently and then estimate joint positions afterwards, our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints. The searching process is guided by a learnt Latent Tree Model which reflects the hierarchical topology of the hand. Our main contributions can be summarised as follows: (i) Learning the topology of the hand in an unsupervised, data-driven manner. (ii) A new forest-based, discriminative framework for structured search in images, as well as an error regression step to avoid error accumulation. (iii) A new multi-view hand pose dataset containing 180K annotated images from 10 different subjects. Our experiments show that the LRF out-performs state-of-the-art methods in both accuracy and efficiency.

424 citations

Proceedings ArticleDOI
01 Jun 2018
TL;DR: This work collects RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations, and sees clear benefits of using hand pose as a cue for action recognition compared to other data modalities.
Abstract: In this work we study the use of 3D hand poses to recognize first-person dynamic hand actions interacting with 3D objects. Towards this goal, we collected RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations. To obtain hand pose annotations, we used our own mo-cap system that automatically infers the 3D location of each of the 21 joints of a hand model via 6 magnetic sensors and inverse kinematics. Additionally, we recorded the 6D object poses and provide 3D object models for a subset of hand-object interaction sequences. To the best of our knowledge, this is the first benchmark that enables the study of first-person hand actions with the use of 3D hand poses. We present an extensive experimental evaluation of RGB-D and pose-based action recognition by 18 baselines/state-of-the-art approaches. The impact of using appearance features, poses, and their combinations are measured, and the different training/testing protocols are evaluated. Finally, we assess how ready the 3D hand pose estimation field is when hands are severely occluded by objects in egocentric views and its influence on action recognition. From the results, we see clear benefits of using hand pose as a cue for action recognition compared to other data modalities. Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.

391 citations

Proceedings ArticleDOI
17 Jun 2007
TL;DR: A time-efficient action detection method based on dynamic learning of subspaces for tensor CCA for the case that actions are not aligned in the space-time domain is proposed.
Abstract: We introduce a new framework, namely tensor canonical correlation analysis (TCCA) which is an extension of classical canonical correlation analysis (CCA) to multidimensional data arrays (or tensors) and apply this for action/gesture classification in videos. By tensor CCA, joint space-time linear relationships of two video volumes are inspected to yield flexible and descriptive similarity features of the two videos. The TCCA features are combined with a discriminative feature selection scheme and a nearest neighbor classifier for action classification. In addition, we propose a time-efficient action detection method based on dynamic learning of subspaces for tensor CCA for the case that actions are not aligned in the space-time domain. The proposed method delivered significantly better accuracy and comparable detection speed over state-of-the-art methods on the KTH action data set as well as self-recorded hand gesture data sets.

343 citations

Journal ArticleDOI
TL;DR: This work provides a thorough review on the development of this problem in recent decades and inspects the recent advances in various aspects and proposes some interesting directions for future research.

340 citations


Cited by
More filters
Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Journal ArticleDOI

[...]

08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i —the square root of minus one, which seems an odd beast at that time—an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Proceedings ArticleDOI
23 Jun 2008
TL;DR: A new method for video classification that builds upon and extends several recent ideas including local space-time features,space-time pyramids and multi-channel non-linear SVMs is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Abstract: The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.

3,833 citations