
Showing papers by "Zhi-Hua Zhou" published in 2020


Proceedings Article
Lan-Zhe Guo1, Zhen-Yu Zhang1, Yuan Jiang1, Yu-Feng Li1, Zhi-Hua Zhou1 
12 Jul 2020
TL;DR: A simple and effective safe deep SSL method is proposed to alleviate the harm caused by class distribution mismatch; its generalization is theoretically guaranteed to approach the optimal in the order $O(\sqrt{d \ln(n)/n})$, even faster than the convergence rate in supervised learning associated with massive parameters.
Abstract: Deep semi-supervised learning (SSL) has recently been shown to be very effective. However, its performance degrades seriously when the class distribution is mismatched; a common case is that the unlabeled data contain some classes not seen in the labeled data. Efforts on this issue remain limited. This paper proposes a simple and effective safe deep SSL method to alleviate the harm caused by such mismatch. In theory, the result learned by the new method is never worse than learning from labeled data alone, and its generalization is theoretically guaranteed to approach the optimal in the order $O(\sqrt{d \ln(n)/n})$, even faster than the convergence rate in supervised learning associated with massive parameters. In experiments on benchmark data, unlike existing deep SSL methods, which are no longer as good as supervised learning with 40% of unseen-class unlabeled data, the new method still achieves a performance gain with more than 60% of unseen-class unlabeled data. Moreover, the proposal is suitable for many deep SSL algorithms and can be easily extended to handle other cases of class distribution mismatch.

96 citations


Proceedings Article
03 Jun 2020
TL;DR: It is demonstrated that a simple restarted strategy is sufficient to attain the same regret guarantee as more sophisticated mechanisms such as sliding window or weighted penalty; a UCB-type algorithm is designed to balance exploitation and exploration and is restarted periodically to handle the drift of unknown parameters.
Abstract: This paper investigates the problem of nonstationary linear bandits, where the unknown regression parameter is evolving over time. Previous studies have adopted sophisticated mechanisms, such as sliding window or weighted penalty, to achieve near-optimal dynamic regret. In this paper, we demonstrate that a simple restarted strategy is sufficient to attain the same regret guarantee. Specifically, we design a UCB-type algorithm to balance exploitation and exploration, and restart it periodically to handle the drift of unknown parameters. Let $T$ be the time horizon, $d$ be the dimension, and $P_T$ be the path-length that measures the fluctuation of the evolving unknown parameter. Our approach enjoys an $\tilde{O}(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ dynamic regret, which is nearly optimal, matching the $\Omega(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ minimax lower bound up to logarithmic factors. Empirical studies also validate the efficacy of our approach.
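
The restart idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm: the class name `RestartedLinUCB`, the fixed confidence width `beta`, the ridge parameter `lam`, and the fixed restart `period` are all assumptions chosen for illustration. It only shows a ridge-regression UCB learner that throws away its statistics at regular intervals so that stale data cannot dominate after the parameter drifts.

```python
import math


class RestartedLinUCB:
    """LinUCB that resets its statistics every `period` rounds (illustrative sketch)."""

    def __init__(self, dim, period, lam=1.0, beta=1.0):
        self.dim, self.period, self.lam, self.beta = dim, period, lam, beta
        self.t = 0
        self._reset()

    def _reset(self):
        # V^{-1} starts as (lam * I)^{-1}; b accumulates reward-weighted features.
        self.v_inv = [[(1.0 / self.lam if i == j else 0.0) for j in range(self.dim)]
                      for i in range(self.dim)]
        self.b = [0.0] * self.dim

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.dim)) for row in self.v_inv]

    def select(self, arms):
        """Return the index of the arm with the largest upper confidence bound."""
        theta = self._matvec(self.b)  # ridge estimate V^{-1} b

        def ucb(x):
            vx = self._matvec(x)
            mean = sum(t * xi for t, xi in zip(theta, x))
            width = math.sqrt(max(sum(a * c for a, c in zip(x, vx)), 0.0))
            return mean + self.beta * width

        return max(range(len(arms)), key=lambda i: ucb(arms[i]))

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of V^{-1} after adding x x^T to V.
        vx = self._matvec(x)
        denom = 1.0 + sum(a * c for a, c in zip(x, vx))
        for i in range(self.dim):
            for j in range(self.dim):
                self.v_inv[i][j] -= vx[i] * vx[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
        self.t += 1
        if self.t % self.period == 0:
            self._reset()  # restart: forget data gathered under the old parameter
```

In a typical loop one would call `select` on the current arm set, observe a reward, and pass the chosen feature vector and reward to `update`. How the restart period should depend on $T$, $d$, and $P_T$ is a matter for the paper's analysis and is not reproduced here.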

51 citations


Journal Article
TL;DR: In this paper, the authors focus on recurrent neural networks (RNNs), especially gated RNNs whose inner mechanism is still not clearly understood and find that finite-state automata (FSA) have a more interpretable inner mechanism according to the definition of interpretability and can be learned from RNN as the interpretable structure.
Abstract: The interpretability of deep learning models has attracted extensive attention in recent years. It would be beneficial if we could learn an interpretable structure from deep learning models. In this article, we focus on recurrent neural networks (RNNs), especially gated RNNs, whose inner mechanism is still not clearly understood. We find that a finite-state automaton (FSA), which processes sequential data, has a more interpretable inner mechanism according to the definition of interpretability and can be learned from RNNs as the interpretable structure. We propose two methods to learn an FSA from an RNN based on two different clustering methods. With the learned FSA and via experiments on artificial and real data sets, we find that the FSA is more trustworthy than the RNN from which it was learned, which gives the FSA a chance to substitute for RNNs in applications involving human lives or dangerous facilities. Besides, we analyze how the number of gates affects the performance of an RNN. Our result suggests that gates in RNNs are important, but the fewer the better, which could serve as guidance for designing other RNNs. Finally, we observe that the FSA learned from an RNN gives semantically aggregated states, and its transition graph shows a very interesting view of how RNNs intrinsically handle text classification tasks.

50 citations


Journal Article
TL;DR: The Optimal margin Distribution Machine (ODM) is proposed, which can achieve a better generalization performance by optimizing the margin distribution explicitly and its superiority is verified both theoretically and empirically in this paper.
Abstract: Support Vector Machine (SVM) has always been one of the most successful learning algorithms, with the central idea of maximizing the minimum margin, i.e., the smallest distance from the instances to the classification boundary. However, recent theoretical results disclosed that maximizing the minimum margin does not necessarily lead to better generalization performance, and instead, the margin distribution has been proven to be more crucial. Based on this idea, we propose the Optimal margin Distribution Machine (ODM), which can achieve a better generalization performance by optimizing the margin distribution explicitly. We characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance. The proposed method is a general learning approach which can be applied in any place where SVMs are used, and its superiority is verified both theoretically and empirically in this paper.
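
To make the contrast between the minimum margin and the margin distribution concrete, the sketch below computes all three statistics for a linear classifier. It assumes the functional margin $y_i (w \cdot x_i + b)$ with labels in $\{-1, +1\}$; the function name `margin_statistics` is hypothetical, and the code does not implement the ODM optimization itself.

```python
def margin_statistics(w, b, xs, ys):
    """Return (minimum, mean, variance) of the margins y_i * (w . x_i + b)."""
    margins = [y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
               for x, y in zip(xs, ys)]
    mean = sum(margins) / len(margins)
    variance = sum((m - mean) ** 2 for m in margins) / len(margins)
    return min(margins), mean, variance
```

An SVM-style objective looks only at the first returned value, whereas a margin-distribution objective of the kind described in the abstract would use the mean and variance.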

46 citations


Posted Content
TL;DR: Novel online algorithms are proposed that are capable of leveraging smoothness and replace the dependence on $T$ in the dynamic regret by problem-dependent quantities: the variation in gradients of loss functions, and the cumulative loss of the comparator sequence.
Abstract: We investigate online convex optimization in non-stationary environments and choose the dynamic regret as the performance measure, defined as the difference between the cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length that essentially reflects the non-stationarity of environments; the state-of-the-art dynamic regret is $\mathcal{O}(\sqrt{T(1+P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper, we demonstrate that it is possible to further enhance the dynamic regret by exploiting the smoothness condition. Specifically, we propose novel online algorithms that are capable of leveraging smoothness and replace the dependence on $T$ in the dynamic regret by problem-dependent quantities: the variation in gradients of loss functions, and the cumulative loss of the comparator sequence. These quantities are at most $\mathcal{O}(T)$ but can be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and meanwhile guarantee the same rate in the worst case.
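
The two quantities named in the abstract can be written down directly. The sketch below assumes the commonly used definition of path-length, $P_T = \sum_{t=2}^{T} \lVert u_{t-1} - u_t \rVert_2$, which the abstract itself does not spell out; the function names are hypothetical.

```python
import math


def path_length(comparators):
    """Sum of Euclidean distances between consecutive comparators u_1, ..., u_T."""
    return sum(math.dist(prev, cur)
               for prev, cur in zip(comparators, comparators[1:]))


def dynamic_regret(losses, decisions, comparators):
    """Cumulative loss of the decisions minus that of the comparator sequence."""
    return sum(f(x) - f(u) for f, x, u in zip(losses, decisions, comparators))
```

A static comparator sequence gives `path_length(...) == 0`, in which case dynamic regret reduces to the usual static regret against that single point.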

42 citations


Journal Article
TL;DR: This work proposes a novel and effective approach to handle concept drift via model reuse, that is, reusing models trained on previous data to tackle the changes in nature.
Abstract: In many real-world applications, data are often collected in the form of a stream, and thus the distribution usually changes in nature, which is referred to as concept drift in the literature. We propose a novel and effective approach to handle concept drift via model reuse, that is, reusing models trained on previous data to tackle the changes. Each model is associated with a weight representing its reusability towards current data, and the weight is adaptively adjusted according to the performance of the model. We provide both generalization and regret analysis to justify the superiority of our approach. Experimental results also validate its efficacy on both synthetic and real-world datasets.
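
One generic way to realize "a weight per model, adjusted according to performance" is a multiplicative-weights update. The sketch below is such a scheme under stated assumptions: the exponential update, the learning rate `eta`, and both function names are illustrative choices, not the update rule from the paper.

```python
import math


def update_weights(weights, losses, eta=1.0):
    """Down-weight each reused model exponentially in its latest loss, then renormalize."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def weighted_prediction(weights, predictions):
    """Combine the models' real-valued predictions using the current weights."""
    return sum(w * p for w, p in zip(weights, predictions))
```

Models trained on data that resembles the current distribution incur smaller losses and therefore keep more weight, while models fitted to an outdated concept fade out.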

31 citations


Proceedings Article
23 Aug 2020
TL;DR: It is revealed for the first time that an effective kernel based anomaly detector based on kernel mean embedding must employ a characteristic kernel which is data dependent, and IDK, which is based on a data dependent point kernel is demonstrated.
Abstract: We introduce Isolation Distributional Kernel as a new way to measure the similarity between two distributions. Existing approaches based on kernel mean embedding, which converts a point kernel to a distributional kernel, have two key issues: the point kernel employed has a feature map with intractable dimensionality; and it is data independent. This paper shows that Isolation Distributional Kernel (IDK), which is based on a data dependent point kernel, addresses both key issues. We demonstrate IDK's efficacy and efficiency as a new tool for kernel based anomaly detection. Without explicit learning, using IDK alone outperforms existing kernel based anomaly detector OCSVM and other kernel mean embedding methods that rely on Gaussian kernel. We reveal for the first time that an effective kernel based anomaly detector based on kernel mean embedding must employ a characteristic kernel which is data dependent.
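
For context, the kernel mean embedding baseline that the abstract contrasts with IDK can be sketched in a few lines. The code below uses a Gaussian point kernel, which is data independent; it is not an implementation of IDK, and the function names and the bandwidth `sigma` are assumptions.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Data-independent Gaussian point kernel."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))


def mean_embedding_similarity(xs, ys, kernel=gaussian_kernel):
    """Inner product of the empirical kernel mean embeddings of two samples."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))
```

Swapping `kernel` for a data-dependent point kernel would change only the argument passed to `mean_embedding_similarity`; constructing such a kernel is the subject of the paper and is not attempted here.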

26 citations


Journal Article
03 Apr 2020
TL;DR: A novel model named CG-CNN is proposed, which is a multi-instance learning framework that enhances the unified features for bug localization by exploiting structural and sequential nature from the control flow graph.
Abstract: During software maintenance, a bug report is an effective way to identify potential bugs hidden in a software system. It is a great challenge to automatically locate the potentially buggy source code according to a bug report. Traditional approaches usually represent bug reports and source code from a lexical perspective to measure their similarities. Recently, some deep learning models have been proposed to learn unified features by exploiting the local and sequential nature, which overcomes the difficulty in modeling the difference between natural and programming languages. However, considering only local and sequential information from one dimension is not enough to represent the semantics; some multi-dimensional information, such as the structural and functional nature that carries additional semantics, has not been well captured. Such information beyond the lexical and structural terms is vital in modeling program functionalities and behaviors, leading to a better representation for identifying buggy source code. In this paper, we propose a novel model named CG-CNN, a multi-instance learning framework that enhances the unified features for bug localization by exploiting the structural and sequential nature of the control flow graph. Experimental results on widely used software projects demonstrate the effectiveness of our proposed CG-CNN model.

21 citations


Journal Article
TL;DR: This paper presents a new criterion, PRO Loss, concerning the prediction of all labels as well as the ranking of only relevant labels, and proposes ProSVM which optimizes PRO Loss efficiently using alternating direction method of multipliers.
Abstract: Multi-label learning methods assign multiple labels to one object. In practice, in addition to differentiating relevant labels from irrelevant ones, it is often desired to rank relevant labels for an object, whereas the ranking of irrelevant labels is not important. Thus, we require an algorithm to do classification and ranking of relevant labels simultaneously. Such a requirement, however, cannot be met because most existing methods were designed to optimize existing criteria, yet there is no criterion which encodes the aforementioned requirement. In this paper, we present a new criterion, PRO Loss, concerning the prediction of all labels as well as the ranking of only relevant labels. We then propose ProSVM which optimizes PRO Loss efficiently using alternating direction method of multipliers. We further improve its efficiency with an upper approximation that reduces the number of constraints from $O(T^2)$ to $O(T)$, where $T$ is the number of labels. We then notice that in real applications, it is difficult to get full supervised information for multi-label data. To make the proposed algorithm more robust to supervised information, we adapt ProSVM to deal with the multi-label learning with partial labels problem. Experiments show that our proposal is not only superior on PRO Loss, but also highly competitive on existing evaluation criteria.

21 citations


Proceedings Article
01 Jan 2020
TL;DR: This work presents the first finite-sample rate $O(n^{-1/(8d+2)})$ on the convergence of pure random forests for classification, which can be improved to $O(n^{-1/(3.87d+2)})$ by considering the midpoint splitting mechanism.
Abstract: Random forests have been one of the successful ensemble algorithms in machine learning. The basic idea is to construct a large number of random trees individually and make predictions based on an average of their predictions. The great successes have attracted much attention to the consistency of random forests, mostly focusing on regression. This work takes one step towards convergence rates of random forests for classification. We present the first finite-sample rate $O(n^{-1/(8d+2)})$ on the convergence of pure random forests for classification, which can be improved to $O(n^{-1/(3.87d+2)})$ by considering the midpoint splitting mechanism. We introduce another variant of random forests, which follows Breiman's original random forests but with different mechanisms on splitting dimensions and positions. We get a convergence rate $O(n^{-1/(d+2)}(\ln n)^{1/(d+2)})$ for the variant of random forests, which reaches the minimax rate, except for a factor $(\ln n)$, of the optimal plug-in classifier under the $L$-Lipschitz assumption. We achieve a tighter convergence rate $O(\sqrt{\ln n/n})$ under proper assumptions over structural data.

20 citations


Journal Article
TL;DR: This paper proposes the multi-label Optimal margin Distribution Machine (mlODM), which optimizes the margin mean and variance of all label pairs efficiently and outperforms SVM-style multi-label methods.
Abstract: Multi-label support vector machine (Rank-SVM) is a classic and effective algorithm for multi-label classification. The pivotal idea is to maximize the minimum margin of label pairs, which is extended from SVM. However, recent studies disclosed that maximizing the minimum margin does not necessarily lead to better generalization performance, and instead, it is more crucial to optimize the margin distribution. Inspired by this idea, in this paper, we first introduce margin distribution to multi-label learning and propose multi-label Optimal margin Distribution Machine (mlODM), which optimizes the margin mean and variance of all label pairs efficiently. Extensive experiments in multiple multi-label evaluation metrics illustrate that mlODM outperforms SVM-style multi-label methods. Moreover, empirical study presents the best margin distribution and verifies the fast convergence of our method.

Proceedings Article
01 Nov 2020
TL;DR: In this article, the Semi-Supervised ABductive Learning (SS-ABL) framework is proposed to combine symbolic logical representation and numerical model optimization effectively.
Abstract: In many practical tasks, there are usually two kinds of common information: cheap unlabeled data and domain knowledge in the form of symbols. There are some attempts using one single information source, such as semi-supervised learning and abductive learning. However, there is little work that uses these two kinds of information sources at the same time, because it is very difficult to combine symbolic logical representation and numerical model optimization effectively. The learning becomes even more challenging when the domain knowledge is insufficient. In this paper, we present an attempt: the Semi-Supervised ABductive Learning (SS-ABL) framework. In this framework, semi-supervised learning is trained via pseudo labels of unlabeled data generated by abductive learning, and the background knowledge is refined via the label distribution predicted by semi-supervised learning. The above framework can be optimized iteratively and is naturally interpretable. The effectiveness of our framework has been verified on judicial sentencing for theft cases using real legal documents. In the case of missing sentencing elements and mixed legal rules, our framework is clearly superior to many existing baseline practices, and provides explanatory assistance to judicial sentencing.

Proceedings Article
03 Jun 2020
TL;DR: A novel algorithm is proposed that achieves dynamic regret and optimal results for BCO in non-stationary environments and does not require prior knowledge of the path-length $P_T$ ahead of time, which is generally unknown.
Abstract: Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence that reflects the non-stationarity of environments. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the $\Omega(T^{1/2}(1+P_T)^{1/2})$ lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length $P_T$ ahead of time, which is generally unknown.
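
The one-point and two-point feedback models mentioned above are usually paired with randomized gradient estimators built from function values alone. The sketch below shows the standard spherical-smoothing estimators as a hedged illustration; the function names, the exploration radius `delta`, and the use of `random.Random` are assumptions, and the paper's full algorithm (including how it adapts to an unknown $P_T$) is not reproduced.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_point_gradient(f, x, delta, rng):
    """Estimate the gradient of f at x from the two queries f(x + delta*u) and f(x - delta*u)."""
    d = len(x)
    u = random_unit_vector(d, rng)
    plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d * (plus - minus) / (2.0 * delta)
    return [scale * ui for ui in u]


def one_point_gradient(f, x, delta, rng):
    """Estimate the gradient of f at x from the single query f(x + delta*u)."""
    d = len(x)
    u = random_unit_vector(d, rng)
    scale = d * f([xi + delta * ui for xi, ui in zip(x, u)]) / delta
    return [scale * ui for ui in u]
```

The one-point estimator has much larger variance because nothing cancels the function value itself, which is consistent with the weaker $T^{3/4}$ rate quoted for that feedback model.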

Proceedings Article
12 Jul 2020
TL;DR: A novel discrepancy measure for data with evolving feature space and data distribution, named the evolving discrepancy is proposed, and the theory motivates the design of a learning algorithm which is further implemented by deep neural networks.
Abstract: In many real-world applications, data are collected in the form of a stream, whose feature space can evolve over time. For instance, in the environmental monitoring task, features can be dynamically vanished or augmented due to the existence of expired old sensors and deployed new sensors. Furthermore, besides the evolvable feature space, the data distribution is usually changing in the streaming scenario. When both feature space and data distribution are evolvable, it is quite challenging to design algorithms with guarantees, particularly theoretical understandings of generalization ability. To address this difficulty, we propose a novel discrepancy measure for data with evolving feature space and data distribution, named the evolving discrepancy. Based on that, we present the generalization error analysis, and the theory motivates the design of a learning algorithm which is further implemented by deep neural networks. Empirical studies on synthetic data verify the rationale of our proposed discrepancy measure, and extensive experiments on real-world tasks validate the effectiveness of our algorithm.

Proceedings Article
27 Aug 2020
TL;DR: This paper designs a simple algorithm based on the online ensemble, which provably enjoys the same (even slightly stronger) guarantee as the state-of-the-art rate, yet is much more efficient because the algorithm does not involve any nonconvex problem solving.
Abstract: Online learning in dynamic environments has recently drawn considerable attention, where dynamic regret is usually employed to compare decisions of online algorithms to dynamic comparators. In previous works, dynamic regret bounds are typically established in terms of the regularity of comparators $C_T$ or that of online functions $V_T$. Recently, Jadbabaie et al. [2015] proposed an algorithm that can take advantage of both regularities and enjoys an $\tilde{O}(\sqrt{1+D_T} + \min\{\sqrt{(1+D_T)C_T}, (1+D_T)^{1/3} V_T^{1/3} T^{1/3}\})$ dynamic regret, where $D_T$ is an additional quantity to measure the niceness of environments. The regret bound adapts to the smaller regularity of problem environments and is tighter than all existing dynamic regret guarantees. Nevertheless, their algorithm involves non-convex programming at each iteration, and thus requires burdensome computations. In this paper, we design a simple algorithm based on the online ensemble, which provably enjoys the same (even slightly stronger) guarantee as the state-of-the-art rate, yet is much more efficient because our algorithm does not involve any nonconvex problem solving. Empirical studies also verify the efficacy and efficiency.

Posted Content
TL;DR: The exploratory machine learning is proposed, which examines and investigates the training dataset by actively augmenting the feature space to discover potentially unknown labels.
Abstract: In conventional supervised learning, a training dataset is given with ground-truth labels from a known label set, and the learned model will classify unseen instances to the known labels. In this paper, we study a new problem setting in which there are unknown classes in the training dataset misperceived as other labels, and thus their existence appears unknown from the given supervision. We attribute the unknown unknowns to the fact that the training dataset is badly advised by the incompletely perceived label space due to the insufficient feature information. To this end, we propose the exploratory machine learning, which examines and investigates the training dataset by actively augmenting the feature space to discover potentially unknown labels. Our approach consists of three ingredients including rejection model, feature acquisition, and model cascade. The effectiveness is validated on both synthetic and real datasets.

Journal Article
TL;DR: In this article, a simple and effective approach with three main strategies for efficient learning of deep forest is proposed, which substantially reduces the number of instances that need to be processed through redirecting instances having high predictive confidence straight to the final level for prediction, bypassing all the intermediate levels.
Abstract: Most studies about deep learning are based on neural network models, where many layers of parameterized nonlinear differentiable modules are trained by backpropagation. Recently, it has been shown that deep learning can also be realized by non-differentiable modules without backpropagation training, an approach called deep forest. We identify that deep forest has high time costs and memory requirements, which has inhibited its use on large-scale datasets. In this paper, we propose a simple and effective approach with three main strategies for efficient learning of deep forest. First, it substantially reduces the number of instances that need to be processed by redirecting instances with high predictive confidence straight to the final level for prediction, bypassing all the intermediate levels. Second, many non-informative features are screened out, and only the informative ones are used for learning at each level. Third, an unsupervised feature transformation procedure is proposed to replace the supervised multi-grained scanning procedure. Our theoretical analysis supports the proposed approach in varying the model complexity from low to high as the number of levels increases in deep forest. Experiments show that our approach achieves highly competitive predictive performance while reducing time cost and memory requirement by one to two orders of magnitude.
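
The first strategy, routing confident instances past the intermediate levels, amounts to a simple partition at each level. The sketch below illustrates that partition only; the function name `split_by_confidence` and the fixed `threshold` are assumptions, and how the paper actually sets the threshold is not stated in the abstract.

```python
def split_by_confidence(probabilities, threshold):
    """Partition instance indices by whether the top class probability reaches the threshold.

    Returns (confident, uncertain): confident instances can be sent straight to the
    final level, while uncertain ones continue to the next level.
    """
    confident, uncertain = [], []
    for idx, probs in enumerate(probabilities):
        (confident if max(probs) >= threshold else uncertain).append(idx)
    return confident, uncertain
```

Only the `uncertain` indices need to be processed by later levels, which is where the savings in time and memory would come from.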

Journal Article
03 Apr 2020
TL;DR: This paper proposes Private-Public Stochastic Gradient Descent, which utilizes such public information to adjust parameters in differentially private stochastic gradient descent and fine-tunes the final result with model reuse.
Abstract: Differentially private learning tackles tasks where the data are private and the learning process is subject to differential privacy requirements. In real applications, however, some public data are generally available in addition to private data, and it is interesting to consider how to exploit them. In this paper, we study a common situation where a small amount of public data can be used when solving the Empirical Risk Minimization problem over a private database. Specifically, we propose Private-Public Stochastic Gradient Descent, which utilizes such public information to adjust parameters in differentially private stochastic gradient descent and fine-tunes the final result with model reuse. Our method keeps differential privacy for the private database, and empirical study validates its superiority compared with existing approaches.
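
For readers unfamiliar with the base procedure, one step of generic differentially private SGD (per-example gradient clipping followed by Gaussian noise) can be sketched as follows. This is the standard private-only building block, not the Private-Public variant proposed in the paper; the function name and all parameter names (`clip_norm`, `noise_multiplier`, `lr`) are assumptions, and which of them the paper adjusts using public data is not specified in the abstract.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One noisy gradient step: clip each example's gradient, sum, add noise, average."""
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            summed[j] += factor * grad[j]
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    n = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / n for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

The privacy guarantee of a full training run depends on the noise multiplier, the sampling scheme, and the number of steps, none of which this single-step sketch accounts for.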

Proceedings Article
01 Feb 2020
TL;DR: In this article, the authors propose exploratory machine learning, a paradigm in which, once a user encounters unsatisfactory learning performance, she can examine the possibility of unknown unknowns and, if they really exist, deploy the optimal strategy of feature space augmentation to make the unknown classes observable and learnable.
Abstract: In conventional supervised learning, a training dataset is given with ground-truth labels from a known label set, and the learned model will classify unseen instances to known labels. In real situations, when the learned models do not work well, users generally attribute the failure to the inadequate selection of learning algorithms or the lack of enough labeled training samples. In this paper, we point out that there is an important category of failure, which owes to the fact that there are unknown classes in the training data misperceived as other labels, and thus their existence was unknown from the given supervision. Such problems of unknown unknown classes can hardly be addressed by common re-selection of algorithms or accumulation of training samples. For this purpose, we propose exploratory machine learning, a paradigm in which, once a user encounters unsatisfactory learning performance, she can examine the possibility of unknown unknowns and, if they really exist, deploy the optimal strategy of feature space augmentation to make the unknown classes observable and learnable. Theoretical analysis and empirical study on both synthetic and real datasets validate the efficacy of our proposal.

Posted Content
TL;DR: This work proposes the soft Gradient Boosting Machine (sGBM) by wiring multiple differentiable base learners together; by injecting both local and global objectives inspired by gradient boosting, all base learners can then be jointly optimized with linear speed-up.
Abstract: Gradient Boosting Machine has proven to be a successful function approximator and has been widely used in a variety of areas. However, since the base learners have to be trained in sequential order, it is infeasible to parallelize the training process among base learners for speed-up. In addition, under online or incremental learning settings, GBMs achieve sub-optimal performance because previously trained base learners cannot adapt to the environment once trained. In this work, we propose the soft Gradient Boosting Machine (sGBM) by wiring multiple differentiable base learners together; by injecting both local and global objectives inspired by gradient boosting, all base learners can then be jointly optimized with linear speed-up. When using differentiable soft decision trees as base learners, such a device can be regarded as an alternative version of the (hard) gradient boosting decision trees with extra benefits. Experimental results show that sGBM enjoys much higher time efficiency with better accuracy, given the same base learner, in both online and offline settings.

Posted Content
TL;DR: This study provides an alternative basic building block in neural networks and exhibits the feasibility of developing artificial neural networks with neuronal plasticity.
Abstract: Current neural networks are mostly built upon the MP model, which usually formulates the neuron as executing an activation function on the real-valued weighted aggregation of signals received from other neurons. In this paper, we propose the Flexible Transmitter (FT) model, a novel bio-plausible neuron model with flexible synaptic plasticity. The FT model employs a pair of parameters to model the transmitters between neurons and puts up a neuron-exclusive variable to record the regulated neurotrophin density, which leads to the formulation of the FT model as a two-variable two-valued function, taking the commonly-used MP neuron model as its special case. This modeling manner makes the FT model not only biologically more realistic, but also capable of handling complicated data, even time series. To exhibit its power and potential, we present the Flexible Transmitter Network (FTNet), which is built on the most common fully-connected feed-forward architecture taking the FT model as the basic building block. FTNet allows gradient calculation and can be implemented by an improved back-propagation algorithm in the complex-valued domain. Experiments on a broad range of tasks show the superiority of the proposed FTNet. This study provides an alternative basic building block in neural networks and exhibits the feasibility of developing artificial neural networks with neuronal plasticity.

Proceedings Article
01 Jan 2020
TL;DR: In this paper, a measure-aware feature reuse mechanism is proposed to reuse the good representation in the previous layer guided by confidence, and a measure-aware layer growth mechanism is designed to gradually increase the model complexity according to the performance measure.
Abstract: In multi-label learning, each instance is associated with multiple labels and the crucial task is how to leverage label correlations in building models. Deep neural network methods usually jointly embed the feature and label information into a latent space to exploit label correlations. However, the success of these methods highly depends on the precise choice of model depth. Deep forest is a recent deep learning framework based on tree model ensembles, which does not rely on backpropagation. We consider the advantages of deep forest models to be well suited to solving multi-label problems. Therefore, we design the Multi-Label Deep Forest (MLDF) method with two mechanisms: measure-aware feature reuse and measure-aware layer growth. The measure-aware feature reuse mechanism reuses the good representation in the previous layer guided by confidence. The measure-aware layer growth mechanism ensures that MLDF gradually increases the model complexity according to the performance measure. MLDF handles two challenging problems at the same time: one is restricting the model complexity to ease the overfitting issue; the other is optimizing the performance measure on the user's demand, since there are many different measures in multi-label evaluation. Experiments show that our proposal not only beats the compared methods over six measures on benchmark datasets but also enjoys label correlation discovery and other desired properties in multi-label learning.

Proceedings Article
01 Jan 2020
TL;DR: It is proved that the stochastic process led by HRP under weak dependence condition is predictive PAC learnable and the model is able to deal with irregular nonstationary signals.
Abstract: In this paper, we propose the Harmonic Recurrent Process (HRP) for forecasting non-stationary time series with period-varying patterns. HRP works by selectively ensembling recurrent period-varying patterns in harmonic analysis. In contrast to classical forecasting approaches that rely on stationary priors and recurrent neural network approaches that are mostly black boxes, our model is able to deal with irregular nonstationary signals, and its working mechanism is reasonably lucid. We also prove that the stochastic process led by HRP under weak dependence condition is predictive PAC learnable. Comprehensive experiments on simulated and practical tasks validate the effectiveness of HRP.

Posted Content
TL;DR: Storage-Fit Feature-Evolvable streaming Learning (SF2EL) which incorporates the issue of rarely-provided labels into feature evolution and can preserve the merit of the original feature evolvable learning i.e., can always track the best baseline and thus perform well at any time step.
Abstract: Feature evolvable learning has been widely studied in recent years, where old features vanish and new features emerge when learning with streams. Conventional methods usually assume that a label will be revealed after prediction at each time step. However, in practice, this assumption may not hold, and no label is given at most time steps. A good solution is to leverage the technique of manifold regularization to utilize previous similar data to assist the refinement of the online model. Nevertheless, this approach needs to store all previous data, which is impossible when learning with streams that arrive sequentially in large volume. Thus we need a buffer to store part of them. Considering that different devices may have different storage budgets, the learning approaches should be flexible subject to the storage budget limit. In this paper, we propose a new setting: Storage-Fit Feature-Evolvable streaming Learning (SF$^2$EL), which incorporates the issue of rarely-provided labels into feature evolution. Our framework is able to fit its behavior to different storage budgets when learning with feature evolvable streams with unlabeled data. Besides, both theoretical and empirical results validate that our approach can preserve the merit of the original feature evolvable learning, i.e., it can always track the best baseline and thus perform well at any time step.

Proceedings Article
01 Jan 2020
TL;DR: In this paper, the authors propose novel online algorithms that are capable of leveraging smoothness and replace the dependence on $T$ in the dynamic regret by problem-dependent quantities: the variation in gradients of loss functions, the cumulative loss of the comparator sequence, and the minimum of the previous two terms.
Abstract: We investigate online convex optimization in non-stationary environments and choose the dynamic regret as the performance measure, defined as the difference between the cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length that essentially reflects the non-stationarity of environments; the state-of-the-art dynamic regret is $\mathcal{O}(\sqrt{T(1+P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper, we demonstrate that it is possible to further enhance the dynamic regret by exploiting the smoothness condition. Specifically, we propose novel online algorithms that are capable of leveraging smoothness and replace the dependence on $T$ in the dynamic regret by problem-dependent quantities: the variation in gradients of loss functions, the cumulative loss of the comparator sequence, and the minimum of the previous two terms. These quantities are at most $\mathcal{O}(T)$ but can be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and meanwhile guarantee the same rate in the worst case.
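For reference, the dynamic regret and path-length described in this abstract are commonly written as follows. This is a sketch in standard notation; the symbols $x_t$ (the learner's decision), $f_t$ (the loss at round $t$), and $u_t$ (the comparator at round $t$) are assumptions of this note and do not appear in the abstract itself.

```latex
\[
  \text{D-Regret}_T(u_1,\dots,u_T)
    = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t),
  \qquad
  P_T = \sum_{t=2}^{T} \lVert u_{t-1} - u_t \rVert_2 .
\]
```

With a fixed comparator ($u_1 = \dots = u_T$) the path-length is $P_T = 0$, and the $\mathcal{O}(\sqrt{T(1+P_T)})$ bound reduces to the familiar $\mathcal{O}(\sqrt{T})$ static regret rate.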

Posted Content
TL;DR: In this article, a two-phase framework is presented to find models that are helpful for the current application, without accessing the raw training data for the models in the pool, and the relatedness of the current task and pre-trained models will be measured based on the value of the RKME specification.
Abstract: Given a publicly available pool of machine learning models constructed for various tasks, when a user plans to build a model for her own machine learning application, is it possible to build upon models in the pool such that the previous efforts on these existing models can be reused rather than starting from scratch? Here, a grand challenge is how to find models that are helpful for the current application, without accessing the raw training data for the models in the pool. In this paper, we present a two-phase framework. In the upload phase, when a model is uploaded into the pool, we construct a reduced kernel mean embedding (RKME) as a specification for the model. Then in the deployment phase, the relatedness of the current task and pre-trained models will be measured based on the value of the RKME specification. Theoretical results and extensive experiments validate the effectiveness of our approach.
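To illustrate the idea of measuring task relatedness through kernel mean embeddings, here is a minimal Python sketch. It is not the RKME construction from the paper: it compares plain empirical mean embeddings of two samples under a Gaussian kernel (the squared maximum mean discrepancy), and the function names, the `gamma` bandwidth, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, gamma=0.5):
    """Pairwise k(x, y) = exp(-gamma * ||x - y||^2) for the rows of a and b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd_squared(x, y, gamma=0.5):
    """Squared RKHS distance between the empirical mean embeddings of x and y
    (the biased estimate, which keeps the diagonal kernel terms)."""
    return (gaussian_kernel(x, x, gamma).mean()
            + gaussian_kernel(y, y, gamma).mean()
            - 2.0 * gaussian_kernel(x, y, gamma).mean())


# Hypothetical data: the user's task sample and two candidate training samples.
rng = np.random.default_rng(0)
task = rng.normal(0.0, 1.0, size=(200, 2))
near = rng.normal(0.1, 1.0, size=(200, 2))  # distribution close to the task
far = rng.normal(3.0, 1.0, size=(200, 2))   # distribution far from the task

d_near = mmd_squared(task, near)
d_far = mmd_squared(task, far)
print(d_near < d_far)
```

A smaller distance suggests that a candidate's training distribution is closer to the current task. A reduced embedding would replace the full sample with a small weighted set of points, so that the raw data need not be shared.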

Proceedings Article
25 Sep 2020
TL;DR: This work proposes MoreBoost, a simple yet powerful boosting algorithm to achieve effective model reuse under the idealized assumption that the reusability indicators are noise-free, and strengthens MoreBoost with an active rectification mechanism, allowing the learner to query ground-truth indicator values from the model providers actively.
Abstract: We study the following model reuse problem: a learner needs to select a subset of models from a model pool to classify an unlabeled dataset without accessing the raw training data of the models. Under this situation, it is challenging to properly estimate the reusability of the models in the pool. In this work, we consider the model reuse protocol under which the learner receives specifications of the models, including reusability indicators to verify the models’ prediction accuracy on any unlabeled instances. We propose MoreBoost, a simple yet powerful boosting algorithm to achieve effective model reuse under the idealized assumption that the reusability indicators are noise-free. When the reusability indicators are noisy, we strengthen MoreBoost with an active rectification mechanism, allowing the learner to query ground-truth indicator values from the model providers actively. The resulting MoreBoost.AR algorithm is guaranteed to significantly reduce the prediction error caused by the indicator noise. We also conduct experiments on both synthetic and benchmark datasets to verify the performance of the proposed approaches.

Posted Content
TL;DR: This paper shows for the first time that an effective kernel based anomaly detector based on kernel mean embedding must employ a characteristic kernel which is data dependent, and introduces an IDK based detector called IDK$^2$, which runs orders of magnitude faster than group anomaly detector OCSMM.
Abstract: We introduce Isolation Distributional Kernel as a new way to measure the similarity between two distributions. Existing approaches based on kernel mean embedding, which convert a point kernel to a distributional kernel, have two key issues: the point kernel employed has a feature map with intractable dimensionality; and it is data independent. This paper shows that Isolation Distributional Kernel (IDK), which is based on a data dependent point kernel, addresses both key issues. We demonstrate IDK's efficacy and efficiency as a new tool for kernel based anomaly detection for both point and group anomalies. Without explicit learning, using IDK alone outperforms existing kernel based point anomaly detector OCSVM and other kernel mean embedding methods that rely on Gaussian kernel. For group anomaly detection, we introduce an IDK based detector called IDK$^2$. It reformulates the problem of group anomaly detection in input space into the problem of point anomaly detection in Hilbert space, without the need for learning. IDK$^2$ runs orders of magnitude faster than group anomaly detector OCSMM. We reveal for the first time that an effective kernel based anomaly detector based on kernel mean embedding must employ a characteristic kernel which is data dependent.

Proceedings Article
21 Nov 2020
TL;DR: This paper proposes a novel approach which is able to cost-effectively identify the causal effects, by an active strategy introducing limited interventions, and thus guide decision-making; theoretical analysis and empirical studies validate the effectiveness of the proposed approach.
Abstract: In many real tasks, we care about how to make decisions rather than mere predictions on an event, e.g. how to increase the revenue next month instead of merely knowing it will drop. The key is to identify the causal effects on the desired event. It is achievable with do-calculus if the causal structure is known; however, in many real tasks it is not easy to infer the whole causal structure with the observational data. Introducing external interventions is needed to achieve it. In this paper, we study the situation where only the response variable is observable under intervention. We propose a novel approach which is able to cost-effectively identify the causal effects, by an active strategy introducing limited interventions, and thus guide decision-making. Theoretical analysis and empirical studies validate the effectiveness of the proposed approach.

Posted Content
TL;DR: In this paper, an evaluator-generator framework was proposed for learning-to-rank (LTR) with item context, which consists of an evaluator that generalizes to evaluate recommendations involving the context, and a generator that maximizes the evaluator score by reinforcement learning.
Abstract: Learning-to-rank (LTR) has become a key technology in E-commerce applications. Most existing LTR approaches follow a supervised learning paradigm from offline labeled data collected from the online system. However, it has been noticed that previous LTR models can have a good validation performance over offline validation data but have a poor online performance, and vice versa, which implies a possible large inconsistency between the offline and online evaluation. We investigate and confirm in this paper that such inconsistency exists and can have a significant impact on AliExpress Search. Reasons for the inconsistency include ignoring item context during learning and an offline data set that is insufficient for learning the context. Therefore, this paper proposes an evaluator-generator framework for LTR with item context. The framework consists of an evaluator that generalizes to evaluate recommendations involving the context, a generator that maximizes the evaluator score by reinforcement learning, and a discriminator that ensures the generalization of the evaluator. Extensive experiments in simulation environments and the AliExpress Search online system show that, firstly, the classic data-based metrics on the offline dataset can show significant inconsistency with online performance, and can even be misleading. Secondly, the proposed evaluator score is significantly more consistent with the online performance than common ranking metrics. Finally, as a consequence, our method achieves a significant improvement (>2%) in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.