Classification under streaming emerging new classes: A solution using completely-random trees
- Authors: Mu, Xin, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2017
- Type: Text, Journal article
- Relation: IEEE Transactions on Knowledge and Data Engineering Vol. 29, no. 8 (2017), p. 1605-1618
- Full Text:
- Reviewed:
- Description: This paper investigates an important problem in stream mining, i.e., classification under streaming emerging new classes or SENC. The SENC problem can be decomposed into three subproblems: detecting emerging new classes, classifying known classes, and updating models to integrate each new class as part of known classes. The common approach is to treat it as a classification problem and solve it using either a supervised learner or a semi-supervised learner. We propose an alternative approach by using unsupervised learning as the basis to solve this problem. The proposed method employs completely-random trees which have been shown to work well in unsupervised learning and supervised learning independently in the literature. The completely-random trees are used as a single common core to solve all three subproblems: unsupervised learning, supervised learning, and model update on data streams. We show that the proposed unsupervised-learning-focused method often achieves significantly better outcomes than existing classification-focused methods.
Discover multiple novel labels in multi-instance multi-label learning
- Authors: Zhu, Yue, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2017
- Type: Text, Conference paper
- Relation: Thirty-First AAAI Conference on Artificial Intelligence, p. 2977-2984
- Full Text: false
- Reviewed:
- Description: Multi-instance multi-label learning (MIML) is a learning paradigm where an object is represented by a bag of instances and each bag is associated with multiple labels. The ordinary MIML setting assumes a fixed target label set. In real applications, however, multiple novel labels may exist outside this set, hidden in the training data and unknown to the MIML learner. Existing MIML approaches are unable to discover these hidden novel labels, let alone predict them in previously unseen test data. In this paper, we propose the first approach to discover multiple novel labels in the MIML problem, using an efficient augmented Lagrangian optimization that combines a bag-dependent loss term with a bag-independent clustering regularization term, enabling the known labels and multiple novel labels to be modeled simultaneously. The effectiveness of the proposed approach is validated in experiments.
Improving deep forest by confidence screening
- Authors: Pang, Ming, Ting, Kaiming, Zhao, Peng, Zhou, Zhi-Hua
- Date: 2018
- Type: Text, Conference proceedings
- Relation: 2018 IEEE International Conference on Data Mining; Singapore, Singapore; 17th-20th November 2018, p. 1194-1199
- Full Text:
- Reviewed:
- Description: Most studies of deep learning are based on neural network models, in which many layers of parameterized nonlinear differentiable modules are trained by backpropagation. Recently, it has been shown that deep learning can also be realized by non-differentiable modules trained without backpropagation, in a model called deep forest. Its representation learning process is based on a cascade of cascades of decision tree forests, whose high memory requirement and high time cost inhibit the training of large models. In this paper, we propose a simple yet effective approach to improve the efficiency of deep forest. The key idea is to pass instances with high prediction confidence directly to the final stage rather than through all the levels. We also provide a theoretical analysis suggesting a means to vary the model complexity from low to high as the level increases in the cascade, which further reduces the memory requirement and time cost. Our experiments show that the proposed approach achieves highly competitive predictive performance while reducing the time cost and memory requirement by up to one order of magnitude.
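The routing idea behind confidence screening can be sketched in a few lines (a hypothetical illustration, not the paper's implementation; `cascade_predict`, the level callables, and the 0.9 threshold are all assumptions). Each cascade level returns a prediction and a confidence, and an instance exits at the first level that is confident enough:

```python
# Hypothetical sketch of confidence screening in a cascade. Each level is a
# callable returning (predicted_label, confidence); instances whose
# confidence reaches the threshold exit the cascade early, skipping the
# remaining (more expensive) levels.

def cascade_predict(x, levels, threshold=0.9):
    """Pass x through successive levels until one is confident enough."""
    label, conf = None, 0.0
    for level in levels:
        label, conf = level(x)
        if conf >= threshold:   # screened out: bypass all remaining levels
            return label
    return label                # fell through every level: use last prediction

# Toy levels: a cheap, unsure model followed by a costly, confident one.
cheap  = lambda x: (int(x > 0), 0.6)
costly = lambda x: (int(x > 0), 0.95)

print(cascade_predict(2.0, [cheap, costly]))  # → 1
```

Only instances the cheap level is unsure about ever reach the costly level, which is the source of the reported time and memory savings.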
Isolation forest
- Authors: Liu, Fei, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2008
- Type: Text, Conference paper
- Relation: Proceedings of the Eighth IEEE International Conference on Data Mining, p. 413-422
- Full Text: false
- Reviewed:
- Description: Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiling normal points. To the best of our knowledge, the concept of isolation has not been explored in the current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, yielding an algorithm with linear time complexity, a low constant, and a low memory requirement. Our empirical evaluation shows that iForest compares favourably with ORCA (a near-linear time complexity distance-based method), LOF, and random forests in terms of AUC and processing time, especially on large data sets. iForest also works well in high-dimensional problems with a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies.
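The isolation mechanism is compact enough to sketch (a minimal one-dimensional illustration under assumed names, not the authors' implementation): completely random splits isolate the "few and different" points near the root of a tree, so their path lengths are short.

```python
import random

# Minimal sketch of isolation: grow a tree by splitting at a uniformly
# random cut point until each point is isolated (or a depth limit is hit),
# then use a point's path length as its score. Anomalies tend to be
# isolated close to the root, giving short paths.

def grow(points, depth=0, limit=10):
    """Build an isolation tree over a list of 1-D points."""
    if len(points) <= 1 or depth >= limit:
        return {"size": len(points)}                 # external node
    lo, hi = min(points), max(points)
    if lo == hi:
        return {"size": len(points)}
    cut = random.uniform(lo, hi)                     # completely random split
    return {"cut": cut,
            "left":  grow([p for p in points if p < cut], depth + 1, limit),
            "right": grow([p for p in points if p >= cut], depth + 1, limit)}

def path_length(x, node, depth=0):
    """Depth at which x reaches an external node."""
    if "cut" not in node:
        return depth
    branch = node["left"] if x < node["cut"] else node["right"]
    return path_length(x, branch, depth + 1)

random.seed(0)
data = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]             # 10.0 is the anomaly
trees = [grow(data) for _ in range(200)]
avg = lambda x: sum(path_length(x, t) for t in trees) / len(trees)
print(avg(10.0) < avg(1.0))                          # anomaly's path is shorter
```

Averaging over an ensemble of such trees is what lets the method work on small sub-samples: the anomaly is separated from the cluster by the very first cut in most trees.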
Isolation kernel and its effect on SVM
- Authors: Ting, Kaiming, Zhu, Yue, Zhou, Zhi-Hua
- Date: 2018
- Type: Text, Conference proceedings, Conference paper
- Relation: 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018; London, United Kingdom; 19th-23rd August 2018, p. 2329-2337
- Full Text: false
- Reviewed:
- Description: This paper investigates data-dependent kernels that are derived directly from data. This has been an outstanding issue for about two decades, and it has hampered the development of kernel-based methods. We introduce Isolation Kernel, which is solely dependent on the data distribution, requiring neither class information nor explicit learning to be a classifier. In contrast, existing data-dependent kernels rely heavily on class information and explicit learning to produce a classifier. We show that Isolation Kernel approximates well a data-independent kernel function, the Laplacian kernel, under a uniform density distribution. With this revelation, Isolation Kernel can be viewed as a data-dependent kernel that adapts a data-independent kernel to the structure of a dataset. We also provide a reason why the proposed data-dependent kernel enables SVM to improve its predictive accuracy. The key differences between the Random Forest kernel and Isolation Kernel are discussed to examine why the latter is the more successful tree-based kernel.
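The flavour of such a data-dependent similarity can be sketched as follows (a simplification for illustration only: the paper derives the kernel from isolation-tree partitions, whereas this sketch uses nearest-neighbour cells of small random subsamples, and all names are assumptions). The hallmark property is that the same inter-point distance yields higher similarity in a sparse region than in a dense one:

```python
import random

# Sketch of an isolation-style data-dependent similarity: the similarity of
# x and y is the probability, over random partitions of the space, that
# they fall into the same cell. Here a partition is the set of
# nearest-neighbour cells of a small random subsample of size psi.

def isolation_similarity(x, y, data, psi=4, t=200, seed=1):
    """Estimate the probability that x and y fall into the same cell."""
    rng = random.Random(seed)
    same = 0
    for _ in range(t):
        sample = rng.sample(data, psi)               # one random partition
        nearest = lambda p: min(sample, key=lambda s: abs(s - p))
        same += nearest(x) == nearest(y)
    return same / t

dense  = [i * 0.05 for i in range(11)]               # tightly packed points
sparse = [10.0, 11.0, 12.0, 13.0, 14.0]              # widely spaced points
data = dense + sparse

# Same inter-point distance (0.1), but the pair in the sparse region is
# judged more similar than the pair in the dense region.
print(isolation_similarity(12.2, 12.3, data) > isolation_similarity(0.2, 0.3, data))
```

This data dependence is exactly what a data-independent kernel such as the Laplacian kernel cannot provide, and it is the adaptation the abstract refers to.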
Isolation set-kernel and its application to multi-instance learning
- Authors: Xu, Bi-Cun, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2019
- Type: Text, Conference proceedings, Conference paper
- Relation: 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2019; Anchorage, United States; 4th-8th August 2019, p. 941-949
- Full Text: false
- Reviewed:
- Description: Set-level problems are as important as instance-level problems, and the core of solving them is how to measure the similarity between two sets. This paper investigates data-dependent kernels that are derived directly from data. We introduce Isolation Set-Kernel, which is solely dependent on the data distribution, requiring neither class information nor explicit learning. In contrast, most current set similarities do not depend on the underlying data distribution. We theoretically analyze the characteristics of Isolation Set-Kernel. Because the set-kernel has a finite feature map, we show that its computation can be sped up significantly. We apply Isolation Set-Kernel to Multi-Instance Learning (MIL) using an SVM classifier, and demonstrate that it outperforms other set-kernels and other solutions to the MIL problem.
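The speed-up a finite feature map enables can be illustrated as follows (a hypothetical sketch: `set_kernel` and the one-hot `phi` are illustrative assumptions, not the paper's kernel). A set kernel that averages instance-kernel values over all cross-set pairs factorises into a dot product of per-set mean feature vectors, so each set is summarised once instead of being recomputed for every pair:

```python
# A set kernel K(S, T) = mean over x in S, y in T of phi(x) . phi(y)
# equals dot(mean phi over S, mean phi over T), because the dot product is
# bilinear. One vector per set replaces |S| * |T| instance comparisons.

def set_kernel(S, T, phi):
    """Average pairwise kernel, computed via per-set mean feature maps."""
    mean = lambda vs: [sum(col) / len(vs) for col in zip(*vs)]
    ms, mt = mean([phi(x) for x in S]), mean([phi(y) for y in T])
    return sum(a * b for a, b in zip(ms, mt))

# Toy finite feature map: one-hot on which unit interval a value falls in.
phi = lambda x: [1.0 if i == int(x) else 0.0 for i in range(5)]

print(set_kernel([0.2, 0.4], [0.5, 1.5], phi))  # → 0.5
```

Half of the four cross-set pairs land in the same bin, so the average pairwise kernel value is 0.5, and the factorised computation agrees with the brute-force one.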
Isolation-based anomaly detection
- Authors: Liu, Fei, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2012
- Type: Text, Journal article
- Relation: ACM Transactions on Knowledge Discovery from Data Vol. 6, no. 1 (2012), p. 1-39
- Full Text: false
- Reviewed:
- Description: Anomalies are data points that are few and different. As a result of these properties, we show that anomalies are susceptible to a mechanism called isolation. This article proposes a method called Isolation Forest (iForest), which detects anomalies purely on the basis of isolation, without employing any distance or density measure, and is thus fundamentally different from all existing methods. As a result, iForest is able to exploit subsampling (i) to achieve a low linear time complexity and a small memory requirement, and (ii) to deal with the effects of swamping and masking effectively. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF, and Random Forests in terms of AUC and processing time, and is robust against masking and swamping effects. iForest also works well in high-dimensional problems containing a large number of irrelevant attributes, and when anomalies are not available in the training sample.
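The article's anomaly score is simple enough to reproduce: a point's average path length E[h(x)] over the trees is normalised by c(n), the expected path length of an unsuccessful binary-search-tree search over n points, giving s(x, n) = 2^(−E[h(x)]/c(n)). The function names below are illustrative.

```python
import math

# iForest anomaly score: scores close to 1 indicate anomalies, while a
# path length near c(n) itself gives a score of about 0.5 (indistinct).

EULER = 0.5772156649   # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    h = math.log(n - 1) + EULER                  # harmonic-number estimate
    return 2.0 * h - 2.0 * (n - 1) / n

def score(avg_path, n):
    """Anomaly score s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-avg_path / c(n))

# A path length equal to c(256) over a 256-point subsample scores exactly
# 0.5; a very short path (quickly isolated) scores close to 1.
print(round(score(c(256), 256), 2))  # → 0.5
```

The normalisation is what makes scores comparable across subsample sizes, so the method can fix a small subsample (e.g. 256 points) regardless of the data set size.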
Multi-label learning with emerging new labels
- Authors: Zhu, Yue, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2018
- Type: Text, Journal article
- Relation: IEEE Transactions on Knowledge and Data Engineering Vol. 30, no. 10 (2018), p. 1901-1914
- Full Text:
- Reviewed:
- Description: In a multi-label learning task, an object possesses multiple concepts, each represented by a class label. Previous studies on multi-label learning have focused on a fixed set of class labels, i.e., the class label set of the test data is the same as that of the training set. In many applications, however, the environment is dynamic and new concepts may emerge in a data stream. In order to maintain good predictive performance in this environment, a multi-label learning method must be able to detect and classify instances with emerging new labels. To this end, we propose a new approach called Multi-label learning with Emerging New Labels (MuENL). It has three functions: classify instances on the currently known labels, detect the emergence of a new label, and construct a new classifier for each new label that works collaboratively with the classifier for the known labels. In addition, we show that MuENL can be easily extended to handle sparse high-dimensional data streams by simply reducing the original dimensionality and then applying MuENL on the reduced space. Our empirical evaluation shows the effectiveness of MuENL on several benchmark datasets, and of its high-dimensional extension MuENLHD on the sparse Weibo dataset.
New class adaptation via instance generation in one-pass class incremental learning
- Authors: Zhu, Yue, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2017
- Type: Text, Conference paper
- Relation: 2017 IEEE International Conference on Data Mining (ICDM)
- Full Text: false
- Reviewed:
- Description: One-pass learning updates a model with only a single scan of the dataset, without storing historical data. Previous studies focus on classification tasks with a fixed class set, and perform poorly in an open, dynamic environment where new classes emerge in a data stream. Performance degrades because the classifier needs to receive a sufficient number of instances from the new classes to establish a good model, which can take a long period of time. In order to shorten this period and handle the any-time prediction task, we introduce a framework for handling emerging new classes called One-Pass Class Incremental Learning (OPCIL). The central issue in OPCIL is how to effectively adapt a classifier of existing classes to incorporate emerging new classes. We call this the new class adaptation issue, and propose a new approach to address it which requires only one instance of each new class. The key is to generate pseudo instances that are optimized to satisfy properties that produce a good discriminative classifier. We provide the necessary properties and optimization procedures required to address this issue. Experiments validate the effectiveness of this approach.
On detecting clustered anomalies using SCiForest
- Authors: Liu, Fei, Ting, Kaiming, Zhou, Zhi-Hua
- Date: 2010
- Type: Text, Conference paper
- Relation: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), p. 274-290
- Full Text: false
- Reviewed:
Spectrum of Variable-Random trees
- Authors: Liu, Fei, Ting, Kaiming, Yu, Yang, Zhou, Zhi-Hua
- Date: 2008
- Type: Text, Journal article
- Relation: The Journal of Artificial Intelligence Research Vol. 32 (2008), p. 355-384
- Full Text: false
- Reviewed:
- Description: In this paper, we show that a continuous spectrum of randomisation exists, in which most existing tree randomisations operate only around the two ends of the spectrum, leaving a huge part of it largely unexplored. We propose a base learner, VR-Tree, which generates trees with variable randomness. VR-Trees can span from conventional deterministic trees to complete-random trees by means of a probabilistic parameter. Using VR-Trees as the base models, we explore the entire spectrum of randomised ensembles, together with Bagging and Random Subspace. We discover that the two halves of the spectrum have distinct characteristics, and understanding them allows us to propose a new approach to building better decision tree ensembles. We name this approach Coalescence; it coalesces a number of points in the random half of the spectrum, acting as a committee of "experts" to cater for unforeseeable conditions presented in training data. Coalescence is found to perform better than any single operating point in the spectrum, without the need to tune to a specific level of randomness. In our empirical study, Coalescence ranks top among the benchmark ensemble methods, including Random Forests, Random Subspace, and C5 Boosting; and only Coalescence is significantly better than Bagging and Max-Diverse Ensemble among all the methods in the comparison. Although Coalescence is not significantly better than Random Forests, we have identified conditions under which one will perform better than the other.
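The probabilistic parameter can be sketched as a one-line split chooser (an illustrative sketch, not the authors' code; `choose_split`, `gain`, and `alpha` are assumed names): alpha = 1 always takes the deterministic, gain-maximising split, alpha = 0 always takes a random one, and intermediate values traverse the spectrum between them.

```python
import random

# Variable-randomness split selection: with probability alpha pick the
# best split by a gain criterion (the deterministic end of the spectrum),
# otherwise pick a split uniformly at random (the complete-random end).

def choose_split(candidates, gain, alpha, rng=random):
    """Pick a split point: best-gain with probability alpha, else random."""
    if rng.random() < alpha:
        return max(candidates, key=gain)         # deterministic end
    return rng.choice(candidates)                # complete-random end

rng = random.Random(0)
candidates = [0.1, 0.5, 0.9]
gain = lambda s: -abs(s - 0.5)                   # toy gain: best split at 0.5
print(choose_split(candidates, gain, alpha=1.0, rng=rng))  # → 0.5
```

An ensemble built with different alpha values per tree is one way to sample several operating points along the spectrum at once, in the spirit of Coalescence.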