Isolation forest
- Authors: Liu, Fei; Ting, Kaiming; Zhou, Zhi-Hua
- Date: 2008
- Type: Text, Conference paper
- Relation: Proceedings of the Eighth IEEE International Conference on Data Mining p. 413-422
- Full Text: false
- Reviewed:
- Description: Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiling normal points. To the best of our knowledge, the concept of isolation has not been explored in the current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest compares favourably with ORCA (a near-linear-time distance-based method), LOF, and Random Forests in terms of AUC and processing time, especially on large data sets. iForest also works well in high-dimensional problems which have a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies.
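The isolation idea in the abstract can be illustrated with a minimal one-dimensional sketch: points that are "few and different" need fewer random splits to end up alone, so their average isolation depth over an ensemble of random sub-samples is shallower. The function names, the depth limit, and the 1-D data are illustrative; the actual iForest builds trees that split on a randomly chosen attribute at a random split value.

```python
import random

random.seed(0)

def itree_path_length(x, sample, depth=0, limit=10):
    # Follow the random splits that would isolate x within this sub-sample
    # (1-D sketch; real iForest picks a random attribute, then a random value).
    if depth >= limit or len(sample) <= 1:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # Keep only the side of the split that x falls on.
    side = [v for v in sample if (v < split) == (x < split)]
    return itree_path_length(x, side, depth + 1, limit)

def average_path_length(x, data, n_trees=100, sub=64):
    # Average isolation depth of x over many random sub-samples:
    # this is where iForest's sub-sampling enters the picture.
    total = 0
    for _ in range(n_trees):
        total += itree_path_length(x, random.sample(data, sub))
    return total / n_trees

data = [random.gauss(0.0, 1.0) for _ in range(500)]
inlier_depth = average_path_length(0.0, data)    # central point: isolates slowly
outlier_depth = average_path_length(8.0, data)   # distant point: isolates quickly
```

A far-out point such as 8.0 is separated from a 64-point sub-sample in only a handful of splits, while a central point typically runs into the depth limit, which is the behaviour the method exploits.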
Isolation-based anomaly detection
- Authors: Liu, Fei; Ting, Kaiming; Zhou, Zhi-Hua
- Date: 2012
- Type: Text, Journal article
- Relation: ACM Transactions on Knowledge Discovery from Data Vol. 6, no. 1 (2012), p. 1-39
- Full Text: false
- Reviewed:
- Description: Anomalies are data points that are few and different. As a result of these properties, we show that anomalies are susceptible to a mechanism called isolation. This article proposes a method called Isolation Forest (iForest), which detects anomalies purely on the basis of isolation, without employing any distance or density measure---fundamentally different from all existing methods. As a result, iForest is able to exploit sub-sampling (i) to achieve a low linear time complexity and a small memory requirement and (ii) to deal effectively with the effects of swamping and masking. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF, and Random Forests in terms of AUC and processing time, and is robust against masking and swamping effects. iForest also works well in high-dimensional problems containing a large number of irrelevant attributes, and when anomalies are absent from the training sample.
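The abstract does not spell out the score, but the paper converts the expected isolation depth E[h(x)] into a normalised anomaly score s(x, n) = 2^(-E[h(x)] / c(n)), where c(n) is the average path length of an unsuccessful binary-search-tree search. A small sketch, using the harmonic-number approximation H(i) ≈ ln(i) + γ (the paper treats small n as special cases, which this sketch glosses over):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    # Average path length of an unsuccessful BST search over n points:
    # c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated by ln(i) + gamma.
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(mean_path, n):
    # s(x, n) = 2 ** (-E[h(x)] / c(n)): scores close to 1 flag anomalies,
    # scores around 0.5 indicate unremarkable points.
    return 2.0 ** (-mean_path / c(n))
```

By construction, a point whose average path length equals c(n) scores exactly 0.5, and shorter paths push the score toward 1, which is how quickly isolated points are flagged.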
On detecting clustered anomalies using SCiForest
- Authors: Liu, Fei; Ting, Kaiming; Zhou, Zhi-Hua
- Date: 2010
- Type: Text, Conference paper
- Relation: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD) p. 274-290
- Full Text: false
- Reviewed:
Spectrum of Variable-Random Trees
- Authors: Liu, Fei; Ting, Kaiming; Yu, Yang; Zhou, Zhi-Hua
- Date: 2008
- Type: Text, Journal article
- Relation: The Journal of Artificial Intelligence Research Vol. 32 (2008), p. 355-384
- Full Text: false
- Reviewed:
- Description: In this paper, we show that a continuous spectrum of randomisation exists, in which most existing tree randomisation methods operate only around the two ends of the spectrum. That leaves a huge part of the spectrum largely unexplored. We propose a base learner, VR-Tree, which generates trees with variable randomness. VR-Trees span the range from conventional deterministic trees to complete-random trees using a probabilistic parameter. Using VR-Trees as the base models, we explore the entire spectrum of randomised ensembles, together with Bagging and Random Subspace. We discover that the two halves of the spectrum have distinct characteristics, and understanding them allows us to propose a new approach to building better decision tree ensembles. We name this approach Coalescence; it coalesces a number of points in the random half of the spectrum. Coalescence acts as a committee of "experts" to cater for unforeseeable conditions in the training data. Coalescence is found to perform better than any single operating point in the spectrum, without the need to tune to a specific level of randomness. In our empirical study, Coalescence ranks top among the benchmark ensemble methods, including Random Forests, Random Subspace and C5 Boosting; and only Coalescence is significantly better than Bagging and Max-Diverse Ensemble among all the methods in the comparison. Although Coalescence is not significantly better than Random Forests, we have identified conditions under which one will perform better than the other.
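The probabilistic parameter described in the abstract can be sketched as a per-node coin flip that decides between a deterministic best-gain split and a purely random one. The parameter name `alpha` and the helper names below are illustrative, not the paper's notation; a value of 1.0 recovers a conventional deterministic tree and 0.0 a complete-random tree, with the rest of the spectrum in between.

```python
import random

random.seed(1)

def choose_split(candidates, gain, alpha):
    # VR-Tree-style split selection: with probability alpha take the
    # deterministic best-gain split; otherwise pick a split uniformly
    # at random. alpha sweeps the randomisation spectrum end to end.
    if random.random() < alpha:
        return max(candidates, key=gain)
    return random.choice(candidates)
```

An ensemble built with several values of alpha drawn from the random half of the spectrum would correspond, roughly, to the Coalescence idea of pooling multiple operating points rather than tuning a single one.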