Description:
Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiling normal points. To the best of our knowledge, the concept of isolation has not been explored in the current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm that has linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest compares favourably with ORCA (a near-linear time complexity distance-based method), LOF and random forests in terms of AUC and processing time, especially on large data sets. iForest also works well in high-dimensional problems that have a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies.
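The abstract does not spell out the procedure, so the following is only a minimal sketch of the isolation idea, assuming the commonly described scheme: each tree recursively splits a small sub-sample on a randomly chosen attribute at a random value, and points that end up isolated after few splits receive a higher anomaly score. The function names, the defaults (100 trees, sub-sample size 256), the score normalisation and the toy data are assumptions made for illustration and are not taken from the abstract.

```python
import math
import random

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively partition points with random axis-parallel splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on this attribute
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(x, node, depth=0):
    """Number of splits needed to reach a leaf, plus an adjustment for unsplit leaves."""
    if "split" not in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def fit_forest(data, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on its own random sub-sample."""
    if len(data) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi

def anomaly_score(x, trees, psi):
    """Score in (0, 1); values closer to 1 mean the point is isolated sooner."""
    mean_depth = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c(psi))

# Toy usage: train on normal points only, then score a typical and a distant point.
data_rng = random.Random(42)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
trees, psi = fit_forest(normal)
score_inlier = anomaly_score((0.0, 0.0), trees, psi)
score_outlier = anomaly_score((10.0, 10.0), trees, psi)
```

Because each tree sees only a fixed-size sub-sample and is capped at a depth of about log2 of that size, the training cost in this sketch grows with the number of trees rather than with the size of the full data set, which is one way to read the abstract's claim about sub-sampling and low memory use.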
Description:
One common approach in swarm-based clustering is to use agents to create a set of clusters on a two-dimensional grid, and then use an existing clustering method to retrieve the clusters on the grid. The second step, which we call grid-cluster retrieval, is an essential step to obtain an explicit partitioning of data. In this study, we highlight the issues in grid-cluster retrievals commonly neglected by researchers, and demonstrate the non-trivial difficulties involved. To tackle the issues, we then evaluate three methods: K-means, hierarchical clustering (Weighted Single-link) and density-based clustering (DBScan). Among the three methods, DBScan is the only method which has not been previously used for grid-cluster retrievals, yet it is shown to be the most suitable method in terms of effectiveness and efficiency.
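As a rough illustration of grid-cluster retrieval with a density-based method, the sketch below runs a small, textbook-style DBScan over the coordinates of occupied grid cells. It is not the study's implementation: the cell layout, the Euclidean distance on grid coordinates, and the values of eps and min_pts are all assumptions chosen for the example.

```python
from collections import deque

NOISE = -1

def region_query(cells, i, eps):
    """Indices of all occupied cells within distance eps of cell i (including i)."""
    xi, yi = cells[i]
    return [j for j, (xj, yj) in enumerate(cells)
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps * eps]

def dbscan(cells, eps=1.5, min_pts=3):
    """Return one label per cell: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(cells)
    cluster = 0
    for i in range(len(cells)):
        if labels[i] is not None:
            continue
        neighbours = region_query(cells, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border cell
            continue
        labels[i] = cluster
        queue = deque(neighbours)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border cell: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(cells, j, eps)
            if len(j_neighbours) >= min_pts:  # core cell: keep expanding
                queue.extend(j_neighbours)
        cluster += 1
    return labels

# Hypothetical grid: two compact heaps of items and one stray item.
cells = [(0, 0), (0, 1), (1, 0), (1, 1),
         (8, 8), (8, 9), (9, 8), (9, 9),
         (4, 15)]
labels = dbscan(cells)
```

Unlike K-means, this approach does not need the number of clusters in advance and can leave stray items unassigned, which is one plausible reason a density-based method suits grids produced by agents.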