Density-ratio based clustering for discovering clusters with varying densities
- Authors: Zhu, Ye , Ting, Kaiming , Carman, Mark
- Date: 2016
- Type: Text , Journal article
- Relation: Pattern Recognition Vol. 60, no. (2016), p. 983-997
- Full Text: false
- Reviewed:
- Description: Density-based clustering algorithms are able to identify clusters of arbitrary shapes and sizes in a dataset which contains noise. It is well-known that most of these algorithms, which use a global density threshold, have difficulty identifying all clusters in a dataset having clusters of greatly varying densities. This paper identifies and analyses the condition under which density-based clustering algorithms fail in this scenario. It proposes a density-ratio based method to overcome this weakness, and reveals that it can be implemented in two approaches. One approach is to modify a density-based clustering algorithm to do density-ratio based clustering by using its density estimator to compute density-ratio. The other approach involves rescaling the given dataset only. An existing density-based clustering algorithm, which is applied to the rescaled dataset, can find all clusters with varying densities that would otherwise impossible had the same algorithm been applied to the unscaled dataset. We provide an empirical evaluation using DBSCAN, OPTICS and SNN to show the effectiveness of these two approaches. © 2016 Elsevier Ltd
Commentary : A decomposition of the outlier detection problem into a set of supervised learning problems
- Authors: Zhu, Ye , Ting, Kaiming
- Date: 2016
- Type: Text , Journal article
- Relation: Machine Learning Vol. 105, no. 2 (2016), p. 301-304
- Full Text: false
- Reviewed:
- Description: This article discusses the material in relation to iForest (Liu et al. in ACM Trans Knowl Discov Data 6(1):3, 2012) reported in a recent Machine Learning Journal paper by Paulheim and Meusel (Mach Learn 100(2–3):509–531, 2015). It presents an empirical comparison result of iForest using the default parameter settings suggested by its creator (Liu et al. 2012) and iForest using the settings employed by Paulheim and Meusel (2015). This comparison has an impact on the conclusion made by Paulheim and Meusel (2015). © 2016, The Author(s).
Grouping points by shared subspaces for effective subspace clustering
- Authors: Zhu, Ye , Ting, Kaiming , Carman, Mark
- Date: 2018
- Type: Text , Journal article
- Relation: Pattern Recognition Vol. 83, no. (2018), p. 230-244
- Full Text: false
- Reviewed:
- Description: Clusters may exist in different subspaces of a multidimensional dataset. Traditional full-space clustering algorithms have difficulty in identifying these clusters. Various subspace clustering algorithms have used different subspace search strategies. They require clustering to assess whether cluster(s) exist in a subspace. In addition, all of them perform clustering by measuring similarity between points in the given feature space. As a result, the subspace selection and clustering processes are tightly coupled. In this paper, we propose a new subspace clustering framework named CSSub (Clustering by Shared Subspaces). It enables neighbouring core points to be clustered based on the number of subspaces they share. It explicitly splits candidate subspace selection and clustering into two separate processes, enabling different types of cluster definitions to be employed easily. Through extensive experiments on synthetic and real-world datasets, we demonstrate that CSSub discovers non-redundant subspace clusters with arbitrary shapes in noisy data; and it significantly outperforms existing state-of-the-art subspace clustering algorithms.
Isolation-based anomaly detection using nearest-neighbor ensembles
- Authors: Bandaragoda, Tharindu , Ting, Kaiming , Albrecht, David , Liu, Fei , Zhu, Ye , Wells, Jonathan
- Date: 2018
- Type: Text , Journal article
- Relation: Computational Intelligence Vol. 34, no. 4 (2018), p. 968-998
- Full Text: false
- Reviewed:
- Description: The first successful isolation-based anomaly detector, ie, iForest, uses trees as a means to perform isolation. Although it has been shown to have advantages over existing anomaly detectors, we have identified 4 weaknesses, ie, its inability to detect local anomalies, anomalies with a high percentage of irrelevant attributes, anomalies that are masked by axis-parallel clusters, and anomalies in multimodal data sets. To overcome these weaknesses, this paper shows that an alternative isolation mechanism is required and thus presents iNNE or isolation using Nearest Neighbor Ensemble. Although relying on nearest neighbors, iNNE runs significantly faster than the existing nearest neighbor–based methods such as the local outlier factor, especially in data sets having thousands of dimensions or millions of instances. This is because the proposed method has linear time complexity and constant space complexity. © 2018 Wiley Periodicals, Inc.
Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure
- Authors: Ting, Kaiming , Zhu, Ye , Carman, Mark , Zhu, Yue , Zhi-Hua, Zhou
- Date: 2016
- Type: Text , Conference paper
- Relation: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco; August 13th-17th, 2016 p. 1205-1214
- Full Text: false
- Reviewed:
- Description: This paper introduces the first generic version of data dependent dissimilarity and shows that it provides a better closest match than distance measures for three existing algorithms in clustering, anomaly detection and multi-label classification. For each algorithm, we show that by simply replacing the distance measure with the data dependent dissimilarity measure, it overcomes a key weakness of the otherwise unchanged algorithm.
Local contrast as an effective means to robust clustering against varying densities
- Authors: Chen, Bo , Ting, Kaiming , Washio, Takashi , Zhu, Ye
- Date: 2018
- Type: Text , Journal article
- Relation: Machine Learning Vol. 107, no. 8-10 (2018), p. 1621-1645
- Full Text:
- Reviewed:
- Description: Most density-based clustering methods have difficulties detecting clusters of hugely different densities in a dataset. A recent density-based clustering CFSFDP appears to have mitigated the issue. However, through formalising the condition under which it fails, we reveal that CFSFDP still has the same issue. To address this issue, we propose a new measure called Local Contrast, as an alternative to density, to find cluster centers and detect clusters. We then apply Local Contrast to CFSFDP, and create a new clustering method called LC-CFSFDP which is robust in the presence of varying densities. Our empirical evaluation shows that LC-CFSFDP outperforms CFSFDP and three other state-of-the-art variants of CFSFDP. © 2018, The Author(s).
Nearest-neighbour-induced isolation similarity and its impact on density-based clustering
- Authors: Qin, Xiaoyu , Ting, Kai , Zhu, Ye , Lee, Vincent
- Date: 2019
- Type: Text , Conference paper
- Relation: 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Annual Conference on Innovative Applications of Artificial Intelligence, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, 27 January to 1 February 2019. p. 4755-4762
- Full Text:
- Reviewed:
- Description: A recent proposal of data dependent similarity called Isolation Kernel/Similarity has enabled SVM to produce better classification accuracy. We identify shortcomings of using a tree method to implement Isolation Similarity; and propose a nearest neighbour method instead. We formally prove the characteristic of Isolation Similarity with the use of the proposed method. The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of clusters called mass-connected clusters is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes one which detects mass-connected clusters, when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected, while density-connected clusters cannot. © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org).