Neighbourhood contrast : A better means to detect clusters than density
- Authors: Chen, Bo , Ting, Kaiming
- Date: 2018
- Type: Text , Conference paper
- Relation: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018; Melbourne, Australia; 3rd-6th June 2018; published in Lecutre Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 10939 LNAI, p. 401-412
- Full Text: false
- Reviewed:
- Description: Most density-based clustering algorithms suffer from large density variations among clusters. This paper proposes a new measure called Neighbourhood Contrast (NC) as a better alternative to density in detecting clusters. The proposed NC admits all local density maxima, regardless of their densities, to have similar NC values. Due to this unique property, NC is a better means to detect clusters in a dataset with large density variations among clusters. We provide two applications of NC. First, replacing density with NC in the current state-of-the-art clustering procedure DP leads to significantly improved clustering performance. Second, we devise a new clustering algorithm called Neighbourhood Contrast Clustering (NCC) which does not require density or distance calculations, and therefore has a linear time complexity in terms of dataset size. Our empirical evaluation shows that both NC-based methods outperform density-based methods including the current state-of-the-art.
LiNearN : A new approach to nearest neighbour density estimator
- Authors: Wells, Jonathan , Ting, Kaiming , Washio, Takashi
- Date: 2014
- Type: Text , Journal article
- Relation: Pattern Recognition Vol. 47, no. 8 (2014), p. 2702-2720
- Full Text: false
- Reviewed:
- Description: Despite their wide spread use, nearest neighbour density estimators have two fundamental limitations: O(n2) time complexity and O(n) space complexity. Both limitations constrain nearest neighbour density estimators to small data sets only. Recent progress using indexing schemes has improved to near linear time complexity only.We propose a new approach, called LiNearN for Linear time Nearest Neighbour algorithm, that yields the first nearest neighbour density estimator having O(n) time complexity and constant space complexity, as far as we know. This is achieved without using any indexing scheme because LiNearN uses a subsampling approach for which the subsample values are significantly less than the data size. Like existing density estimators, our asymptotic analysis reveals that the new density estimator has a parameter to trade off between bias and variance. We show that algorithms based on the new nearest neighbour density estimator can easily scale up to data sets with millions of instances in anomaly detection and clustering tasks. Highlights•Reject the premise that a NN algorithm must find the NN for every instance.•The first NN density estimator that has O(n) time complexity and O(1) space complexity.•These complexities are achieved without using any indexing scheme.•Our asymptotic analysis reveals that it trades off between bias and variance.•Easily scales up to large data sets in anomaly detection and clustering tasks.
A general stochastic clustering method for automatic cluster discovery
- Authors: Tan, Swee , Ting, Kaiming , Teng, Shyh
- Date: 2011
- Type: Text , Journal article
- Relation: Pattern Recognition Vol. 44, no. 10-11 (2011), p. 2786-2799
- Full Text: false
- Reviewed:
- Description: Finding clusters in data is a challenging problem. Given a dataset, we usually do not know the number of natural clusters hidden in the dataset. The problem is exacerbated when there is little or no additional information except the data itself. This paper proposes a general stochastic clustering method that is a simplification of nature-inspired ant-based clustering approach. It begins with a basic solution and then performs stochastic search to incrementally improve the solution until the underlying clusters emerge, resulting in automatic cluster discovery in datasets. This method differs from several recent methods in that it does not require users to input the number of clusters and it makes no explicit assumption about the underlying distribution of a dataset. Our experimental results show that the proposed method performs better than several existing methods in terms of clustering accuracy and efficiency in majority of the datasets used in this study. Our theoretical analysis shows that the proposed method has linear time and space complexities, and our empirical study shows that it can accurately and efficiently discover clusters in large datasets in which many existing methods fail to run.