A comparative study of practical stochastic clustering method with traditional methods
- Authors: Tan, Swee; Ting, Kaiming; Teng, Shyh
- Date: 2010
- Type: Text, Conference paper
- Relation: Proceedings of the 23rd Australasian Joint Conference on Artificial Intelligence, p. 112-121
- Full Text: false
- Reviewed:
- Description: In many real-world clustering problems, there is usually little information about the clusters underlying a given dataset. For example, the number of clusters hidden in a dataset is usually not known a priori. This is an issue because many traditional clustering methods require such information as input. This paper examines a practical stochastic clustering method (PSCM) that can find clusters in datasets without requiring users to specify the centroids or the number of clusters. Compared with traditional methods (k-means, self-organising maps and hierarchical clustering), the performance of PSCM is found to be robust against overlapping clusters and clusters of uneven sizes. The proposed method also scales well with datasets having varying numbers of clusters and dimensions. Finally, our experimental results on real-world data confirm that the proposed method performs competitively against the traditional clustering methods in terms of clustering accuracy and efficiency.
A general stochastic clustering method for automatic cluster discovery
- Authors: Tan, Swee; Ting, Kaiming; Teng, Shyh
- Date: 2011
- Type: Text, Journal article
- Relation: Pattern Recognition Vol. 44, no. 10-11 (2011), p. 2786-2799
- Full Text: false
- Reviewed:
- Description: Finding clusters in data is a challenging problem. Given a dataset, we usually do not know the number of natural clusters hidden in it. The problem is exacerbated when there is little or no additional information except the data itself. This paper proposes a general stochastic clustering method that is a simplification of the nature-inspired ant-based clustering approach. It begins with a basic solution and then performs a stochastic search to incrementally improve the solution until the underlying clusters emerge, resulting in automatic cluster discovery in datasets. This method differs from several recent methods in that it does not require users to input the number of clusters and it makes no explicit assumption about the underlying distribution of a dataset. Our experimental results show that the proposed method performs better than several existing methods in terms of clustering accuracy and efficiency on the majority of the datasets used in this study. Our theoretical analysis shows that the proposed method has linear time and space complexities, and our empirical study shows that it can accurately and efficiently discover clusters in large datasets on which many existing methods fail to run.
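The incremental-improvement idea in this abstract (start from a basic solution, then stochastically refine it until clusters emerge, with no cluster count supplied) can be sketched with a toy hill-climbing search. This is an illustrative sketch, not the paper's algorithm: the objective (within-cluster scatter plus a per-cluster penalty `lam`), the single-point move scheme and all parameter values are assumptions introduced here.

```python
import random

def sse(clusters, points):
    """Sum of squared distances of points to their cluster mean."""
    total = 0.0
    d = len(points[0])
    for members in clusters:
        if not members:
            continue
        mean = [sum(points[i][j] for i in members) / len(members) for j in range(d)]
        for i in members:
            total += sum((points[i][j] - mean[j]) ** 2 for j in range(d))
    return total

def cost(clusters, points, lam):
    """Objective: scatter plus a penalty lam per non-empty cluster,
    so merging nearby points pays off but merging distant ones does not."""
    k = sum(1 for m in clusters if m)
    return sse(clusters, points) + lam * k

def stochastic_cluster(points, lam=2.0, iters=5000, seed=0):
    rng = random.Random(seed)
    # Basic solution: every point in its own cluster.
    clusters = [{i} for i in range(len(points))]
    best = cost(clusters, points, lam)
    for _ in range(iters):
        i = rng.randrange(len(points))
        src = next(c for c, m in enumerate(clusters) if i in m)
        dst = rng.randrange(len(clusters))
        if dst == src:
            continue
        clusters[src].discard(i)          # propose moving point i
        clusters[dst].add(i)
        new = cost(clusters, points, lam)
        if new <= best:                   # accept non-worsening moves
            best = new
        else:                             # revert worsening moves
            clusters[dst].discard(i)
            clusters[src].add(i)
    return [sorted(m) for m in clusters if m]
```

On two well-separated groups, the search merges points within each group (each merge saves the penalty) but never across groups (the scatter increase dwarfs the saving), so two clusters emerge without k ever being specified.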
Clustering gene expression data using ant-based heuristics
- Authors: Tan, Swee; Ting, Kaiming; Teng, Shyh
- Date: 2011
- Type: Text, Conference paper
- Relation: IEEE Congress on Evolutionary Computation (IEEE CEC) 2011, p. 1-8
- Full Text: false
- Reviewed:
- Description: We consider the problem of finding clusters in novel datasets in which the number of clusters is not known a priori, and little or no additional information is available for users to adjust the parameters of a clustering algorithm. We address this problem using a stochastic algorithm named SATTA (Simplified Adaptive Time Dependent Transporter), which attempts to find clusters without requiring users to specify the number of clusters or adjust any parameters. SATTA is then compared with Expectation Maximization Clustering (EMC), which is also able to estimate the number of clusters using the principle of maximum likelihood and find the underlying clusters without human intervention. Our results on seven gene expression datasets show that SATTA significantly outperforms EMC in terms of clustering accuracy and efficiency. We discuss the conceptual differences between SATTA and EMC, which suggest that SATTA is a more promising alternative than EMC when little or no additional information is available for clustering novel datasets.
FaSS: Ensembles for stable learners
- Authors: Ting, Kaiming; Wells, Jonathan; Tan, Swee; Teng, Shyh; Webb, Geoffrey
- Date: 2009
- Type: Text, Conference paper
- Relation: 8th International Workshop on Multiple Classifier Systems (MCS 2009)
- Full Text: false
- Reviewed:
- Description: This paper introduces a new ensemble approach, Feature-Space Subdivision (FaSS), which builds local models instead of global models. FaSS is a generic ensemble approach that can use either stable or unstable models as its base models. In contrast, existing ensemble approaches which employ randomisation can only use unstable models. Our analysis shows that the new approach reduces the execution time to generate a model in an ensemble as the level of localisation in FaSS increases. Our empirical evaluation shows that FaSS performs significantly better than Boosting in terms of predictive accuracy when a stable learner, SVM, is used as the base learner. The speed-up achieved by FaSS makes feasible SVM ensembles that would otherwise be infeasible for large data sets, and FaSS SVM performs better than Boosting J48 and Random Forests when SVM is the preferred base learner.
Feature-subspace aggregating: ensembles for stable and unstable learners
- Authors: Ting, Kaiming; Wells, Jonathan; Tan, Swee; Teng, Shyh; Webb, Geoffrey
- Date: 2011
- Type: Text, Journal article
- Relation: Machine Learning Vol. 82, no. 3 (2011), p. 375-397
- Full Text: false
- Reviewed:
- Description: This paper introduces a new ensemble approach, Feature-Subspace Aggregating (Feating), which builds local models instead of global models. Feating is a generic ensemble approach that can enhance the predictive performance of both stable and unstable learners. In contrast, most existing ensemble approaches can improve the predictive performance of unstable learners only. Our analysis shows that the new approach reduces the execution time to generate a model in an ensemble through an increased level of localisation in Feating. Our empirical evaluation shows that Feating performs significantly better than Boosting, Random Subspace and Bagging in terms of predictive accuracy, when a stable learner SVM is used as the base learner. The speed-up achieved by Feating makes feasible SVM ensembles that would otherwise be infeasible for large data sets. When SVM is the preferred base learner, we show that Feating SVM performs better than Boosting decision trees and Random Forests. We further demonstrate that Feating also substantially reduces the error of another stable learner, k-nearest neighbour, and an unstable learner, decision tree.
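The notion of building local models over feature-space regions can be illustrated with a deliberately tiny sketch. This is not the published Feating algorithm: here each ensemble member subdivides the data at a single feature's median and fits a majority-class stub in each half, where the paper subdivides on feature subsets and uses real base learners such as SVM; all names below are introduced for illustration.

```python
from collections import Counter
from statistics import median

def fit_feating(X, y, features=None):
    """One local model per feature: split instances at that feature's
    median and record the majority class on each side (a stand-in for
    fitting a base learner such as SVM in each region)."""
    if features is None:
        features = range(len(X[0]))
    models = []
    for f in features:
        cut = median(row[f] for row in X)
        lo = [y[i] for i, row in enumerate(X) if row[f] <= cut]
        hi = [y[i] for i, row in enumerate(X) if row[f] > cut]
        lo_label = Counter(lo).most_common(1)[0][0] if lo else None
        hi_label = Counter(hi).most_common(1)[0][0] if hi else lo_label
        models.append((f, cut, lo_label, hi_label))
    return models

def predict_feating(models, x):
    """Majority vote over the local models whose region covers x."""
    votes = [lo if x[f] <= cut else hi for f, cut, lo, hi in models]
    return Counter(v for v in votes if v is not None).most_common(1)[0][0]
```

Each member is only ever consulted on the region containing the query point, which is the sense in which the models are local rather than global.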
Issues of grid-cluster retrievals in swarm-based clustering
- Authors: Tan, Swee; Ting, Kaiming; Teng, Shyh
- Date: 2008
- Type: Text, Conference paper
- Relation: Proceedings of the 2008 IEEE World Congress on Computational Intelligence, p. 511-518
- Full Text: false
- Reviewed:
- Description: One common approach in swarm-based clustering is to use agents to create a set of clusters on a two-dimensional grid, and then use an existing clustering method to retrieve the clusters from the grid. The second step, which we call grid-cluster retrieval, is essential to obtain an explicit partitioning of the data. In this study, we highlight the issues in grid-cluster retrieval commonly neglected by researchers, and demonstrate the non-trivial difficulties involved. To tackle these issues, we evaluate three methods: k-means, hierarchical clustering (weighted single-link) and density-based clustering (DBSCAN). Among the three, DBSCAN is the only method that has not previously been used for grid-cluster retrieval, yet it is shown to be the most suitable in terms of effectiveness and efficiency.
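The density-based retrieval step can be sketched with a minimal DBSCAN over 2-D grid positions. This is a generic textbook DBSCAN, not the paper's implementation; the `eps` and `min_pts` values in the usage note are assumptions chosen for the toy layout.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 2-D points: grow a cluster from each core point
    (one with at least min_pts neighbours within eps, counting itself);
    points reachable from no core point are labelled noise (-1)."""
    def neighbours(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # provisional noise
            continue
        labels[i] = cid               # start a new cluster at core point i
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid       # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is core too: keep expanding
        cid += 1
    return labels
```

On a grid with two dense 2x2 patches and one stray item, `dbscan(grid, eps=1.5, min_pts=3)` labels each patch as its own cluster and the stray item as noise, which is the property that makes density-based retrieval attractive here: isolated leftovers on the grid are not forced into a cluster.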