Clustering in large data sets with the limited memory bundle method
- Authors: Karmitsa, Napsu , Bagirov, Adil , Taheri, Sona
- Date: 2018
- Type: Text , Journal article
- Relation: Pattern Recognition Vol. 83, no. (2018), p. 245-259
- Relation: http://purl.org/au-research/grants/arc/DP140103213
- Full Text: false
- Reviewed:
- Description: The aim of this paper is to design an algorithm based on nonsmooth optimization techniques to solve the minimum sum-of-squares clustering problems in very large data sets. First, the clustering problem is formulated as a nonsmooth optimization problem. Then the limited memory bundle method [Haarala et al., 2007] is modified and combined with an incremental approach to design a new clustering algorithm. The algorithm is evaluated using real world data sets with both the large number of attributes and the large number of data points. It is also compared with some other optimization based clustering algorithms. The numerical results demonstrate the efficiency of the proposed algorithm for clustering in very large data sets.
An algorithm for clustering using L1-norm based on hyperbolic smoothing technique
- Authors: Bagirov, Adil , Mohebi, Ehsan
- Date: 2016
- Type: Text , Journal article
- Relation: Computational Intelligence Vol. 32, no. 3 (2016), p. 439-457
- Relation: http://purl.org/au-research/grants/arc/DP140103213
- Full Text: false
- Reviewed:
- Description: Cluster analysis deals with the problem of organization of a collection of objects into clusters based on a similarity measure, which can be defined using various distance functions. The use of different similarity measures allows one to find different cluster structures in a data set. In this article, an algorithm is developed to solve clustering problems where the similarity measure is defined using the L1-norm. The algorithm is designed using the nonsmooth optimization approach to the clustering problem. Smoothing techniques are applied to smooth both the clustering function and the L1-norm. The algorithm computes clusters sequentially and finds global or near global solutions to the clustering problem. Results of numerical experiments using 12 real-world data sets are reported, and the proposed algorithm is compared with two other clustering algorithms. ©2015 Wiley Periodicals, Inc.
Nonsmooth DC programming approach to the minimum sum-of-squares clustering problems
- Authors: Bagirov, Adil , Taheri, Sona , Ugon, Julien
- Date: 2016
- Type: Text , Journal article
- Relation: Pattern Recognition Vol. 53, no. (2016), p. 12-24
- Relation: http://purl.org/au-research/grants/arc/DP140103213
- Full Text: false
- Reviewed:
- Description: This paper introduces an algorithm for solving the minimum sum-of-squares clustering problems using their difference of convex representations. A non-smooth non-convex optimization formulation of the clustering problem is used to design the algorithm. Characterizations of critical points, stationary points in the sense of generalized gradients and inf-stationary points of the clustering problem are given. The proposed algorithm is tested and compared with other clustering algorithms using large real world data sets. © 2015 Elsevier Ltd. All rights reserved.
Evaluating authorship distance methods using the positive Silhouette coefficient
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2013
- Type: Text , Journal article
- Relation: Natural Language Engineering Vol. 19, no. 4 (2013), p. 517-535
- Full Text:
- Reviewed:
- Description: Unsupervised Authorship Analysis (UAA) aims to cluster documents by authorship without knowing the authorship of any documents. An important factor in UAA is the method for calculating the distance between documents. This choice of the authorship distance method is considered more critical to the end result than the choice of cluster analysis algorithm. One method for measuring the correlation between a distance metric and a labelling (such as class values or clusters) is the Silhouette Coefficient (SC). The SC can be leveraged by measuring the correlation between the authorship distance method and the true authorship, evaluating the quality of the distance method. However, we show that the SC can be severely affected by outliers. To address this issue, we introduce the Positive Silhouette Coefficient, given as the proportion of instances with a positive SC value. This metric is not easily altered by outliers and produces a more robust metric. A large number of authorship distance methods are then compared using the PSC, and the findings are presented. This research provides an insight into the efficacy of methods for UAA and presents a framework for testing authorship distance methods.
- Description: C1