A heuristic algorithm for solving the minimum sum-of-squares clustering problems

**Authors:** Ordin, Burak; Bagirov, Adil
**Date:** 2015
**Type:** Text, Journal article
**Relation:** Journal of Global Optimization Vol. 61, no. 2 (2015), p. 341-361
**Relation:** http://purl.org/au-research/grants/arc/DP140103213
**Description:** Clustering is an important task in data mining. It can be formulated as a global optimization problem that is challenging for existing global optimization techniques even on medium-sized data sets. Various heuristics have been developed to solve the clustering problem. The global k-means and modified global k-means algorithms are among the most efficient heuristics for solving the minimum sum-of-squares clustering problem. However, these algorithms are not always accurate in finding global or near-global solutions. In this paper, we introduce a new algorithm to improve the accuracy of the modified global k-means algorithm in finding global solutions. We use an auxiliary cluster problem to generate a set of initial points and apply the k-means algorithm starting from these points. Numerical results on 16 real-world data sets clearly demonstrate the superiority of the proposed algorithm over the global and modified global k-means algorithms in finding global solutions to clustering problems.
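The core strategy the abstract describes, running k-means from a set of candidate initial centers and keeping the best solution, can be sketched as below. This is a minimal illustration, not the authors' algorithm: the auxiliary cluster problem that generates the candidates is replaced here by an arbitrary list of starting configurations supplied by the caller.

```python
import numpy as np

def kmeans(data, centers, iters=100):
    """Plain Lloyd's k-means from a given set of initial centers."""
    centers = centers.copy()
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        for j in range(len(centers)):
            pts = data[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    sse = ((data - centers[labels]) ** 2).sum()
    return centers, sse

def best_of_starts(data, starts):
    """Run k-means from every candidate start; keep the lowest SSE."""
    best = None
    for init in starts:
        centers, sse = kmeans(data, init)
        if best is None or sse < best[1]:
            best = (centers, sse)
    return best
```

A good generator of starting points (which is exactly what the paper's auxiliary problem provides) matters because Lloyd's iterations only find a local minimum of the sum-of-squares objective.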

Empirical investigation of consensus clustering for large ECG data sets


**Authors:** Kelarev, Andrei; Stranieri, Andrew; Yearwood, John; Jelinek, Herbert
**Date:** 2012
**Type:** Text, Conference proceedings
**Description:** This article investigates a novel machine learning approach that applies consensus clustering in conjunction with classification to the mining of very large, highly dimensional ECG data sets. To obtain robust and stable clusterings, consensus functions can be applied to clustering ensembles that combine a multitude of independent initial clusterings. Direct application of consensus functions to highly dimensional ECG data sets remains computationally expensive and impractical. We introduce a multistage scheme comprising procedures for dimensionality reduction and consensus clustering of randomized samples, followed by a fast supervised classification algorithm. Applying the Hybrid Bipartite Graph Formulation combined with rank ordering and SMO, we obtained an area under the receiver operating characteristic curve of 0.987. The performance of the classification algorithm at the final stage is crucial to the effectiveness of this technique and can be regarded as an indication of the reliability, quality and stability of the combined consensus clustering. © 2012 IEEE.
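The idea of combining many independent base clusterings into one consensus clustering can be illustrated with a co-association matrix, one of the simplest consensus functions. This is not the Hybrid Bipartite Graph Formulation used in the paper; it is a generic sketch of the ensemble step, with a naive average-linkage cut standing in for the paper's pipeline.

```python
import numpy as np

def coassociation_consensus(labelings, k):
    """Combine several base clusterings into one consensus clustering.

    labelings: list of 1-D label arrays, one per base clustering.
    Builds the co-association matrix (fraction of base clusterings in
    which two points share a cluster) and cuts it by greedy
    average-linkage merging until k groups remain.
    """
    n = len(labelings[0])
    co = np.zeros((n, n))
    for lab in labelings:
        co += (lab[:, None] == lab[None, :]).astype(float)
    co /= len(labelings)
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = np.mean([co[i, j] for i in clusters[a] for j in clusters[b]])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    labels = np.empty(n, dtype=int)
    for cid, members in enumerate(clusters):
        labels[members] = cid
    return labels
```

The quadratic co-association matrix is exactly why direct consensus on very large data sets is impractical, which motivates the paper's use of randomized samples and a final supervised classifier.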

Malicious Spam Emails Developments and Authorship Attribution


**Authors:** Alazab, Mamoun; Layton, Robert; Broadhurst, Roderic; Bouhours, Brigitte
**Date:** 2013
**Type:** Text, Conference paper
**Relation:** Proceedings - 4th Cybercrime and Trustworthy Computing Workshop, CTC 2013, p. 58-68
**Description:** The Internet is a decentralized structure that offers speedy communication, has a global reach and provides anonymity, a characteristic invaluable for committing illegal activities. In parallel with the spread of the Internet, cybercrime has rapidly evolved from a relatively low-volume crime to a common high-volume crime. A typical example is the spreading of spam emails, whose content tries to entice the recipient to click a URL linking to a malicious Web site or to download a malicious attachment. Analysts attempting to provide intelligence on spam activities quickly find that the volume of spam circulating daily is overwhelming; any intelligence gathered therefore represents only a small sample, not the global picture. While past studies have looked at automating some of these analyses using topic-based models, i.e. separating emails into clusters with similar topics, our preliminary research investigates the usefulness of authorship-based models for this purpose. In the first phase, we clustered a set of spam emails using an authorship-based clustering algorithm. In the second phase, we analysed those clusters using a set of linguistic, structural and syntactic features. These analyses reveal that emails within each cluster were likely written by the same author, but that it is unlikely we managed to group together all spam produced by each group. This problem of high purity with low recall has been faced in past authorship research. While it is also a limitation of our research, the clusters themselves remain useful for automating analysis, because they reduce the work that needs to be performed. Our second phase revealed useful information about the groups that can be utilized in future research for further analysis, for example, identifying further linkages behind spam campaigns.
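Authorship-based clustering of the kind described rests on stylometric features rather than topic words. The paper does not specify its feature set or algorithm in this record, so the following is an illustrative sketch only: character trigram frequency profiles (a common stylometric feature) clustered with k-means using a deterministic farthest-point initialization.

```python
import numpy as np
from collections import Counter

def ngram_profile(text, vocab, n=3):
    """Relative frequencies of character n-grams over a fixed vocabulary,
    a simple stylometric feature often used in authorship analysis."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return np.array([grams.get(g, 0) / total for g in vocab])

def cluster_by_style(texts, k=2, n=3, iters=50):
    """Cluster texts by their n-gram profiles with k-means."""
    # One shared vocabulary so all profile vectors align.
    vocab = sorted({t[i:i + n] for t in texts for i in range(len(t) - n + 1)})
    X = np.array([ngram_profile(t, vocab, n) for t in texts])
    # Farthest-point initialization: deterministic and well spread.
    centers = [X[0]]
    while len(centers) < k:
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(2).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels
```

High purity with low recall, as reported in the abstract, would show up here as clusters whose members share an author while the same author's remaining emails land elsewhere.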

New diagonal bundle method for clustering problems in large data sets


**Authors:** Karmitsa, Napsu; Bagirov, Adil; Taheri, Sona
**Date:** 2017
**Type:** Text, Journal article
**Relation:** European Journal of Operational Research Vol. 263, no. 2 (2017), p. 367-379
**Relation:** http://purl.org/au-research/grants/arc/DP140103213
**Description:** Clustering is one of the most important tasks in data mining. Recent developments in computer hardware allow us to store in random access memory (RAM) and repeatedly read data sets with hundreds of thousands and even millions of data points. This makes it possible to apply conventional clustering algorithms to such data sets. However, these algorithms may need prohibitively large computational time and fail to produce accurate solutions. It is therefore important to develop clustering algorithms that are accurate and can provide real-time clustering in large data sets. This paper introduces one such algorithm. Using a nonsmooth optimization formulation of the clustering problem, the objective function is represented as a difference of two convex (DC) functions. A new diagonal bundle algorithm that explicitly uses this structure is then designed and combined with an incremental approach. The method is evaluated on real-world data sets with both a large number of attributes and a large number of data points, and is compared numerically with two other clustering algorithms. © 2017 Elsevier B.V.
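The DC representation mentioned in the abstract can be made explicit. For cluster centers $x_1,\dots,x_k \in \mathbb{R}^n$ and data points $a_1,\dots,a_m$, a standard decomposition from the nonsmooth clustering literature (using the identity $\min_j d_j = \sum_j d_j - \max_j \sum_{s \ne j} d_s$) is:

$$f_k(x) = \frac{1}{m}\sum_{i=1}^{m}\min_{j=1,\dots,k}\|x_j - a_i\|^2 = f_1(x) - f_2(x),$$

$$f_1(x) = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k}\|x_j - a_i\|^2, \qquad f_2(x) = \frac{1}{m}\sum_{i=1}^{m}\max_{j=1,\dots,k}\sum_{s \ne j}\|x_s - a_i\|^2.$$

Here $f_1$ is a sum of convex quadratics and $f_2$ is a pointwise maximum of sums of convex quadratics, so both are convex, which is the structure a DC-aware bundle method can exploit.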

Accurate and efficient clustering algorithms for very large data sets

**Authors:** Quddus, Syed
**Date:** 2017
**Type:** Text, Thesis, PhD
**Description:** The ability to mine and extract useful information from large data sets is a common concern for organizations. Data on the internet is growing rapidly, and the development of new approaches to collect, store and mine large amounts of data is becoming increasingly important. Clustering is one of the main tasks in data mining. Many clustering algorithms have been proposed, but some clustering problems have still not been addressed in depth, especially clustering in large data sets. Clustering in large data sets is important in many applications, including network intrusion detection, fraud detection in banking systems, air traffic control, web logs, sensor networks, social networks and bioinformatics. Data sets in these applications contain from hundreds of thousands to hundreds of millions of data points, and they may contain hundreds or thousands of attributes. Recent developments in computer hardware allow data sets with hundreds of thousands and even millions of data points to be stored in random access memory and read repeatedly, which makes the use of existing clustering algorithms on such data sets possible. However, these algorithms require prohibitively large CPU time and fail to produce accurate solutions. It is therefore important to develop clustering algorithms that are accurate and can provide real-time clustering in such data sets, especially in the big data era. The aim of this PhD study is to develop accurate, real-time algorithms for clustering very large data sets containing hundreds of thousands and millions of data points. Such algorithms are developed by combining heuristic algorithms with the incremental approach. They also involve a special procedure that identifies dense areas in a data set and computes a subset of the most informative representative data points in order to decrease the size of the data set. This study develops center-based clustering algorithms, whose success strongly depends on the choice of starting cluster centers. Different procedures are proposed to generate such centers, and special procedures are designed to identify the most promising starting cluster centers and restrict their number. The new clustering algorithms are evaluated using large data sets available in the public domain, and their results are compared with those obtained using several existing center-based clustering algorithms.
**Description:** Doctor of Philosophy
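The incremental approach the abstract refers to grows the solution one cluster at a time, reusing the previous solution when a center is added. The thesis's own procedures for generating starting centers are not given in this record, so the sketch below uses a farthest-point rule as a stand-in; only the overall incremental structure is the point.

```python
import numpy as np

def lloyd(data, centers, iters=100):
    """Standard k-means refinement from given centers."""
    centers = centers.copy()
    for _ in range(iters):
        labels = ((data[:, None] - centers[None]) ** 2).sum(2).argmin(1)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = data[labels == j].mean(0)
    return centers

def incremental_clustering(data, k):
    """Grow the solution one cluster at a time.

    Start with the 1-cluster solution (the mean of the data); at each
    step add, as a new center, the point farthest from its nearest
    existing center, then refine all centers with k-means.
    """
    centers = data.mean(axis=0, keepdims=True)
    while len(centers) < k:
        d = ((data[:, None] - centers[None]) ** 2).sum(2).min(1)
        centers = np.vstack([centers, data[d.argmax()]])
        centers = lloyd(data, centers)
    return centers
```

Because each intermediate solution is already refined, the k-cluster problem starts from a good (k-1)-cluster solution instead of from scratch, which is what makes the approach attractive for large data sets.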

