Accurate and efficient clustering algorithms for very large data sets

**Authors:** Quddus, Syed **Date:** 2017 **Type:** Text, Thesis, PhD **Full Text:** **Description:** The ability to mine and extract useful information from large data sets is a common concern for organizations. The volume of data on the internet is increasing rapidly, and so is the importance of developing new approaches to collect, store and mine large amounts of data. Clustering is one of the main tasks in data mining. Many clustering algorithms have been proposed, but some clustering problems have not been addressed in depth, especially those in large data sets. Clustering in large data sets is important in many applications, including network intrusion detection systems, fraud detection in banking systems, air traffic control, web logs, sensor networks, social networks and bioinformatics. Data sets in these applications contain from hundreds of thousands to hundreds of millions of data points and may contain hundreds or thousands of attributes. Recent developments in computer hardware make it possible to store data sets with hundreds of thousands and even millions of data points in random access memory and to read them repeatedly. This makes it possible to apply existing clustering algorithms to such data sets. However, these algorithms require prohibitively large CPU time and fail to produce an accurate solution. Therefore, it is important to develop clustering algorithms that are accurate and can provide real-time clustering in such data sets. This is especially important in the big data era. The aim of this PhD study is to develop accurate, real-time algorithms for clustering in very large data sets containing hundreds of thousands to millions of data points. These algorithms combine heuristic algorithms with the incremental approach. 
These algorithms also involve a special procedure to identify dense areas in a data set and compute a subset of the most informative representative data points in order to decrease the size of the data set. This PhD study also aims to develop center-based clustering algorithms. The success of these algorithms strongly depends on the choice of starting cluster centers. Different procedures are proposed to generate such centers. Special procedures are designed to identify the most promising starting cluster centers and to restrict their number. The new clustering algorithms are evaluated using large data sets available in the public domain. Their results are compared with those obtained using several existing center-based clustering algorithms. **Description:** Doctor of Philosophy
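
The incremental, center-based idea described in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes squared Euclidean distance, uses plain k-means (Lloyd) iterations as the refinement step, and scans every data point as a candidate starting center, which is exactly the cost the thesis's special procedures are meant to restrict. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def lloyd(points, centers, max_iter=100):
    """Refine the given centers with standard k-means (Lloyd) iterations."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def incremental_kmeans(points, k):
    """Build k centers one at a time, starting from the centroid of all data."""
    centers = [mean(points)]
    while len(centers) < k:
        # Distance of every point to its nearest current center.
        d = [min(sq_dist(p, c) for c in centers) for p in points]
        # Starting point for the new center: the data point whose addition
        # decreases the clustering objective the most (assumed heuristic).
        best = max(
            points,
            key=lambda q: sum(max(0.0, di - sq_dist(p, q)) for p, di in zip(points, d)),
        )
        centers = lloyd(points, centers + [best])
    return centers
```

Because each new center starts from a point chosen using the solution for k-1 clusters, the result does not depend on a random initialization. The candidate scan here is quadratic in the number of points, so it is only suitable for small illustrations.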

Nonsmooth optimization algorithms for clusterwise linear regression

**Authors:** Mirzayeva, Hijran **Date:** 2013 **Type:** Text, Thesis, PhD **Full Text:** false **Description:** Data mining is about solving problems by analyzing data present in databases. Supervised and unsupervised data classification (clustering) are among the most important techniques in data mining. Regression analysis is the process of fitting a function (often linear) to the data to discover how one or more variables vary as a function of another. The aim of clusterwise regression is to combine both of these techniques to discover trends within data when more than one trend is likely to exist. Clusterwise regression has applications, for instance, in market segmentation, where it allows one to gather information on customer behaviors for several unknown groups of customers. Different methods exist for solving clusterwise linear regression problems. Nevertheless, the development of efficient algorithms for solving these problems is still an important research topic. The aim of this thesis is to develop new algorithms for solving clusterwise linear regression problems in large data sets based on incremental and nonsmooth optimization approaches. Three new methods for solving clusterwise linear regression problems are developed and numerically tested on publicly available data sets for regression analysis. The first method is a new incremental algorithm for solving clusterwise linear regression problems based on their nonsmooth nonconvex formulation. The second method is a nonsmooth optimization algorithm for solving clusterwise linear regression problems. Nonsmooth optimization techniques are proposed for use instead of the Späth algorithm to solve the optimization problems at each iteration of the incremental algorithm. The discrete gradient method is used to solve the nonsmooth optimization problems at each iteration of the incremental algorithm. 
This approach allows one to reduce the CPU time and the number of regression problems solved in comparison with the first incremental algorithm. The third algorithm is based on an incremental approach and on smoothing techniques for solving clusterwise linear regression problems. The use of smoothing techniques allows one to apply powerful methods of smooth nonlinear programming to solve clusterwise linear regression problems. Numerical results are presented for all three algorithms using small to large data sets. The new algorithms are also compared with the multi-start Späth algorithm for clusterwise linear regression. **Description:** Doctor of Philosophy
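
To make the clusterwise linear regression problem concrete, the sketch below shows a simple alternating scheme for one input variable: fit a least-squares line to each cluster, then reassign each point to the line with the smallest squared residual. This is a k-means-style simplification in the spirit of the Späth algorithm, not any of the three methods developed in the thesis. The objective shown, the sum over points of the minimum squared residual over all lines, is one common nonsmooth formulation and is assumed here; all names are hypothetical.

```python
def fit_line(pts):
    """Least-squares fit y = a*x + b for a non-empty list of (x, y) pairs."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        # All x values coincide: fall back to a horizontal line.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in pts) / sxx
    return a, my - a * mx

def residual(line, pt):
    """Squared error of one (x, y) point with respect to one line (a, b)."""
    a, b = line
    return (a * pt[0] + b - pt[1]) ** 2

def clr_objective(lines, pts):
    """Nonsmooth objective: each point counts only its best-fitting line."""
    return sum(min(residual(line, p) for line in lines) for p in pts)

def alternating_clr(pts, labels, k, max_iter=100):
    """Alternate between fitting one line per cluster and reassigning points."""
    lines = []
    for _ in range(max_iter):
        lines = []
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            # An empty cluster gets a placeholder line y = 0.
            lines.append(fit_line(members) if members else (0.0, 0.0))
        new_labels = [min(range(k), key=lambda j: residual(lines[j], p)) for p in pts]
        if new_labels == labels:
            break
        labels = new_labels
    return lines, labels
```

The result of such a scheme depends on the initial labels, which is why multi-start variants are used as a baseline and why the incremental construction of starting solutions matters.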

Efficient piecewise linear classifiers and applications

**Authors:** Webb, Dean **Date:** 2011 **Type:** Text, Thesis, PhD **Full Text:** **Description:** Supervised learning has become an essential part of data mining for industry, military, science and academia. Classification, a type of supervised learning, allows a machine to learn from data and then predict certain behaviours, variables or outcomes. Classification can be used to solve many problems, including the detection of malignant cancers, the identification of potentially bad creditors and even enabling autonomy in robots. The ability to collect and store large amounts of data has increased significantly over the past few decades. However, the ability of classification techniques to deal with large-scale data has not kept pace. Many data transformation and reduction schemes have been tried with mixed success. This problem is further exacerbated when dealing with real-time classification in embedded systems. A real-time classifier must classify using only limited processing, memory and power resources. Piecewise linear boundaries are known to provide efficient real-time classifiers. They have low memory requirements, require little processing effort, are parameterless and classify in real time. Piecewise linear functions are used to approximate non-linear decision boundaries between pattern classes. Finding these piecewise linear boundaries is a difficult optimization problem that can require a long training time. Multiple optimization approaches have been used for real-time classification, but they can lead to suboptimal piecewise linear boundaries. This thesis develops three real-time piecewise linear classifiers that deal with large-scale data. Each classifier uses a single optimization algorithm in conjunction with an incremental approach that reduces the number of points as the decision boundaries are built. Two of the classifiers further reduce complexity by augmenting the incremental approach with additional schemes. 
One scheme uses hyperboxes to identify points inside the so-called “indeterminate” regions. The other uses a polyhedral conic set to identify data points lying on or close to the boundary. All other points are excluded from the process of building the decision boundaries. The three classifiers are applied to real-time data classification problems, and the results of numerical experiments on real-world data sets are reported. These results demonstrate that the new classifiers require a reasonable training time and that their test set accuracy is consistently good on most data sets compared with current state-of-the-art classifiers. **Description:** Doctor of Philosophy
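
The low cost of classifying with a piecewise linear boundary can be seen in a short sketch. The abstract does not say how the boundaries are represented; the max-min form used here (a maximum over groups of the minimum of affine functions within each group) is an assumption chosen for illustration, and the hand-built boxes stand in for boundaries that would normally be found by training. All names are hypothetical.

```python
def affine(w, b, x):
    """Value of the affine function w . x + b at point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def max_min(groups, x):
    """Max-min piecewise linear function: max over groups of the min within a group."""
    return max(min(affine(w, b, x) for w, b in group) for group in groups)

def classify(groups, x):
    """Assign class +1 where the piecewise linear function is positive, else -1."""
    return 1 if max_min(groups, x) > 0 else -1

def box(lo, hi):
    """Four half-planes whose minimum is positive strictly inside [lo, hi] x [lo, hi]."""
    return [((1, 0), -lo), ((-1, 0), hi), ((0, 1), -lo), ((0, -1), hi)]

# Class +1 is the union of two squares, a non-linear (non-convex) region.
groups = [box(0, 1), box(3, 4)]
```

For example, `classify(groups, (0.5, 0.5))` and `classify(groups, (3.5, 3.5))` return `1`, while `classify(groups, (2, 2))` returns `-1`. Classification needs only a handful of dot products and comparisons, and the stored model is just the list of weight vectors and offsets.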

**Authors:** Saunders, Gary **Date:** 2006 **Type:** Text, Thesis, PhD **Full Text:** **Description:** The cost of adverse drug reactions to society in the form of deaths, chronic illness, foetal malformation and many other effects is significant. For example, in the United States of America, adverse reactions to prescribed drugs are around the fourth leading cause of death. The reporting of adverse drug reactions is spontaneous and voluntary in Australia. Many methods have been used for the analysis of adverse drug reaction data, mostly using a statistical approach as a basis for clinical analysis in drug safety surveillance decision support. This thesis examines new approaches that may be used in the analysis of drug safety data. These methods differ significantly from the statistical methods in that they utilize co-variability methods of association to define drug-reaction relationships. Co-variability algorithms were developed in collaboration with Musa Mammadov to discover drugs associated with adverse reactions and possible drug-drug interactions. This method uses the system organ class (SOC) classification in the Australian Adverse Drug Reaction Advisory Committee (ADRAC) data to stratify reactions. The text categorization algorithm BoosTexter was found to work with the same drug safety data, and its performance and modus operandi were compared to our algorithms. These alternative methods were compared to standard disproportionality analysis methods for signal detection in drug safety data, including the Bayesian multi-item gamma Poisson shrinker (MGPS), which was found to have a problem with similar reaction terms in a report and innocent bystander drugs. A classification of drug terms was made using the anatomical-therapeutic-chemical classification (ATC) codes. This reduced the number of drug variables from 5081 drug terms to 14 main drug classes. The ATC classification is structured into a hierarchy of five levels. 
Exploitation of the ATC hierarchy allows the drug safety data to be stratified in such a way as to make them accessible to powerful existing tools. A data mining method that uses association rules, which groups them on the basis of content, was used as a basis for applying the ATC and SOC ontologies to ADRAC data. This allows different views of these associations (even very rare ones). A signal detection method was developed using these association rules, which also incorporates critical reaction terms. **Description:** Doctor of Philosophy
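
As background on what a disproportionality analysis computes, the sketch below evaluates the proportional reporting ratio (PRR) from a 2x2 table of report counts. PRR is a simpler measure than MGPS and is not named in the abstract; it is used here only as an assumed, illustrative stand-in, and the counts in the usage note are made up.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of spontaneous reports.

    a: reports with the drug and the reaction
    b: reports with the drug, without the reaction
    c: reports without the drug, with the reaction
    d: reports without the drug and without the reaction
    """
    rate_with_drug = a / (a + b)
    rate_without_drug = c / (c + d)
    return rate_with_drug / rate_without_drug
```

With hypothetical counts `prr(20, 80, 10, 890)`, the reaction appears in 20% of reports for the drug and in about 1.1% of all other reports, a ratio of about 18. A value well above 1 flags the drug-reaction pair for review; it does not by itself establish causation.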

Derivative-free hybrid methods in global optimization and their applications

**Authors:** Zhang, Jiapu **Date:** 2005 **Type:** Text, Thesis, PhD **Full Text:** **Description:** In recent years large-scale global optimization (GO) problems have drawn considerable attention. These problems have many applications, in particular in data mining and biochemistry. Numerical methods for GO are often very time consuming and cannot be applied to high-dimensional non-convex and/or non-smooth optimization problems. The thesis explores reasons why we need to develop and study new algorithms for solving large-scale GO problems .... The thesis presents several derivative-free hybrid methods for large-scale GO problems. These methods do not guarantee the calculation of a global solution; however, results of numerical experiments presented in this thesis demonstrate that they, as a rule, calculate a solution which is a global one or close to it. Their applications to data mining problems and the protein folding problem are demonstrated. **Description:** Doctor of Philosophy

Visual grouping of association rules for hypotheses suggestion

**Authors:** Ivkovic, Sasha **Date:** 2003 **Type:** Text, Thesis, Masters **Full Text:** **Description:** The study describes a KDD method that is being used by non-technical experts with minimal training to discover and interpret patterns that they find useful for their role within their organisations. **Description:** Master of Information Technology
