Applying clustering and ensemble clustering approaches to phishing profiling
- Authors: Webb, Dean , Yearwood, John , Vamplew, Peter , Ma, Liping , Ofoghi, Bahadorreza , Kelarev, Andrei
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at Eighth Australasian Data Mining Conference, AusDM 2009, University of Melbourne, Melbourne, Victoria : 1st–4th December 2009
- Full Text:
- Description: 2003007911
An efficient algorithm for the incremental construction of a piecewise linear classifier
- Authors: Bagirov, Adil , Ugon, Julien , Webb, Dean
- Date: 2011
- Type: Text , Journal article
- Relation: Information Systems Vol. 36, no. 4 (2011), p. 782-790
- Relation: http://purl.org/au-research/grants/arc/DP0666061
- Full Text: false
- Reviewed:
- Description: In this paper the problem of finding piecewise linear boundaries between sets is considered and applied to solving supervised data classification problems. An algorithm for the computation of piecewise linear boundaries, consisting of two main steps, is proposed. In the first step sets are approximated by hyperboxes to find so-called "indeterminate" regions between sets. In the second step sets are separated inside these "indeterminate" regions by piecewise linear functions. These functions are computed incrementally starting with a linear function. Results of numerical experiments are reported. These results demonstrate that the new algorithm requires a reasonable training time and produces consistently good test set accuracy on most data sets compared with mainstream classifiers. © 2010 Elsevier B.V. All rights reserved.
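Illustrative sketch (not taken from the paper): one simple way to read the first step of the abstract above is to approximate each set by its axis-aligned bounding hyperbox and to treat the overlap of the two boxes as the "indeterminate" region. The function names and the choice of bounding boxes are assumptions made for this example; the published algorithm may construct its hyperboxes differently.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]
Box = Tuple[List[float], List[float]]  # (lower corner, upper corner)


def bounding_box(points: Sequence[Point]) -> Box:
    """Smallest axis-aligned hyperbox containing all points (assumed approximation)."""
    dims = range(len(points[0]))
    lower = [min(p[d] for p in points) for d in dims]
    upper = [max(p[d] for p in points) for d in dims]
    return lower, upper


def box_intersection(a: Box, b: Box) -> Optional[Box]:
    """Overlap of two hyperboxes, or None when they are disjoint."""
    lower = [max(x, y) for x, y in zip(a[0], b[0])]
    upper = [min(x, y) for x, y in zip(a[1], b[1])]
    if any(lo > hi for lo, hi in zip(lower, upper)):
        return None
    return lower, upper


def inside(point: Point, box: Box) -> bool:
    """True when the point lies in the closed hyperbox."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, box[0], box[1]))


def indeterminate_points(set_a: Sequence[Point], set_b: Sequence[Point]):
    """Points of each set that fall in the overlap of the two bounding boxes."""
    region = box_intersection(bounding_box(set_a), bounding_box(set_b))
    if region is None:
        return [], []
    return (
        [p for p in set_a if inside(p, region)],
        [p for p in set_b if inside(p, region)],
    )
```

Under this reading, only the points returned by `indeterminate_points` would need to be passed to the second, boundary-fitting step; points outside the overlap are already separated by the boxes themselves.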
Fast modified global k-means algorithm for incremental cluster construction
- Authors: Bagirov, Adil , Ugon, Julien , Webb, Dean
- Date: 2011
- Type: Text , Journal article
- Relation: Pattern Recognition Vol. 44, no. 4 (2011), p. 866-876
- Relation: http://purl.org/au-research/grants/arc/DP0666061
- Full Text: false
- Reviewed:
- Description: The k-means algorithm and its variations are known to be fast clustering algorithms. However, they are sensitive to the choice of starting points and are inefficient for solving clustering problems in large datasets. Recently, incremental approaches have been developed to resolve difficulties with the choice of starting points. The global k-means and the modified global k-means algorithms are based on such an approach. They iteratively add one cluster center at a time. Numerical experiments show that these algorithms considerably improve the k-means algorithm. However, they require storing the whole affinity matrix or computing this matrix at each iteration. This makes both algorithms time consuming and memory demanding for clustering even moderately large datasets. In this paper, a new version of the modified global k-means algorithm is proposed. We introduce an auxiliary cluster function to generate a set of starting points lying in different parts of the dataset. We exploit information gathered in previous iterations of the incremental algorithm to eliminate the need of computing or storing the whole affinity matrix and thereby to reduce computational effort and memory usage. Results of numerical experiments on six standard datasets demonstrate that the new algorithm is more efficient than the global and the modified global k-means algorithms. © 2010 Elsevier Ltd. All rights reserved.
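Illustrative sketch (not taken from the paper): the incremental idea described above, adding one cluster center at a time, can be shown with the basic global k-means scheme, in which every data point is tried as the starting position of the new center. This is the plain baseline, not the fast modified algorithm of the paper: it does not use the auxiliary cluster function and makes no attempt to avoid recomputing distances. All function names are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def cluster_error(data: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in data)


def lloyd(data: Sequence[Point], centers: Sequence[Point], max_iter: int = 100) -> List[Point]:
    """Standard k-means refinement from the given starting centers."""
    centers = list(centers)
    for _ in range(max_iter):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in data:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def incremental_kmeans(data: Sequence[Point], k: int) -> List[Point]:
    """Grow the solution from 1 to k centers, one center per iteration."""
    centers = [mean(data)]
    while len(centers) < k:
        # Try every data point as the initial position of the new center.
        candidates = (lloyd(data, centers + [p]) for p in data)
        centers = min(candidates, key=lambda cs: cluster_error(data, cs))
    return centers
```

Trying every data point as a candidate is what makes this baseline expensive on large data sets, which is the cost the abstract says the new algorithm is designed to reduce.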
Efficient piecewise linear classifiers and applications
- Authors: Webb, Dean
- Date: 2011
- Type: Text , Thesis , PhD
- Full Text:
- Description: Supervised learning has become an essential part of data mining for industry, military, science and academia. Classification, a type of supervised learning, allows a machine to learn from data and then predict certain behaviours, variables or outcomes. Classification can be used to solve many problems including the detection of malignant cancers, potentially bad creditors and even enabling autonomy in robots. The ability to collect and store large amounts of data has increased significantly over the past few decades. However, the ability of classification techniques to deal with large scale data has not been matched. Many data transformation and reduction schemes have been tried with mixed success. This problem is further exacerbated when dealing with real time classification in embedded systems. The real time classifier must classify using only limited processing, memory and power resources. Piecewise linear boundaries are known to provide efficient real time classifiers. They have low memory requirements, require little processing effort, are parameterless and classify in real time. Piecewise linear functions are used to approximate non-linear decision boundaries between pattern classes. Finding these piecewise linear boundaries is a difficult optimization problem that can require a long training time. Multiple optimization approaches have been used for real time classification, but can lead to suboptimal piecewise linear boundaries. This thesis develops three real time piecewise linear classifiers that deal with large scale data. Each classifier uses a single optimization algorithm in conjunction with an incremental approach that reduces the number of points as the decision boundaries are built. Two of the classifiers further reduce complexity by augmenting the incremental approach with additional schemes. One scheme uses hyperboxes to identify points inside the so-called “indeterminate” regions. The other uses a polyhedral conic set to identify data points lying on or close to the boundary. All other points are excluded from the process of building the decision boundaries. The three classifiers are applied to real time data classification problems and the results of numerical experiments on real world data sets are reported. These results demonstrate that the new classifiers require a reasonable training time and their test set accuracy is consistently good on most data sets compared with current state of the art classifiers.
- Description: Doctor of Philosophy
A novel piecewise linear classifier based on polyhedral conic and max-min separabilities
- Authors: Bagirov, Adil , Ugon, Julien , Webb, Dean , Ozturk, Gurkan , Kasimbeyli, Refail
- Date: 2011
- Type: Text , Journal article
- Relation: TOP Vol. 21, no. 1 (2011), p. 1-22
- Full Text: false
- Reviewed:
- Description: In this paper, an algorithm for finding piecewise linear boundaries between pattern classes is developed. This algorithm consists of two main stages. In the first stage, a polyhedral conic set is used to identify data points which lie inside their classes, and in the second stage we exclude those points to compute a piecewise linear boundary using the remaining data points. Piecewise linear boundaries are computed incrementally starting with one hyperplane. Such an approach allows one to significantly reduce the computational effort in many large data sets. Results of numerical experiments are reported. These results demonstrate that the new algorithm consistently produces good test set accuracy on most data sets compared with a number of other mainstream classifiers. © 2011 Sociedad de Estadística e Investigación Operativa.
Compiling and using input-output frameworks through collaborative virtual laboratories
- Authors: Lenzen, Manfred , Geschke, Arne , Wiedmann, Thomas , Lane, Joe , Anderson, Neal , Baynes, Timothy , Boland, John , Daniels, Peter , Dey, Christopher , Fry, Jacob , Hadjikakou, Michalis , Kenway, Steven , Malik, Arunima , Moran, Daniel , Murray, Joy , Nettleton, Stuart , Poruschi, Lavinia , Reynolds, Christian , Rowley, Hazel , Ugon, Julien , Webb, Dean , West, James
- Date: 2014
- Type: Text , Journal article
- Relation: Science of the Total Environment Vol. 485-486, no. 1 (July 2014), p. 241-251
- Full Text: false
- Reviewed:
- Description: Compiling, deploying and utilising large-scale databases that integrate environmental and economic data have traditionally been labour- and cost-intensive processes, hindered by the large amount of disparate and misaligned data that must be collected and harmonised. The Australian Industrial Ecology Virtual Laboratory (IELab) is a novel, collaborative approach to compiling large-scale environmentally extended multi-region input-output (MRIO) models. The utility of the IELab product is greatly enhanced by avoiding the need to lock in an MRIO structure at the time the MRIO system is developed. The IELab advances the idea of the "mother-daughter" construction principle, whereby a regionally and sectorally very detailed "mother" table is set up, from which "daughter" tables are derived to suit specific research questions. By introducing a third tier - the "root classification" - IELab users are able to define their own mother-MRIO configuration, at no additional cost in terms of data handling. Customised mother-MRIOs can then be built, which maximise disaggregation in aspects that are useful to a family of research questions. The second innovation in the IELab system is to provide a highly automated collaborative research platform in a cloud-computing environment, greatly expediting workflows and making these computational benefits accessible to all users. Combining these two aspects realises many benefits. The collaborative nature of the IELab development project allows significant savings in resources. Timely deployment is possible by coupling automation procedures with the comprehensive input from multiple teams. User-defined MRIO tables, coupled with high performance computing, mean that MRIO analysis will be useful and accessible for a great many more research applications than would otherwise be possible. By ensuring that a common set of analytical tools such as for hybrid life-cycle assessment is adopted, the IELab will facilitate the harmonisation of fragmented, dispersed and misaligned raw data for the benefit of all interested parties.
Classification through incremental max-min separability
- Authors: Bagirov, Adil , Ugon, Julien , Webb, Dean , Karasozen, Bulent
- Date: 2011
- Type: Text , Journal article
- Relation: Pattern Analysis and Applications Vol. 14, no. 2 (2011), p. 165-174
- Relation: http://purl.org/au-research/grants/arc/DP0666061
- Full Text: false
- Reviewed:
- Description: Piecewise linear functions can be used to approximate non-linear decision boundaries between pattern classes. Piecewise linear boundaries are known to provide efficient real-time classifiers. However, they require a long training time. Finding piecewise linear boundaries between sets is a difficult optimization problem. Most approaches use heuristics to avoid solving this problem, which may lead to suboptimal piecewise linear boundaries. In this paper, we propose an algorithm for globally training hyperplanes using an incremental approach. Such an approach allows one to find a near global minimizer of the classification error function and to compute as few hyperplanes as needed for separating sets. We apply this algorithm for solving supervised data classification problems and report the results of numerical experiments on real-world data sets. These results demonstrate that the new algorithm requires a reasonable training time and its test set accuracy is consistently good on most data sets compared with mainstream classifiers. © 2010 Springer-Verlag London Limited.
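Illustrative sketch (not taken from the paper): once a piecewise linear boundary has been trained, classifying a point only requires evaluating a max-min of affine functions, which is why such classifiers are cheap at prediction time. The sketch below assumes the common max-min form, a maximum over groups of the minimum over each group's hyperplanes, with negative values assigned to class "A" and the rest to class "B". The sign convention, the hyperplane values and all names are assumptions made for this example; training is not shown.

```python
from typing import List, Sequence, Tuple

# A hyperplane is (weights, offset) and evaluates to <weights, x> - offset.
Hyperplane = Tuple[Sequence[float], float]


def affine(x: Sequence[float], plane: Hyperplane) -> float:
    """Value of one affine function at x."""
    weights, offset = plane
    return sum(w * xi for w, xi in zip(weights, x)) - offset


def max_min_value(x: Sequence[float], groups: List[List[Hyperplane]]) -> float:
    """Max over groups of the min over the hyperplanes in each group."""
    return max(min(affine(x, plane) for plane in group) for group in groups)


def classify(x: Sequence[float], groups: List[List[Hyperplane]]) -> str:
    """Assumed convention: negative max-min value means class A, otherwise B."""
    return "A" if max_min_value(x, groups) < 0 else "B"


# Hypothetical boundary in the plane: class A is where (x1 < 0 or x2 < 0)
# and x1 + x2 < 10; everything else is class B.
example_groups: List[List[Hyperplane]] = [
    [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)],
    [((1.0, 1.0), 10.0)],
]
```

With `example_groups`, `classify((-1.0, 2.0), example_groups)` evaluates two groups and three hyperplanes in total; prediction cost grows only with the number of hyperplanes, not with the size of the training set.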
Profiling phishing activity based on hyperlinks extracted from phishing emails
- Authors: Yearwood, John , Mammadov, Musa , Webb, Dean
- Date: 2012
- Type: Text , Journal article
- Relation: Social Network Analysis and Mining Vol. 2, no. 1 (2012), p. 5-16
- Full Text: false
- Reviewed:
- Description: Phishing activity has recently been focused on social networking sites as a more effective way of exploiting not only the technology but also the trust that may exist between members in a social network. In this paper, a novel method for profiling phishing activity from an analysis of phishing emails is proposed. Profiling is useful in determining the activity of an individual or a particular group of phishers. Work in the area of phishing is usually aimed at detection of phishing emails. In this paper, we concentrate on profiling as distinct from detection of phishing emails. We formulate the profiling problem as a multi-label classification problem using the hyperlinks in the phishing emails as features and structural properties of emails along with whois (i.e. DNS) information on hyperlinks as profile classes. Further, we generate profiles based on the classifier predictions. Thus, classes become elements of profiles. We employ a boosting algorithm (AdaBoost) as well as SVM to generate multi-label class predictions on three different datasets created from hyperlink information in phishing emails. These predictions are further utilized to generate complete profiles of these emails. Results show that profiling can be done with quite high accuracy using hyperlink information.
A new modified global k-means algorithm for clustering large data sets
- Authors: Bagirov, Adil , Ugon, Julien , Webb, Dean
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at XIIIth International Conference : Applied Stochastic Models and Data Analysis, ASMDA 2009, Vilnius, Lithuania : 30th June - 3rd July 2009 p. 1-5
- Full Text: false
- Description: The k-means algorithm and its variations are known to be fast clustering algorithms. However, they are sensitive to the choice of starting points and inefficient for solving clustering problems in large data sets. Recently, in order to resolve difficulties with the choice of starting points, incremental approaches have been developed. The modified global k-means algorithm is based on such an approach. It iteratively adds one cluster center at a time. Numerical experiments show that this algorithm considerably improves the k-means algorithm. However, this algorithm is not suitable for clustering very large data sets. In this paper, a new version of the modified global k-means algorithm is proposed. We introduce an auxiliary cluster function to generate a set of starting points spanning different parts of the data set. We exploit information gathered in previous iterations of the incremental algorithm to reduce its complexity.
- Description: 2003007558
An incremental approach for the construction of a piecewise linear classifier
- Authors: Bagirov, Adil , Ugon, Julien , Webb, Dean
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at XIIIth International Conference : Applied Stochastic Models and Data Analysis, ASMDA 2009, Vilnius, Lithuania : 30th June - 3rd July 2009 p. 507–511
- Relation: https://purl.org/au-research/grants/arc/DP0666061
- Full Text: false
- Description: In this paper the problem of finding piecewise linear boundaries between sets is considered and applied to solving supervised data classification problems. An algorithm for the computation of piecewise linear boundaries, consisting of two main steps, is proposed. In the first step sets are approximated by hyperboxes to find so-called “indeterminate” regions between sets. In the second step sets are separated inside these “indeterminate” regions by piecewise linear functions. These functions are computed incrementally starting with a linear function. Results of numerical experiments are reported. These results demonstrate that the new algorithm requires a reasonable training time and produces consistently good test set accuracy on most data sets compared with mainstream classifiers.
- Description: 2003007559