New algorithms for multi-class cancer diagnosis using tumor gene expression signatures
- Bagirov, Adil, Ferguson, Brent, Ivkovic, Sasha, Saunders, Gary, Yearwood, John
- Authors: Bagirov, Adil , Ferguson, Brent , Ivkovic, Sasha , Saunders, Gary , Yearwood, John
- Date: 2003
- Type: Text , Journal article
- Relation: Bioinformatics Vol. 19, no. 14 (2003), p. 1800-1807
- Full Text:
- Reviewed:
- Description: Motivation: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. Results: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step by step. This approach allows us to find as many clusters as a data set contains, with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used contains over 16,000 genes, and this number is considerably reduced by feature selection. We propose a classification algorithm in which each tissue sample is considered the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm, in combination with the feature selection algorithm, performs slightly better than the published results for multi-class classifiers based on support vector machines for this data set.
- Description: C1
- Description: 2003000439
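The ball-centered classification idea in the abstract above, where each tissue sample acts as the center of a ball-shaped cluster, can be sketched roughly as follows. The paper's exact radius and tie-breaking rules are not reproduced here; this minimal sketch simply assigns a query the label of its nearest training center.

```python
import math

def classify_ball_centers(train, x):
    """Assign x the label of the nearest training sample (ball center).

    `train` is a list of (point, label) pairs; each point is treated as the
    center of a ball-shaped cluster, as in the abstract above.  The exact
    radius rule used in the paper is not given here, so this sketch falls
    back to the nearest center.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(train, key=lambda pair: dist(pair[0], x))[1]

train = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]
print(classify_ball_centers(train, (1.0, 0.5)))  # nearest center belongs to class "A"
```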
RBACS : Rootkit behavioral analysis and classification system
- Lobo, Desmond, Watters, Paul, Wu, Xinwen
- Authors: Lobo, Desmond , Watters, Paul , Wu, Xinwen
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 3rd International Conference on Knowledge Discovery and Data Mining, WKDD 2010, Phuket : 9th-10th January 2010 p. 75-80
- Full Text:
- Description: In this paper, we focus on rootkits, a special type of malicious software (malware) that operates in an obfuscated and stealthy mode to evade detection. Categorizing these rootkits will help in detecting future attacks against the business community. We first developed a theoretical framework for classifying rootkits. Based on our theoretical framework, we then proposed a new rootkit classification system and tested our system on a sample of rootkits that use inline function hooking. Our experimental results showed that our system could successfully categorize the sample using unsupervised clustering. © 2010 IEEE.
Experimental investigation of three machine learning algorithms for ITS dataset
- Yearwood, John, Kang, Byeongho, Kelarev, Andrei
- Authors: Yearwood, John , Kang, Byeongho , Kelarev, Andrei
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at First International Conference, FGIT 2009, Future Generation Information Technology, Jeju Island, Korea : 10th-12th December 2009 Vol. 5899, p. 308-316
- Full Text:
- Description: This article is devoted to an experimental investigation of the performance of three machine learning algorithms on the ITS dataset, in terms of their ability to achieve agreement with classes previously published in the biological literature. The ITS dataset consists of nuclear ribosomal DNA sequences, where rather sophisticated alignment scores have to be used as a measure of distance. These scores do not form a Minkowski metric, and the sequences cannot be regarded as points in a finite-dimensional space. This is why it is necessary to develop novel machine learning approaches to the analysis of datasets of this sort. This paper introduces a k-committees classifier and compares it with the discrete k-means and Nearest Neighbour classifiers. It turns out that all three machine learning algorithms are efficient and can be used to automate future biologically significant classifications for datasets of this kind. A simplified version of a synthetic dataset, on which the k-committees classifier outperforms the k-means and Nearest Neighbour classifiers, is also presented.
- Description: 2003007844
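A key point in the abstract above is that alignment scores do not form a Minkowski metric, so classifiers must work from pairwise distances alone rather than coordinates. A minimal sketch of a Nearest Neighbour decision from a precomputed distance row (the distance values and clade labels below are purely illustrative, not the paper's data):

```python
def nn_classify(dist_row, labels):
    """Nearest-neighbour classification from a precomputed distance row.

    For sequence data whose alignment scores do not form a Minkowski metric
    (as with the ITS dataset), a classifier can operate on pairwise
    distances alone.  `dist_row[i]` is the distance from the query to
    training item i, and `labels[i]` is that item's class.
    """
    i = min(range(len(dist_row)), key=dist_row.__getitem__)
    return labels[i]

# Hypothetical alignment-score distances from a query to four sequences.
distances = [0.9, 0.2, 0.7, 0.4]
labels = ["cladeA", "cladeB", "cladeA", "cladeC"]
print(nn_classify(distances, labels))  # the closest sequence is item 1
```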
Feature selection using misclassification counts
- Bagirov, Adil, Yatsko, Andrew, Stranieri, Andrew
- Authors: Bagirov, Adil , Yatsko, Andrew , Stranieri, Andrew
- Date: 2011
- Type: Conference proceedings , Unpublished work
- Relation: Proceedings of the 9th Australasian Data Mining Conference (AusDM 2011), 51-62. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 121.
- Full Text:
- Description: Dimensionality reduction of the problem space, through detection and removal of variables that contribute little or nothing to classification, can relieve the computational load and the instance acquisition effort, given that all data attributes are accessed each time around. The approach to feature selection in this paper is based on the concept of coherent accumulation of data about class centers with respect to the coordinates of informative features. Ranking is done on the degree to which different variables exhibit random characteristics. The results are verified using the Nearest Neighbor classifier. This also helps to address feature irrelevance and redundancy, which ranking alone does not decide. Additionally, feature ranking methods from different independent sources are used for direct comparison.
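The general pattern of ranking features and checking the ranking with a Nearest Neighbor classifier can be sketched as below. Note the scoring criterion here, leave-one-out 1-NN misclassification counts per single feature, is only an illustrative stand-in; the paper's actual criterion, based on coherent accumulation of data about class centers, is different.

```python
def rank_features_by_misclassification(data, labels):
    """Rank features by leave-one-out 1-NN misclassification count.

    A sketch of the general idea only: score each feature by how many
    errors a simple Nearest Neighbor rule makes using that feature alone,
    then sort features from fewest to most errors.
    """
    n, d = len(data), len(data[0])

    def errors(f):
        count = 0
        for i in range(n):
            # Nearest other sample along feature f alone.
            j = min((k for k in range(n) if k != i),
                    key=lambda k: abs(data[k][f] - data[i][f]))
            if labels[j] != labels[i]:
                count += 1
        return count

    return sorted(range(d), key=errors)

# Feature 0 separates the classes; feature 1 interleaves them.
data = [(0.0, 3.1), (0.1, 9.2), (1.0, 3.0), (1.1, 9.1)]
labels = ["a", "a", "b", "b"]
print(rank_features_by_misclassification(data, labels))  # feature 0 ranks first
```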
Max-min separability
- Authors: Bagirov, Adil
- Date: 2005
- Type: Text , Journal article
- Relation: Optimization Methods and Software Vol. 20, no. 2-3 (2005), p. 271-290
- Full Text:
- Reviewed:
- Description: We consider the problem of discriminating two finite point sets in the n-dimensional space by a finite number of hyperplanes generating a piecewise linear function. If the intersection of these sets is empty, then they can be strictly separated by a max-min of linear functions. An error function is introduced. This function is nonconvex piecewise linear. We discuss an algorithm for its minimization. The results of numerical experiments using some real-world datasets are presented, which show the effectiveness of the proposed approach.
- Description: C1
- Description: 2003001350
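The max-min separating function described in the abstract above can be evaluated with a short sketch: the separating surface is built from groups of hyperplanes, taking the minimum of the linear functions within each group and the maximum across groups. The particular hyperplane coefficients below are illustrative only.

```python
def max_min_value(groups, x):
    """Evaluate a max-min of linear functions at point x.

    `groups` is a list of groups of hyperplanes; each hyperplane is a pair
    (w, b) defining the linear function w.x + b.  A point lies on the
    positive side of the piecewise-linear separator when the maximum over
    groups of the minimum within each group is non-negative.
    """
    def lin(w, b, pt):
        return sum(wi * xi for wi, xi in zip(w, pt)) + b
    return max(min(lin(w, b, x) for (w, b) in group) for group in groups)

# Two hypothetical groups: positive region is {x>=1 and y>=1} union {x<=-2}.
groups = [[((1.0, 0.0), -1.0), ((0.0, 1.0), -1.0)],
          [((-1.0, 0.0), -2.0)]]
print(max_min_value(groups, (2.0, 2.0)) >= 0)   # inside the first region
print(max_min_value(groups, (0.0, 0.0)) >= 0)   # outside both regions
```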
A polynomial ring construction for the classification of data
- Kelarev, Andrei, Yearwood, John, Vamplew, Peter
- Authors: Kelarev, Andrei , Yearwood, John , Vamplew, Peter
- Date: 2009
- Type: Text , Journal article
- Relation: Bulletin of the Australian Mathematical Society Vol. 79, no. 2 (2009), p. 213-225
- Full Text:
- Reviewed:
- Description: Drensky and Lakatos (Lecture Notes in Computer Science, 357 (Springer, Berlin, 1989), pp. 181-188) have established a convenient property of certain ideals in polynomial quotient rings, which can now be used to determine error-correcting capabilities of combined multiple classifiers following a standard approach explained in the well-known monograph by Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques (Elsevier, Amsterdam, 2005)). We strengthen and generalise the result of Drensky and Lakatos by demonstrating that the corresponding nice property remains valid in a much larger variety of constructions and applies to more general types of ideals. Examples show that our theorems do not extend to larger classes of ring constructions and cannot be simplified or generalised.
Using links to aid web classification
- Xie, Wei, Mammadov, Musa, Yearwood, John
- Authors: Xie, Wei , Mammadov, Musa , Yearwood, John
- Date: 2007
- Type: Text , Conference paper
- Relation: Paper presented at 6th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2007, Melbourne, Victoria : 11th-13th July 2007 p. 981-986
- Full Text:
- Description: In this paper, we present a new approach to using link information to improve the accuracy and efficiency of web classification. Unlike other approaches, we use only the mappings between linked documents and their own class or classes. In this case, we only need to add a few features, called linked-class features, to the datasets. We apply SVM and BoosTexter for classification. We show that the classification accuracy can be improved with mixtures of ordinary word features and out-linked-class features. We analyze and discuss the reasons for this improvement.
- Description: 2003005438
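The linked-class feature idea above, augmenting a document's ordinary word features with features derived from the known classes of out-linked documents, can be sketched as below. The feature naming scheme and example URLs are illustrative assumptions, not the paper's notation.

```python
def add_linked_class_features(doc_words, out_links, link_classes):
    """Augment a bag-of-words with linked-class features.

    Following the idea in the abstract, each out-linked document contributes
    a feature named after its known class, so only the class of the linked
    page (not its full text) enters the representation.
    """
    features = dict(doc_words)
    for url in out_links:
        for cls in link_classes.get(url, []):
            key = "outlink_class:" + cls
            features[key] = features.get(key, 0) + 1
    return features

doc = {"python": 2, "tutorial": 1}
links = ["http://example.org/a", "http://example.org/b"]
known = {"http://example.org/a": ["programming"],
         "http://example.org/b": ["programming", "education"]}
print(add_linked_class_features(doc, links, known))
```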
Classification of HTML Documents
- Xie, Wei
- Authors: Xie, Wei
- Date: 2006
- Type: Text , Thesis , PhD
- Full Text:
- Description: Text Classification is the task of mapping a document into one or more classes based on the presence or absence of words (or features) in the document. It has been intensively studied, and different classification techniques and algorithms have been developed. This thesis focuses on the classification of online documents, which has become more critical with the development of the World Wide Web. The WWW vastly increases the availability of on-line documents in digital format and has highlighted the need to classify them. Against this background, we have noted the emergence of “automatic Web Classification”. Such approaches mainly concentrate on classifying HTML-like documents into classes or categories, not only using methods inherited from the traditional Text Classification process but also utilizing the extra information provided only by Web pages. Our work is based on the fact that Web documents contain not only ordinary features (words) but also extra information, such as meta-data and hyperlinks, that can be used to aid the classification process. The aim of this research is to study various ways of using this extra information, in particular the hyperlink information provided by HTML documents (Web pages). The merit of the approach developed in this thesis is its simplicity compared with existing approaches. We present different approaches to using hyperlink information to improve the effectiveness of web classification. Unlike other work in this area, we use only the mappings between linked documents and their own class or classes. In this case, we only need to add a few features, called linked-class features, to the datasets, and then apply classifiers to them for classification. In the numerical experiments we adopted two well-known Text Classification algorithms, Support Vector Machines and BoosTexter. The results obtained show that classification accuracy can be improved by using mixtures of ordinary and linked-class features. Moreover, out-links usually work better than in-links in classification. We also analyse and discuss the reasons behind this improvement.
- Description: Master of Computing
The seven scam types: Mapping the terrain of cybercrime
- Stabek, Amber, Watters, Paul, Layton, Robert
- Authors: Stabek, Amber , Watters, Paul , Layton, Robert
- Date: 2010
- Type: Text , Conference proceedings
- Full Text:
- Description: The threat of cybercrime is a growing danger to the economy. Industries and businesses are targeted by cyber-criminals, along with members of the general public. Since cybercrime is often a symptom of more complex criminological regimes such as laundering, trafficking and terrorism, the true damage caused to society is unknown. Dissimilarities in reporting procedures and non-uniform cybercrime classifications lead international reporting bodies to produce incompatible results, which makes valid comparisons difficult. A cybercrime classification framework has been identified as necessary for the development of an inter-jurisdictional, transnational, and global approach to identify, intercept, and prosecute cyber-criminals. Outlined in this paper is a cybercrime classification framework which has been applied to the incidence of scams. Content analysis was performed on over 250 scam descriptions stemming from in excess of 35 scamming categories, and over 80 static features were derived. Using hierarchical cluster and discriminant function analysis, the sample was reduced from over 35 ambiguous categories into 7 scam types, and the top four scamming functions, identified as scamming business processes, were revealed. The results of this research bear significant ramifications for the current state of scam and cybercrime classification, research and analysis, and offer significant insight into the business processes and applications adopted by scammers and cyber-criminals. © 2010 IEEE.
The case for a consistent cyberscam classification framework (CCCF)
- Stabek, Amber, Brown, Simon, Watters, Paul
- Authors: Stabek, Amber , Brown, Simon , Watters, Paul
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at UIC-ATC 2009 - Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing in Conjunction with the UIC'09 and ATC'09 Conferences, Brisbane : 7th-9th July 2009 p. 525-530
- Full Text:
- Description: Cyberscam classification schemes developed by international statistical reporting bodies, including the Bureau of Statistics (Australia), the Internet Crime Complaint Center (US), and the Environics Research Group (Canada), are diverse and largely incompatible. This makes comparisons of cyberscam incidence across jurisdictions very difficult. This paper argues that the critical first step towards the development of an inter-jurisdictional and global approach to identify and intercept cyberscams - and prosecute scammers - is a uniform classification system. © 2009 IEEE.
From convex to nonconvex: A loss function analysis for binary classification
- Zhao, Lei, Mammadov, Musa, Yearwood, John
- Authors: Zhao, Lei , Mammadov, Musa , Yearwood, John
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010 p. 1281-1288
- Full Text:
- Reviewed:
- Description: Problems of data classification can be studied in the framework of regularization theory as ill-posed problems. In this framework, loss functions play an important role in the application of regularization theory to classification. In this paper, we review some important convex loss functions, including the hinge loss, square loss, modified square loss, exponential loss and logistic regression loss, as well as some non-convex loss functions, such as the sigmoid loss, ø-loss, ramp loss, normalized sigmoid loss, and the loss function of a 2-layer neural network. Based on the analysis of these loss functions, we propose a new differentiable non-convex loss function, called the smoothed 0-1 loss function, which is a natural approximation of the 0-1 loss function. To compare the performance of different loss functions, we propose two algorithms for binary classification, one for convex loss functions and the other for non-convex loss functions. A set of experiments is conducted on several binary data sets from the UCI repository. The results show that the proposed smoothed 0-1 loss function is robust, especially for noisy data sets with many outliers. © 2010 IEEE.
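A few of the loss functions surveyed in the abstract above can be written down directly as functions of the margin y*f. The smoothed 0-1 loss shown here is a generic sigmoid-based surrogate with a hypothetical sharpness parameter `a`, not necessarily the exact form proposed in the paper.

```python
import math

def hinge(y, f):
    """Hinge loss: zero once the margin y*f reaches 1."""
    return max(0.0, 1.0 - y * f)

def logistic(y, f):
    """Logistic regression loss, a smooth convex surrogate."""
    return math.log(1.0 + math.exp(-y * f))

def smoothed_01(y, f, a=5.0):
    """A sigmoid-based smooth approximation of the 0-1 loss.

    Tends to 1 when the margin y*f is very negative and to 0 when it is
    very positive; `a` controls how sharply it switches.  This is one
    common differentiable surrogate, used here only for illustration.
    """
    return 1.0 / (1.0 + math.exp(a * y * f))

# Compare the three losses at a few margins for a positive example (y = 1).
for f in (-2.0, 0.0, 2.0):
    print(round(hinge(1, f), 3), round(logistic(1, f), 3),
          round(smoothed_01(1, f), 3))
```

A non-convex loss like `smoothed_01` saturates for badly misclassified points, which is why it is less sensitive to outliers than the unbounded hinge or logistic losses.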
Data mining with combined use of optimization techniques and self-organizing maps for improving risk grouping rules : Application to prostate cancer patients
- Churilov, Leonid, Bagirov, Adil, Schwartz, Daniel, Smith, Kate, Dally, Michael
- Authors: Churilov, Leonid , Bagirov, Adil , Schwartz, Daniel , Smith, Kate , Dally, Michael
- Date: 2005
- Type: Text , Journal article
- Relation: Journal of Management Information Systems Vol. 21, no. 4 (2005), p. 85-100
- Full Text:
- Reviewed:
- Description: Data mining techniques provide a popular and powerful tool set for generating various data-driven classification systems. In this paper, we investigate the combined use of self-organizing maps (SOM) and nonsmooth nonconvex optimization techniques in order to produce a working case of a data-driven risk classification system. The optimization approach strengthens the validity of the SOM results, and the improved classification system increases both the quality of prediction and the homogeneity within the risk groups. Accurate classification of prostate cancer patients into risk groups is important to assist in the identification of appropriate treatment paths. We start with the existing rules and aim to improve classification accuracy by identifying inconsistencies, utilizing self-organizing maps as a data visualization tool. Then, we progress to the study of assigning prostate cancer patients to homogeneous groups with the aim of supporting future clinical treatment decisions. Using the case of prostate cancer patient grouping, we demonstrate the strong potential of data-driven risk classification schemes for addressing risk grouping issues in more general organizational settings. © 2005 M.E. Sharpe, Inc.
- Description: C1
- Description: 2003001265
Cayley graphs as classifiers for data mining : The influence of asymmetries
- Kelarev, Andrei, Ryan, Joe, Yearwood, John
- Authors: Kelarev, Andrei , Ryan, Joe , Yearwood, John
- Date: 2009
- Type: Text , Journal article
- Relation: Discrete Mathematics Vol. 309, no. 17 (2009), p. 5360-5369
- Relation: http://purl.org/au-research/grants/arc/DP0211866
- Full Text:
- Reviewed:
- Description: The endomorphism monoids of graphs have been actively investigated. They are convenient tools for expressing asymmetries of graphs. One of the most important classes of graphs considered in this framework is that of Cayley graphs. Our paper proposes a new method of using Cayley graphs for the classification of data. We also give a survey of recent results on Cayley graphs, including those involving their endomorphism monoids. © 2008 Elsevier B.V. All rights reserved.
Nonsmooth optimisation approach to data classification
- Bagirov, Adil, Soukhoroukova, Nadejda
- Authors: Bagirov, Adil , Soukhoroukova, Nadejda
- Date: 2001
- Type: Text , Conference paper
- Relation: Paper presented at Post-graduate ADFA Conference for Computer Science, PACCS01, Canberra, Australian Capital Territory : 14th July 2001
- Full Text:
- Description: We reduce supervised classification to solving a nonsmooth optimization problem. The proposed method allows one to solve classification problems for databases with an arbitrary number of classes. Numerical experiments have been carried out with databases of small and medium size. We present their results and compare them with those obtained by other classification algorithms based on optimization techniques. The results of the numerical experiments show the effectiveness of the proposed algorithms.
- Description: 2003003668
A formula for multiple classifiers in data mining based on Brandt semigroups
- Kelarev, Andrei, Yearwood, John, Mammadov, Musa
- Authors: Kelarev, Andrei , Yearwood, John , Mammadov, Musa
- Date: 2009
- Type: Text , Journal article
- Relation: Semigroup Forum Vol. 78, no. 2 (2009), p. 293-309
- Full Text:
- Reviewed:
- Description: A general approach to designing multiple classifiers represents them as a combination of several binary classifiers in order to enable correction of classification errors and increase reliability. This method is explained, for example, in Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques, 2005, Sect. 7.5). The aim of this paper is to investigate representations of this sort based on Brandt semigroups. We give a formula for the maximum number of errors of binary classifiers, which can be corrected by a multiple classifier of this type. Examples show that our formula does not carry over to larger classes of semigroups. © 2008 Springer Science+Business Media, LLC.
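The record above concerns representing a multi-class classifier as a combination of binary classifiers so that some binary errors can be corrected. As a minimal sketch of that general idea — using standard error-correcting output codes rather than the paper's Brandt-semigroup construction, with purely illustrative codewords — nearest-codeword decoding corrects up to floor((d-1)/2) bit errors, where d is the minimum Hamming distance between class codewords:

```python
# Each class is assigned a codeword; each bit position corresponds to one
# binary classifier. These codewords are illustrative, not from the paper.
CODEWORDS = {
    "A": (0, 0, 0, 0, 0),
    "B": (1, 1, 1, 0, 0),
    "C": (1, 0, 0, 1, 1),
}

def hamming(u, v):
    # Number of positions in which two bit tuples differ.
    return sum(a != b for a, b in zip(u, v))

def min_distance(codewords):
    # Minimum pairwise Hamming distance between class codewords.
    codes = list(codewords.values())
    return min(hamming(u, v)
               for i, u in enumerate(codes) for v in codes[i + 1:])

def decode(bits, codewords):
    # Assign the class whose codeword is nearest in Hamming distance,
    # so up to floor((d - 1) / 2) flipped bits are corrected.
    return min(codewords, key=lambda c: hamming(codewords[c], bits))
```

With the codewords above, d = 3, so any single binary classifier error is corrected: the output (1, 1, 1, 0, 1) still decodes to class "B".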
Coding OSICS sports injury diagnoses in epidemiological studies : Does the background of the coder matter?
- Finch, Caroline, Orchard, John, Twomey, Dara, Saleem, Muhammad Saad, Ekegren, Christina, Lloyd, David, Elliott, Bruce
- Authors: Finch, Caroline , Orchard, John , Twomey, Dara , Saleem, Muhammad Saad , Ekegren, Christina , Lloyd, David , Elliott, Bruce
- Date: 2012
- Type: Text , Journal article
- Relation: British Journal of Sports Medicine, Vol. 48, p. 552-556
- Relation: http://purl.org/au-research/grants/nhmrc/565900
- Full Text:
- Reviewed:
- Description: Objective: To compare Orchard Sports Injury Classification System (OSICS-10) sports medicine diagnoses assigned by a clinical and a non-clinical coder. Design: Assessment of intercoder agreement. Setting: Community Australian football. Participants: 1082 standardised injury surveillance records. Main outcome measurements: Direct comparison of the four-character hierarchical OSICS-10 codes assigned by two independent coders (a sports physician and an epidemiologist). Adjudication by a third coder (biomechanist). Results: The coders agreed on the first character 95% of the time and on the first two characters 86% of the time. They assigned the same four-digit OSICS-10 code for only 46% of the 1082 injuries. The majority of disagreements occurred for the third character; 85% were because one coder assigned a non-specific 'X' code. The sports physician's code was deemed correct in 53% of cases and the epidemiologist's in 44%. Reasons for disagreement included the physician not using all of the collected information and the epidemiologist lacking specific anatomical knowledge. Conclusions: Sports injury research requires accurate identification and classification of specific injuries, and this study found an overall high level of agreement in coding according to OSICS-10. That the majority of disagreements occurred for the third OSICS character highlights that increasing complexity and diagnostic specificity in injury coding can reduce reliability and demands a high level of anatomical knowledge. Injury report form details need to reflect this level of complexity, and data management teams need to include a broad range of expertise. Copyright Article author (or their employer) 2012.
Detecting K-complexes for sleep stage identification using nonsmooth optimization
- Moloney, David, Sukhorukova, Nadezda, Vamplew, Peter, Ugon, Julien, Li, Gang, Beliakov, Gleb, Philippe, Carole, Amiel, Hélène, Ugon, Adrien
- Authors: Moloney, David , Sukhorukova, Nadezda , Vamplew, Peter , Ugon, Julien , Li, Gang , Beliakov, Gleb , Philippe, Carole , Amiel, Hélène , Ugon, Adrien
- Date: 2012
- Type: Text , Journal article
- Relation: ANZIAM Journal Vol. 52, no. 4 (2012), p. 319-332
- Full Text:
- Reviewed:
- Description: The process of sleep stage identification is a labour-intensive task that involves the specialized interpretation of the polysomnographic signals captured from a patient's overnight sleep session. Automating this task has proven to be challenging for data mining algorithms because of noise, complexity and the extreme size of the data. In this paper we apply nonsmooth optimization to extract key features that lead to better accuracy. We develop a specific procedure for identifying K-complexes, a special type of brain wave crucial for distinguishing sleep stages. The procedure contains two steps. We first extract "easily classified" K-complexes, and then apply nonsmooth optimization methods to extract features from the remaining data and refine the results from the first step. Numerical experiments show that this procedure is efficient for detecting K-complexes. It is also found that most classification methods perform significantly better on the extracted features. © 2012 Australian Mathematical Society.
Hybrid technique for colour image classification and efficient retrieval based on fuzzy logic and neural networks
- Fernando, Ranisha, Kulkarni, Siddhivinayak
- Authors: Fernando, Ranisha , Kulkarni, Siddhivinayak
- Date: 2012
- Type: Text , Conference proceedings
- Full Text:
- Description: Developments in technology and the Internet have led to an increase in the number of digital images and videos, with thousands of images added to the WWW every day. Retrieving specific images efficiently from a database or from the Internet is becoming a challenge, and the ability to retrieve images has become important in various professional areas. This paper proposes a novel fuzzy approach to classify colour images based on their content, to pose queries in natural language, and to fuse queries based on neural networks for fast and efficient retrieval. A number of experiments were conducted for classification and retrieval on sets of images, and promising results were obtained. The results were analysed and compared with other similar image retrieval systems. © 2012 IEEE.
Application of rank correlation, clustering and classification in information security
- Beliakov, Gleb, Yearwood, John, Kelarev, Andrei
- Authors: Beliakov, Gleb , Yearwood, John , Kelarev, Andrei
- Date: 2012
- Type: Text , Journal article
- Relation: Journal of Networks Vol. 7, no. 6 (2012), p. 935-945
- Full Text:
- Reviewed:
- Description: This article presents an experimental investigation of a novel application of a clustering technique recently introduced by the authors. The aim is to use robust and stable consensus functions in information security, where it is often necessary to process large data sets and monitor outcomes in real time, as is required, for example, for intrusion detection. Here we concentrate on the particular case of profiling phishing websites. First, we apply several independent clustering algorithms to a randomized sample of data to obtain independent initial clusterings. The silhouette index is used to determine the number of clusters. Second, rank correlation is used to select a subset of features for dimensionality reduction. We investigate the effectiveness of the Pearson linear correlation coefficient, the Spearman rank correlation coefficient and the Goodman-Kruskal correlation coefficient in this application. Third, we use a consensus function to combine the independent initial clusterings into one consensus clustering. Fourth, we train fast supervised classification algorithms on the resulting consensus clustering to enable them to process the whole large data set as well as new data. The precision and recall of the classifiers at the final stage of this scheme are critical for the effectiveness of the whole procedure. We investigated various combinations of several correlation coefficients, consensus functions, and a variety of supervised classification algorithms. © 2012 Academy Publisher.
- Description: 2003010277
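The pipeline described above combines several independent clusterings through a consensus function. A minimal sketch of one standard consensus function — evidence accumulation via a co-association matrix, not necessarily the specific function the authors used; the linking threshold is an illustrative assumption:

```python
def co_association(labelings):
    # Fraction of the independent clusterings in which each pair of
    # points is assigned to the same cluster.
    n = len(labelings[0])
    m = len(labelings)
    C = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / m
    return C

def consensus_clusters(labelings, threshold=0.5):
    # Consensus clusters are the connected components of the graph that
    # links pairs co-assigned in more than `threshold` of the clusterings.
    C = co_association(labelings)
    n = len(C)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if C[i][j] > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, three clusterings `[0,0,1,1]`, `[0,0,1,1]`, `[0,1,1,1]` of four points yield the consensus groups `[[0, 1], [2, 3]]`: the dissenting third clustering is outvoted for the pair (0, 1).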
Rule-based classifiers and meta classifiers for identification of cardiac autonomic neuropathy progression
- Jelinek, Herbert, Kelarev, Andrei, Stranieri, Andrew, Yearwood, John
- Authors: Jelinek, Herbert , Kelarev, Andrei , Stranieri, Andrew , Yearwood, John
- Date: 2012
- Type: Text , Journal article
- Relation: International Journal of Information Science and Computer Mathematics Vol. 5, no. 2 (2012), p. 49-53
- Full Text:
- Reviewed:
- Description: We investigate and compare several rule-based classifiers and meta classifiers in their ability to obtain multi-class classifications of cardiac autonomic neuropathy (CAN) and its progression. The best results obtained in our experiments are significantly better than the outcomes published previously in the literature for analogous CAN identification tasks or simpler binary classification tasks.