Languages recognized by two-sided automata of graphs
- Authors: Miller, Mirka , Kelarev, Andrei , Sokratova, Olga
- Date: 2005
- Type: Text , Journal article
- Relation: Proceedings of the Estonian Academy of Sciences, Physics Mathematic Vol. 51, no. 1 (2005), p. 46-54
- Full Text: false
- Reviewed:
- Description: We introduce two-sided automata defined by directed graphs and describe all languages recognized by these automata.
- Description: C1
- Description: 2003001399
An algorithm for BCH codes extended with finite state automata
- Authors: Kelarev, Andrei
- Date: 2008
- Type: Text , Journal article
- Relation: Fundamenta Informaticae Vol. 84, no. 1 (2008), p. 51-60
- Full Text: false
- Reviewed:
- Description: This articles develops a combinatorial algorithm for a class of codes extending BCH codes and constructed with finite state automata. Ou algorithm computes the largest number of errors that the extended codes can correct, and finds a generator for each optimal code in this class of extensions. The question of finding codes with largest possible information rates remains open.
- Description: C1
Experimental investigation of clasification algorithms for ITS dataset
- Authors: Yearwood, John , Kang, Byeongho , Kelarev, Andrei
- Date: 2008
- Type: Text , Conference paper
- Relation: PKAW-08, Pacific Rim Knowledge Acquisition Workshop 2008, as part of PRICAI 2008, Tenth Pacific Rim p. 262-272
- Full Text: false
- Reviewed:
- Description: This article is devoted to experimental investigation of classification algorithms for analysis of ITS dataset. We introduce and consider a novel k-committees alogorithm for classification and compare it with the discrete k- means and nearest neighbour algorithms. The ITS dataset consists of nuclear ribosomal DNA sequences, where rather sophisticated alignment scores have to be used as a measure of distance. These scores do not form Minkowski metric and the sequences cannot be regarded as points in a finite dimensional space. This is why it is necessary to develop novel algorithms and adjust familiar ones. We present the results of experiments comparing the efficiency of three classification methods in their ability to achieve agreement with classes published in the biological literature before. It turns out that our algorithms are efficient and can be used to obtain biologically significant classifications. A simplified version of a synthetic dataset, where the k-committees classifier out performs k-means and Nearest Neighbour classifiers, is also presented.
- Description: E1
An algorithm for the optimization of multiple classifers in data mining based on graphs
- Authors: Kelarev, Andrei , Ryan, Joe , Yearwood, John
- Date: 2009
- Type: Text , Journal article
- Relation: The Journal of Combinatorial Mathematics and Combinatorial Computing Vol. 71, no. (2009), p. 65-85
- Full Text: false
- Reviewed:
- Description: This article develops an efficient combinatorial algorithm based on labeled directed graphs and motivated by applications in data mining for designing multiple classifiers. Our method originates from the standard approach described in [37]. It defines a representation of a multiclass classifier in terms of several binary classifiers. We are using labeled graphs to introduce additional structure on the classifier. Representations of this sort are known to have serious advantages. An important property of these representations is their ability to correct errors of individual binary classifiers and produce correct combined output. For every representation like this we develop a combinatorial algorithm with quadratic running time to compute the largest number of errors of individual binary classifiers which can be corrected by the combined multiple classifier. In addition, we consider the question of optimizing the classifiers of this type and find all optimal representations for these multiple classifiers.
- Description: 2003007563
Optimization methods and the k-committees algorithm for clustering of sequence data
- Authors: Yearwood, John , Bagirov, Adil , Kelarev, Andrei
- Date: 2009
- Type: Text , Journal article
- Relation: Applied and Computational Mathematics Vol. 8, no. 1 (2009), p. 92-101
- Relation: http://purl.org/au-research/grants/arc/DP0211866
- Relation: http://purl.org/au-research/grants/arc/DP0666061
- Full Text: false
- Description: The present paper is devoted to new algorithms for unsupervised clustering based on the optimization approaches due to [2], [3] and [4]. We consider a novel situation, where the datasets consist of nucleotide or protein sequences and rather sophisticated biologically significant alignment scores have to be used as a measure of distance. Sequences of this kind cannot be regarded as points in a finite dimensional space. Besides, the alignment scores do not satisfy properties of Minkowski metrics. Nevertheless the optimization approaches have made it possible to introduce a new k-committees algorithm and compare its performance with previous algorithms for two datasets. Our experimental results show that the k-committees algorithms achieves intermediate accuracy for a dataset of ITS sequences, and it can perform better than the discrete k-means and Nearest Neighbour algorithms for certain datasets. All three algorithms achieve good agreement with clusters published in the biological literature before and can be used to obtain biologically significant clusterings.
Internet security applications of Grobner-Shirvov bases
- Authors: Kelarev, Andrei , Yearwood, John , Watters, Paul
- Date: 2010
- Type: Text , Journal article
- Relation: Asian-European Journal of Mathematics Vol. 3, no. 3 (2010), p. 435-442
- Relation: http://purl.org/au-research/grants/arc/DP0211866
- Full Text: false
- Reviewed:
A Grobner-Shirshov Algorithm for Applications in Internet Security
- Authors: Kelarev, Andrei , Yearwood, John , Watters, Paul , Wu, Xinwen , Ma, Liping , Abawajy, Jemal , Pan, L.
- Date: 2011
- Type: Text , Journal article
- Relation: Southeast Asian Bulletin of Mathematics Vol. 35, no. (2011), p. 807-820
- Full Text: false
- Reviewed:
- Description: The design of multiple classication and clustering systems for the detection of malware is an important problem in internet security. Grobner-Shirshov bases have been used recently by Dazeley et al. [15] to develop an algorithm for constructions with certain restrictions on the sandwich-matrices. We develop a new Grobner-Shirshov algorithm which applies to a larger variety of constructions based on combinatorial Rees matrix semigroups without any restrictions on the sandwich-matrices.
An application of novel clustering technique for information security
- Authors: Beliakov, Gleb , Yearwood, John , Kelarev, Andrei
- Date: 2011
- Type: Text , Conference paper
- Relation: Applications and Techniques in Information Security Workshop p. 5-11
- Full Text: false
- Reviewed:
- Description: This article presents experimental results devoted to a new application of the novel clustering technique introduced by the authors recently. Our aim is to facilitate the application of robust and stable consensus functions in information security, where it is often necessary to process large data sets and monitor outcomes in real time, as it is required, for example, for intrusion detection. Here we concentrate on the particular case of application to profiling of phishing websites. First, we apply several independent clustering algorithms to a randomized sample of data to obtain independent initial clusterings. Silhouette index is used to determine the number of clusters. Second, we use a consensus function to combine these independent clusterings into one consensus clustering . Feature ranking is used to select a subset of features for the consensus function. Third, we train fast supervised classification algorithms on the resulting consensus clustering in order to enable them to process the whole large data set as well as new data. The precision and recall of classifiers at the final stage of this scheme are critical for effectiveness of the whole procedure. We investigated various combinations of three consensus functions, Cluster-Based Graph Formulation (CBGF), Hybrid Bipartite Graph Formulation (HBGF), and Instance-Based Graph Formulation (IBGF) and a variety of supervised classification algorithms. The best precision and recall have been obtained by the combination of the HBGF consensus function and the SMO classifier with the polynomial kernel.
- Description: 2003009195
Establishing reasoning communities of security experts for Internet Commerce Security
- Authors: Kelarev, Andrei , Brown, Simon , Watters, Paul , Wu, Xinwen , Dazeley, Richard
- Date: 2011
- Type: Text , Book chapter
- Relation: Technologies for supporting reasoning communities and collaborative decision making : Cooperative approaches p. 380-396
- Full Text: false
- Reviewed:
- Description: The highly sophisticated and rapidly evolving area of internet commerce security presents many novel challenges for the organization of discourse in reasoning communities. This chapter suggests appropriate reasoning methods and demonstrates how establishing reasoning communities of security experts and enabling productive group discourse among them can play a crucial role in successful resolution of problems concerning the implementation, integration, deployment and maintenance of flexible local security systems for defense against malware threats in internet security. Local security systems of this sort may combine several ready open source or commercial software packages behind a common front-end and may enhance and supplement their facilities with additional plug-ins. To illustrate the diverse character of challenges the reasoning communities in internet security are likely to be faced with, this chapter concentrates on defense against phishing attacks. This example was selected as it is one of the newest and most rapidly changing application domains for the principles of organizing reasoning communities. The major group discourse methods suggested for the reasoning communities of security experts in this chapter include the Delphi Method, the Wideband Delphi Process, the Generic/Actual Argument Model of Structured Reasoning, Brainstorming, Reverse Brainstorming, Consensus Decision Making, Voting, Open Delphi and Open Brainstorming Methods. The Delphi Method and Wideband Delphi Process are suggested as tools for organizing a cohesive reasoning architecture, for coordinating other methods, and for preparing and allocating other methods to particular issues.
A multi-tier ensemble construction of classifiers for phishing email detection and filtering
- Authors: Abawajy, Jemal , Kelarev, Andrei
- Date: 2012
- Type: Text , Conference paper
- Relation: 4th International Symposium on Cyberspace Safety and Security, CSS 2012 Vol. 7672 LNCS, p. 48-56
- Full Text: false
- Reviewed:
- Description: This paper is devoted to multi-tier ensemble classifiers for the detection and filtering of phishing emails. We introduce a new construction of ensemble classifiers, based on the well known and productive multi-tier approach. Our experiments evaluate their performance for the detection and filtering of phishing emails. The multi-tier constructions are well known and have been used to design effective classifiers for email classification and other applications previously. We investigate new multi-tier ensemble classifiers, where diverse ensemble methods are combined in a unified system by incorporating different ensembles at a lower tier as an integral part of another ensemble at the top tier. Our novel contribution is to investigate the possibility and effectiveness of combining diverse ensemble methods into one large multi-tier ensemble for the example of detection and filtering of phishing emails. Our study handled a few essential ensemble methods and more recent approaches incorporated into a combined multi-tier ensemble classifier. The results show that new large multi-tier ensemble classifiers achieved better performance compared with the outcomes of the base classifiers and ensemble classifiers incorporated in the multi-tier system. This demonstrates that the new method of combining diverse ensembles into one unified multi-tier ensemble can be applied to increase the performance of classifiers if diverse ensembles are incorporated in the system. © 2012 Springer-Verlag.
- Description: 2003010675
Applications of machine learning for linguistic analysis of texts
- Authors: Torney, Rosemary , Yearwood, John , Vamplew, Peter , Kelarev, Andrei
- Date: 2012
- Type: Text , Book chapter
- Relation: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques p. 133-148
- Full Text: false
- Reviewed:
- Description: This chapter describes a novel multistage method for linguistic clustering of large collections of texts available on the Internet as a precursor to linguistic analysis of these texts. This method addresses the practicalities of applying clustering operations to a very large set of text documents by using a combination of unsupervised clustering and supervised classification. The method relies on creating a multitude of independent clusterings of a randomized sample selected from the International Corpus of Learner English. Several consensus functions and sophisticated algorithms are applied in two substages to combine these independent clusterings into one final consensus clustering, which is then used to train fast classifiers in order to enable them to perform the profiling of very large collections of text and web data. This approach makes it possible to apply advanced highly accurate and sophisticated clustering techniques by combining them with fast supervised classification algorithms. For the effectiveness of this multistage method it is crucial to determine how well the supervised classification algorithms are going to perform at the final stage, when they are used to process large data sets available on the Internet. This performance may also serve as an indication of the quality of the combined consensus clustering obtained in the preceding stages. The authors' experimental results compare the performance of several classification algorithms incorporated in this multistage scheme and demonstrate that several of these classification algorithms achieve very high precision and recall and can be used in practical implementations of their method.
Detection of CAN by ensemble classifiers based on Ripple Down rules
- Authors: Kelarev, Andrei , Dazeley, Richard , Stranieri, Andrew , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Book chapter
- Relation: Knowledge Management and Acquisition for Intelligent Systems p. 147-159
- Full Text: false
- Reviewed:
- Description: It is well known that classification models produced by the Ripple Down Rules are easier to maintain and update. They are compact and can provide an explanation of their reasoning making them easy to understand for medical practitioners. This article is devoted to an empirical investigation and comparison of several ensemble methods based on Ripple Down Rules in a novel application for the detection of cardiovascular autonomic neuropathy (CAN) from an extensive data set collected by the Diabetes Complications Screening Research Initiative at Charles Sturt University. Our experiments included essential ensemble methods, several more recent state-of-the-art techniques, and a novel consensus function based on graph partitioning. The results show that our novel application of Ripple Down Rules in ensemble classifiers for the detection of CAN achieved better performance parameters compared with the outcomes obtained previously in the literature.
Empirical investigation of consensus clustering for large ECG data sets
- Authors: Kelarev, Andrei , Stranieri, Andrew , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Conference proceedings
- Full Text: false
- Description: This article investigates a novel machine learning approach applying consensus clustering in conjunction with classification for the data mining of very large and highly dimensional ECG data sets. To obtain robust and stable clusterings, consensus functions can be applied for clustering ensembles combining a multitude of independent initial clusterings. Direct applications of consensus functions to highly dimensional ECG data sets remain computationally expensive and impracticable. We introduce a multistage scheme including various procedures for dimensionality reduction, consensus clustering of randomized samples, followed by the use of a fast supervised classification algorithm. Applying the Hybrid Bipartite Graph Formulation combined with rank ordering and SMO we obtained an area under the receiver operating curve of 0.987. The performance of the classification algorithm at the final stage is crucial for the effectiveness of this technique. It can be regarded as an indication of the reliability, quality and stability of the combined consensus clustering. © 2012 IEEE.
Empirical study of decision trees and ensemble classifiers for monitoring of diabetes patients in pervasive healthcare
- Authors: Kelarev, Andrei , Stranieri, Andrew , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Conference proceedings
- Full Text: false
- Description: Diabetes is a condition requiring continuous everyday monitoring of health related tests. To monitor specific clinical complications one has to find a small set of features to be collected from the sensors and efficient resource-aware algorithms for their processing. This article is concerned with the detection and monitoring of cardiovascular autonomic neuropathy, CAN, in diabetes patients. Using a small set of features identified previously, we carry out an empirical investigation and comparison of several ensemble methods based on decision trees for a novel application of the processing of sensor data from diabetes patients for pervasive health monitoring of CAN. Our experiments relied on an extensive database collected by the Diabetes Complications Screening Research Initiative at Charles Sturt University and concentrated on the particular task of the detection and monitoring of cardiovascular autonomic neuropathy. Most of the features in the database can now be collected using wearable sensors. Our experiments included several essential ensemble methods, a few more advanced and recent techniques, and a novel consensus function. The results show that our novel application of the decision trees in ensemble classifiers for the detection and monitoring of CAN in diabetes patients achieved better performance parameters compared with the outcomes obtained previously in the literature. © 2012 IEEE.
- Description: 2003009675
Improving classifications for cardiac autonomic neuropathy using multi-level ensemble classifiers and feature selection based on random forest
- Authors: Kelarev, Andrei , Stranieri, Andrew , Abawajy, Jemal , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Conference paper
- Relation: Tenth Australasian Data Mining Conference Vol. 134, p. 93-101
- Full Text: false
- Reviewed:
- Description: This paper is devoted to empirical investigation of novel multi-level ensemble meta classifiers for the detection and monitoring of progression of cardiac autonomic neuropathy, CAN, in diabetes patients. Our experiments relied on an extensive database and concentrated on ensembles of ensembles, or multi-level meta classifiers, for the classification of cardiac autonomic neuropathy progression. First, we carried out a thorough investigation comparing the performance of various base classifiers for several known sets of the most essential features in this database and determined that Random Forest significantly and consistently outperforms all other base classifiers in this new application. Second, we used feature selection and ranking implemented in Random Forest. It was able to identify a new set of features, which has turned out better than all other sets considered for this large and well-known database previously. Random Forest remained the very best classifier for the new set of features too. Third, we investigated meta classifiers and new multi-level meta classifiers based on Random Forest, which have improved its performance. The results obtained show that novel multi-level meta classifiers achieved further improvement and obtained new outcomes that are significantly better compared with the outcomes published in the literature previously for cardiac autonomic neuropathy.
Machine learning algorithms for analysis of DNA data sets
- Authors: Yearwood, John , Bagirov, Adil , Kelarev, Andrei
- Date: 2012
- Type: Text , Book chapter
- Relation: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques p. 47-58
- Relation: http://purl.org/au-research/grants/arc/LP0990908
- Full Text: false
- Reviewed:
- Description: The applications of machine learning algorithms to the analysis of data sets of DNA sequences are very important. The present chapter is devoted to the experimental investigation of applications of several machine learning algorithms for the analysis of a JLA data set consisting of DNA sequences derived from non-coding segments in the junction of the large single copy region and inverted repeat A of the chloroplast genome in Eucalyptus collected by Australian biologists. Data sets of this sort represent a new situation, where sophisticated alignment scores have to be used as a measure of similarity. The alignment scores do not satisfy properties of the Minkowski metric, and new machine learning approaches have to be investigated. The authors' experiments show that machine learning algorithms based on local alignment scores achieve very good agreement with known biological classes for this data set. A new machine learning algorithm based on graph partitioning performed best for clustering of the JLA data set. Our novel k-committees algorithm produced most accurate results for classification. Two new examples of synthetic data sets demonstrate that the authors' k-committees algorithm can outperform both the Nearest Neighbour and k-medoids algorithms simultaneously.
A data mining application of the incidence semirings
- Authors: Abawajy, Jemal , Kelarev, Andrei , Yearwood, John , Turville, Christopher
- Date: 2013
- Type: Text , Journal article
- Relation: Houston Journal of Mathematics Vol. 39, no. 4 (2013), p. 1083-1093
- Relation: http://purl.org/au-research/grants/arc/LP0990908
- Full Text: false
- Reviewed:
- Description: This paper is devoted to a combinatorial problem for incidence semirings, which can be viewed as sets of polynomials over graphs, where the edges are the unknowns and the coefficients are taken from a semiring. The construction of incidence rings is very well known and has many useful applications. The present article is devoted to a novel application of the more general incidence semirings. Recent research on data mining has motivated the investigation of the sets of centroids that have largest weights in semiring constructions. These sets are valuable for the design of centroid-based classification systems, or classifiers, as well as for the design of multiple classifiers combining several individual classifiers. Our article gives a complete description of all sets of centroids with the largest weight in incidence semirings.
Classification systems based on combinatorial semigroups
- Authors: Abawajy, Jemal , Kelarev, Andrei
- Date: 2013
- Type: Text , Journal article
- Relation: Semigroup forum Vol. 86, no. 3 (2013), p. 603-612
- Full Text: false
- Reviewed:
- Description: The present article continues the investigation of constructions essential for applications of combinatorial semigroups to the design of multiple classification systems in data mining. Our main theorem gives a complete description of all optimal classification systems defined by one-sided ideals in a construction based on combinatorial Rees matrix semigroups. It strengthens and generalizes previous results, which handled the more narrow case of two-sided ideals.
Empirical investigation of decision tree ensembles for monitoring cardiac complications of diabetes
- Authors: Kelarev, Andrei , Abawajy, Jemal , Stranieri, Andrew , Jelinek, Herbert
- Date: 2013
- Type: Text , Journal article
- Relation: International Journal of Data Warehousing and mining Vol. 9, no. 4 (2013), p. 1-18
- Full Text: false
- Reviewed:
- Description: Cardiac complications of diabetes require continuous monitoring since they may lead to increased morbidity or sudden death of patients. In order to monitor clinical complications of diabetes using wearable sensors, a small set of features have to be identified and effective algorithms for their processing need to be investigated. This article focuses on detecting and monitoring cardiac autonomic neuropathy (CAN) in diabetes patients. The authors investigate and compare the effectiveness of classifiers based on the following decision trees: ADTree, J48, NBTree, RandomTree, REPTree, and SimpleCart. The authors perform a thorough study comparing these decision trees as well as several decision tree ensembles created by applying the following ensemble methods: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking, and two multi-level combinations of AdaBoost and MultiBoost with Bagging for the processing of data from diabetes patients for pervasive health monitoring of CAN. This paper concentrates on the particular task of applying decision tree ensembles for the detection and monitoring of cardiac autonomic neuropathy using these features. Experimental outcomes presented here show that the authors' application of the decision tree ensembles for the detection and monitoring of CAN in diabetes patients achieved better performance parameters compared with the results obtained previously in the literature.
Using meta-regression data mining to improve predictions of performance based on heart rate dynamics for Australian football
- Authors: Jelinek, Herbert , Kelarev, Andrei , Robinson, Dean , Stranieri, Andrew , Cornforth, David
- Date: 2014
- Type: Text , Journal article
- Relation: Applied Soft Computing Vol. 14, no. PART A (2014), p. 81-87
- Full Text: false
- Reviewed:
- Description: This work investigates the effectiveness of using computer-based machine learning regression algorithms and meta-regression methods to predict performance data for Australian football players based on parameters collected during daily physiological tests. Three experiments are described. The first uses all available data with a variety of regression techniques. The second uses a subset of features selected from the available data using the Random Forest method. The third used meta-regression with the selected feature subset. Our experiments demonstrate that feature selection and meta-regression methods improve the accuracy of predictions for match performance of Australian football players based on daily data of medical tests, compared to regression methods alone. Meta-regression methods and feature selection were able to obtain performance prediction outcomes with significant correlation coefficients. The best results were obtained by the additive regression based on isotonic regression for a set of most influential features selected by Random Forest. This model was able to predict athlete performance data with a correlation coefficient of 0.86 (p < 0.05). © 2013 Published by Elsevier B.V. All rights reserved.
- Description: C1