Hybrids of support vector machine wrapper and filter based framework for malware detection
- Authors: Huda, Shamsul , Abawajy, Jemal , Alazab, Mamoun , Abdollahian, Mali , Islam, Rafiqul , Yearwood, John
- Date: 2016
- Type: Text , Journal article
- Relation: Future Generation Computer Systems Vol. 55, no. (2016), p. 376-390
- Full Text: false
- Reviewed:
- Description: Malware replicates itself and produces offspring with the same characteristics but different signatures by using code obfuscation techniques. Current-generation Anti-Virus (AV) engines employ a signature-template type of detection approach, which malware can easily evade once its signature no longer matches the database. This reduces the capability of current AV engines to detect malware. In this paper we propose a hybrid framework for malware detection that combines a Support Vector Machine (SVM) wrapper with Maximum-Relevance Minimum-Redundancy (MRMR) filter heuristics, where Application Program Interface (API) call statistics are used as malware features. The novelty of our hybrid framework is that it injects the filter’s ranking score into the wrapper selection process, combining the properties of both wrappers and filters with API call statistics, and can therefore detect malware by the nature of its infectious actions rather than by signature. To the best of our knowledge, this kind of hybrid approach has not yet been explored in the literature in the context of feature selection and malware detection. Knowledge about the intrinsic characteristics of malicious activities is captured by the API call statistics, which are injected as a filter score into the wrapper’s backward-elimination process in order to find the most significant APIs. Using the most significant APIs in the wrapper classification on datasets of both obfuscated malware and benign programs, the results show that the proposed hybrid framework clearly surpasses existing models, including independent filters and wrappers, while using only a very compact set of significant APIs. The performance of the proposed and existing models has further been compared using binary logistic regression. Goodness-of-fit comparison criteria such as the Chi-square statistic, Akaike’s Information Criterion (AIC) and the Receiver Operating Characteristic (ROC) curve are deployed to identify the best-performing models.
Experimental outcomes based on the above criteria also show that the proposed hybrid framework outperforms existing signature-type models, including independent wrapper and filter approaches, in identifying malware.
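The filter-score injection described in the abstract can be sketched in a few lines. This is an illustrative reconstruction only: the correlation-based filter score, the 1-nearest-neighbour wrapper evaluation, and the mixing weight `alpha` are assumptions standing in for the paper's MRMR filter and SVM wrapper.

```python
# Illustrative sketch: inject a filter ranking score into a wrapper's
# backward elimination, so features are dropped by a blended score
# rather than by wrapper accuracy alone. All scorers here are toy
# stand-ins, not the authors' implementation.

def filter_scores(X, y):
    """Absolute correlation of each feature with the label
    (a simple stand-in for an mRMR-style relevance score)."""
    n, d = len(X), len(X[0])
    scores = []
    for j in range(d):
        col = [row[j] for row in X]
        mx, my = sum(col) / n, sum(y) / n
        cov = sum((c - mx) * (t - my) for c, t in zip(col, y))
        sx = sum((c - mx) ** 2 for c in col) ** 0.5
        sy = sum((t - my) ** 2 for t in y) ** 0.5
        scores.append(abs(cov / (sx * sy)) if sx and sy else 0.0)
    return scores

def wrapper_accuracy(X, y, feats):
    """Leave-one-out 1-nearest-neighbour accuracy on a feature subset
    (a stand-in for the SVM wrapper evaluation)."""
    correct = 0
    for i in range(len(X)):
        best, pred = None, None
        for k in range(len(X)):
            if k == i:
                continue
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in feats)
            if best is None or dist < best:
                best, pred = dist, y[k]
        correct += pred == y[i]
    return correct / len(X)

def hybrid_backward_elimination(X, y, alpha=0.7, keep=2):
    """At each step, drop the feature whose removal maximises a blend
    of wrapper accuracy and the mean filter score of the remainder."""
    fs = filter_scores(X, y)
    feats = list(range(len(X[0])))
    while len(feats) > keep:
        best_subset, best_score = None, -1.0
        for f in feats:
            subset = [g for g in feats if g != f]
            score = (alpha * wrapper_accuracy(X, y, subset)
                     + (1 - alpha) * sum(fs[g] for g in subset) / len(subset))
        # keep the subset with the best blended score
            if score > best_score:
                best_subset, best_score = subset, score
        feats = best_subset
    return feats
```

On a toy dataset where two features track the class label and one is noise, the blended score eliminates the noise feature first even when the wrapper accuracy alone cannot distinguish the candidate subsets.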
Constructing an inter-post similarity measure to differentiate the psychological stages in offensive chats
- Authors: Miah, Md Waliur Rahman , Yearwood, John , Kulkarni, Siddhivinayak
- Date: 2015
- Type: Text , Journal article
- Relation: Journal of the Association for Information Science and Technology Vol. 66, no. 5 (2015), p. 1065-1081
- Full Text: false
- Reviewed:
- Description: Offensive Internet chats, particularly the child-exploiting type, tend to follow a documented psychological behavioral pattern. Researchers have identified some important stages in this pattern, broadly comprising befriending, information exchange, grooming, and approach. Similarities among the posts of a chat play an important role in differentiating and identifying these stages. In this article a novel similarity measure is constructed which gives high inter-post similarity among the chat posts within a particular behavioral stage and low inter-post similarity across different behavioral stages. A corpus-based dictionary of psychological stages is constructed by mining the terms associated with each stage; the dictionary works as a background knowledge base supporting the similarity measure. To find the inter-post similarity a modified sentence similarity measure is used. The proposed measure gives improved recognition of inter-stage and intra-stage similarity among the chat posts compared with other types of similarity measures. The pairwise inter-post similarity is used for clustering chat posts into the psychological stages. Experimental results demonstrate that the new clustering method gives better results than some current clustering methods.
A hybrid wrapper-filter approach to detect the source(s) of out-of-control signals in multivariate manufacturing process
- Authors: Huda, Shamsul , Abdollahian, Mali , Mammadov, Musa , Yearwood, John , Ahmed, Shafiq , Sultan, Ibrahim
- Date: 2014
- Type: Text , Journal article
- Relation: European Journal of Operational Research Vol. 237, no. 3 (2014), p. 857-870
- Full Text: false
- Reviewed:
- Description: With modern data-acquisition equipment and on-line computers used during production, it is now common to monitor several correlated quality characteristics simultaneously in multivariate processes. Multivariate control charts (MCC) are important tools for monitoring multivariate processes. One difficulty encountered with multivariate control charts is the identification of the variable or group of variables that cause an out-of-control signal. Expert knowledge, either in combination with a wrapper-based supervised classifier or with a pre-filter and wrapper, is the standard approach to detecting the sources of an out-of-control signal. However, gathering expert knowledge for source identification is costly and may introduce human error. Individual univariate control charts (UCC) and decomposition of T2 statistics are also used, often simultaneously, to identify the sources, but these either ignore the correlations between the sources or become more time-consuming as the dimension increases. The aim of this paper is to develop a source identification approach that does not need any expert knowledge and can detect out-of-control signals with lower computational complexity. We propose a hybrid wrapper-filter based source identification approach that hybridizes a Mutual Information (MI) based Maximum Relevance (MR) filter ranking heuristic with an Artificial Neural Network (ANN) based wrapper. The Artificial Neural Network Input Gain Measurement Approximation (ANNIGMA) has been combined with MR (MR-ANNIGMA) to utilize the knowledge about the intrinsic pattern of the quality characteristics computed by the filter for directing the wrapper search process. To compute an optimal ANNIGMA score, we also propose a Global MR-ANNIGMA using a non-functional relationship between variables, which is independent of the derivative of the objective function and has the potential to overcome the local optimization problem of ANN training.
The novelty of the proposed approaches is that they combine the advantages of both filter and wrapper approaches and do not require any expert knowledge about the sources of the out-of-control signals. The heuristic-score-based subset generation process also reduces the search space to polynomial growth, which in turn reduces computational time. The proposed approaches were tested in exhaustive experiments using both simulated and real manufacturing data and compared to existing methods, including independent filter, wrapper and Multivariate EWMA (MEWMA) methods. The results indicate that the proposed approaches can identify the sources of out-of-control signals more accurately than existing approaches. © 2014 Elsevier B.V. All rights reserved.
A new loss function for robust classification
- Authors: Zhao, Lei , Mammadov, Musa , Yearwood, John
- Date: 2014
- Type: Text , Journal article
- Relation: Intelligent Data Analysis Vol. 18, no. 4 (2014), p. 697-715
- Full Text: false
- Reviewed:
- Description: The loss function plays an important role in data classification. Many loss functions have been proposed and applied to different classification problems. This paper proposes a new, so-called smoothed 0-1 loss function, which can be considered an approximation of the classical 0-1 loss function. Due to the non-convexity of the proposed loss function, global optimization methods are required to solve the corresponding optimization problems. Together with the proposed loss function, we compare the performance of several existing loss functions in the classification of noisy data sets. In this comparison, different optimization problems are considered with regard to the convexity and smoothness of the different loss functions. The experimental results show that the proposed smoothed 0-1 loss function works better on data sets with noisy labels, noisy features, and outliers. © 2014 - IOS Press and the authors. All rights reserved.
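The idea of smoothing the 0-1 loss can be illustrated with a sigmoid-shaped surrogate; the exact functional form used in the paper may differ, and the sharpness parameter `gamma` below is an assumption for illustration.

```python
import math

def zero_one_loss(margin):
    """Classical 0-1 loss as a function of the margin y*f(x):
    1 for a misclassification, 0 otherwise."""
    return 0.0 if margin > 0 else 1.0

def smoothed_zero_one_loss(margin, gamma=5.0):
    """A smooth, sigmoid-shaped surrogate for the 0-1 loss; larger
    gamma gives a closer (but less smooth) approximation of the
    step function."""
    return 1.0 / (1.0 + math.exp(gamma * margin))
```

Unlike hinge or logistic loss, this surrogate is bounded: a single outlier with a large negative margin contributes at most 1 to the empirical risk, which is the intuition behind robustness to noisy labels and outliers; the price is non-convexity, hence the need for global optimization.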
A theoretical foundation of demand driven web services
- Authors: Sun, Zhaohao , Yearwood, John
- Date: 2014
- Type: Text , Book chapter
- Relation: Demand-driven web services p. 1-32
- Full Text: false
- Reviewed:
- Description: Web services are playing a pivotal role in business, management, governance, and society with the dramatic development of the Internet and the Web. However, many fundamental issues are still ignored to some extent. For example, what is the unified perspective on the state of the art of Web services? What is the foundation of Demand-Driven Web Services (DDWS)? This chapter addresses these fundamental issues by examining the state of the art of Web services and proposing a theoretical and technological foundation for demand-driven Web services with applications. This chapter also presents an extended Service-Oriented Architecture (SOA), eSMACS SOA, and examines the main players in this architecture. This chapter then classifies DDWS as government DDWS, organizational DDWS, enterprise DDWS, customer DDWS, and citizen DDWS, and looks at the corresponding Web services. Finally, this chapter examines the theoretical and technical foundations for DDWS with applications. The proposed approaches will facilitate research and development of Web services, mobile services, cloud services, and social services.
Analytics service oriented architecture for enterprise information systems
- Authors: Sun, Zhaohao , Strang, Kenneth , Yearwood, John
- Date: 2014
- Type: Text , Conference paper
- Relation: 16th International Conference on Information Integration and Web-based Applications & Services
- Full Text: false
- Reviewed:
- Description: Big data analytics and business analytics are disruptive technologies and innovative solutions for enterprise development. However, what is the relationship between big data analytics and business analytics? What is the relationship between business analytics and enterprise information systems (EIS)? How can business analytics enhance the development of EIS? These are still open issues for EIS development. This paper addresses these three issues by proposing an ontology of business analytics, presenting an analytics service-oriented architecture (ASOA) and applying ASOA to EIS; our survey data analysis showed that the proposed ASOA can enhance the development of EIS. This paper also discusses the interrelationships between data analysis and business analytics, and between data analytics and big data analytics. The proposed approaches will facilitate research and development of EIS, business analytics, big data analytics, and business intelligence.
Handbook of research on demand-driven web services : Theory, technologies, and applications
- Authors: Sun, Zhaohao , Yearwood, John
- Date: 2014
- Type: Text , Edited book
- Full Text: false
Hybrid metaheuristic approaches to the expectation maximization for estimation of the hidden markov model for signal modeling
- Authors: Huda, Shamsul , Yearwood, John , Togneri, Roberto
- Date: 2014
- Type: Text , Journal article
- Relation: IEEE Transactions on Cybernetics Vol. 44, no. 10 (2014), p. 1962-1977
- Full Text: false
- Reviewed:
- Description: Expectation maximization (EM) is the standard training algorithm for the hidden Markov model (HMM). However, EM faces a local convergence problem in HMM estimation. This paper attempts to overcome this problem and proposes hybrid metaheuristic approaches to EM for HMM. In our earlier research, a hybrid constraint-based evolutionary learning approach to EM (CEL-EM) improved HMM estimation. In this paper, we propose a hybrid simulated annealing stochastic version of EM (SASEM) that combines simulated annealing (SA) with EM. The novelty of our approach is that we develop a mathematical reformulation of HMM estimation by introducing a stochastic step between the EM steps, and combine SA with EM to provide better control over the acceptance of stochastic and EM steps for better HMM estimation. We also extend our earlier work [1] and propose a second hybrid, a combination of an EA and the proposed SASEM (EA-SASEM). The proposed EA-SASEM uses the best constraint-based EA strategies from CEL-EM together with the stochastic reformulation of HMM. The complementary properties of EA and SA and the stochastic reformulation of HMM in SASEM give EA-SASEM sufficient potential to find better estimates for the HMM. To the best of our knowledge, this type of hybridization and mathematical reformulation has not been explored in the context of EM and HMM training. The proposed approaches have been evaluated through comprehensive experiments on the TIMIT speech corpus to justify their effectiveness in signal modeling. Experimental results show that the proposed approaches obtain higher recognition accuracies than both the EM algorithm and CEL-EM. © 2014 IEEE.
Performance evaluation of multivariate non-normal process using metaheuristic approaches
- Authors: Ahmad, S. , Abdollahian, Mali , Bhatti, M.I. , Huda, Shamsul , Yearwood, John
- Date: 2014
- Type: Text , Journal article
- Relation: Journal of Applied Statistical Science Vol. 20, no. 3 (2014), p. 299-315
- Full Text: false
- Reviewed:
- Description: Multivariate process performance indices generally rely on the assumption that the process follows a normal distribution, but in practice the data are non-normal with correlated characteristics. This paper proposes two metaheuristic-based approaches to fit a Burr distribution to such data: a single-candidate approach using a Simulated Annealing (SA) technique and a population-based approach using a constraint-based Evolutionary Algorithm (EA). The fitted Burr distribution is then used to estimate the Proportion of Non-Conforming (PNC), which in turn is used to fit an appropriate Burr distribution to the individual geometric distance variables. The empirical performance of the proposed methods has been evaluated on a real industrial data set using the PNC criterion. Experimental results demonstrate that the new approaches perform better than existing ones.
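Fitting a distribution by simulated annealing, as the abstract describes, can be sketched as below. The two-parameter Burr XII form, the Gaussian proposal step, and the cooling schedule are illustrative assumptions, not the authors' settings; data values must be strictly positive.

```python
import math
import random

def burr_neg_loglik(params, data):
    """Negative log-likelihood of the two-parameter Burr XII
    distribution, pdf f(x) = c*k*x**(c-1) / (1 + x**c)**(k+1),
    for strictly positive data."""
    c, k = params
    if c <= 0 or k <= 0:
        return float("inf")
    ll = 0.0
    for x in data:
        ll += (math.log(c * k) + (c - 1) * math.log(x)
               - (k + 1) * math.log(1 + x ** c))
    return -ll

def fit_burr_sa(data, start=(1.0, 1.0), temp=1.0, cooling=0.995,
                iters=2000, step=0.1, seed=0):
    """Simulated annealing over (c, k): always accept downhill moves,
    accept uphill moves with probability exp(-delta/T), cool T
    geometrically, and return the best parameters seen."""
    rng = random.Random(seed)
    cur = list(start)
    cur_e = burr_neg_loglik(cur, data)
    best, best_e = list(cur), cur_e
    t = temp
    for _ in range(iters):
        # Gaussian perturbation, clamped to keep parameters positive
        cand = [max(1e-3, p + rng.gauss(0.0, step)) for p in cur]
        e = burr_neg_loglik(cand, data)
        if e < cur_e or rng.random() < math.exp((cur_e - e) / t):
            cur, cur_e = cand, e
            if e < best_e:
                best, best_e = list(cand), e
        t *= cooling
    return best
```

Because the best parameters seen are tracked separately from the current state, the returned fit is never worse than the starting point under the likelihood criterion.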
A data mining application of the incidence semirings
- Authors: Abawajy, Jemal , Kelarev, Andrei , Yearwood, John , Turville, Christopher
- Date: 2013
- Type: Text , Journal article
- Relation: Houston Journal of Mathematics Vol. 39, no. 4 (2013), p. 1083-1093
- Relation: http://purl.org/au-research/grants/arc/LP0990908
- Full Text: false
- Reviewed:
- Description: This paper is devoted to a combinatorial problem for incidence semirings, which can be viewed as sets of polynomials over graphs, where the edges are the unknowns and the coefficients are taken from a semiring. The construction of incidence rings is very well known and has many useful applications. The present article is devoted to a novel application of the more general incidence semirings. Recent research on data mining has motivated the investigation of the sets of centroids that have largest weights in semiring constructions. These sets are valuable for the design of centroid-based classification systems, or classifiers, as well as for the design of multiple classifiers combining several individual classifiers. Our article gives a complete description of all sets of centroids with the largest weight in incidence semirings.
An algorithm for minimization of pumping costs in water distribution systems using a novel approach to pump scheduling
- Authors: Bagirov, Adil , Barton, Andrew , Mala-Jetmarova, Helena , Al Nuaimat, Alia , Ahmed, S. T. , Sultanova, Nargiz , Yearwood, John
- Date: 2013
- Type: Text , Journal article
- Relation: Mathematical and Computer Modelling Vol. 57, no. 3-4 (2013), p. 873-886
- Relation: http://purl.org/au-research/grants/arc/LP0990908
- Full Text: false
- Reviewed:
- Description: The operation of a water distribution system is a complex task which involves scheduling of pumps, regulating water levels of storages, and providing satisfactory water quality to customers at required flow and pressure. Pump scheduling is one of the most important tasks of the operation of a water distribution system as it represents the major part of its operating costs. In this paper, a novel approach for modeling of explicit pump scheduling to minimize energy consumption by pumps is introduced which uses the pump start/end run times as continuous variables, and binary integer variables to describe the pump status at the beginning of the scheduling period. This is different from other approaches where binary integer variables for each hour are typically used, which is considered very impractical from an operational perspective. The problem is formulated as a mixed integer nonlinear programming problem, and a new algorithm is developed for its solution. This algorithm is based on the combination of the grid search with the Hooke-Jeeves pattern search method. The performance of the algorithm is evaluated using literature test problems applying the hydraulic simulation model EPANet. © 2012 Elsevier Ltd.
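A minimal generic Hooke-Jeeves pattern search, the local-search component named in the abstract, can be sketched as follows. The grid search over discrete pump statuses and the EPANet hydraulic evaluation are omitted; the quadratic toy objective merely stands in for the pumping-cost function of the continuous start/end run times.

```python
# Generic Hooke-Jeeves pattern search (a sketch, not the paper's
# implementation): exploratory coordinate moves, then a pattern
# (extrapolation) move; the step length shrinks on failure.

def hooke_jeeves(f, x0, step=1.0, shrink=0.5, tol=1e-3):
    def explore(base, s):
        """Try +/- s along each coordinate, keeping any improvement."""
        x = list(base)
        for i in range(len(x)):
            for d in (s, -s):
                trial = list(x)
                trial[i] += d
                if f(trial) < f(x):
                    x = trial
                    break
        return x

    x = list(x0)
    while step > tol:
        nx = explore(x, step)
        if f(nx) < f(x):
            # pattern move: extrapolate along the improving direction
            pattern = [2 * b - a for a, b in zip(x, nx)]
            px = explore(pattern, step)
            x = px if f(px) < f(nx) else nx
        else:
            step *= shrink
    return x

# Toy pumping-cost surrogate: cheapest to start at t=2 and stop at t=7.
cost = lambda v: (v[0] - 2.0) ** 2 + (v[1] - 7.0) ** 2
schedule = hooke_jeeves(cost, [0.0, 0.0])
```

Pattern search needs only objective evaluations, no derivatives, which is what makes it practical when each evaluation is a hydraulic simulation run.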
A novel approach to optimal pump scheduling in water distribution systems
- Authors: Bagirov, Adil , Barton, Andrew , Mala-Jetmarova, Helena , Al Nuaimat, Alia , Ahmed, S. T. , Sultanova, Nargiz , Yearwood, John
- Date: 2012
- Type: Text , Conference paper
- Relation: 14th Water Distribution Systems Analysis Conference 2012, WDSA 2012 Vol. 1; Adelaide, Australia; 24th-27th September; p. 618-631
- Relation: http://purl.org/au-research/grants/arc/LP0990908
- Full Text: false
- Reviewed:
- Description: The operation of a water distribution system is a complex task which involves scheduling of pumps, regulating water levels of storages, and providing satisfactory water quality to customers at required flow and pressure. Pump scheduling is one of the most important tasks of the operation of a water distribution system as it represents the major part of its operating costs. In this paper, a novel approach for modeling of pump scheduling to minimize energy consumption by pumps is introduced which uses the pumps' start/end run times as continuous variables. This is different from other approaches, where binary integer variables for each hour are typically used, which is considered very impractical from an operational perspective. The problem is formulated as a nonlinear programming problem and a new algorithm is developed for its solution. This algorithm is based on the combination of the grid search with the Hooke-Jeeves pattern search method. The performance of the algorithm is evaluated using literature test problems applying the hydraulic simulation model EPANet.
Applications of machine learning for linguistic analysis of texts
- Authors: Torney, Rosemary , Yearwood, John , Vamplew, Peter , Kelarev, Andrei
- Date: 2012
- Type: Text , Book chapter
- Relation: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques p. 133-148
- Full Text: false
- Reviewed:
- Description: This chapter describes a novel multistage method for linguistic clustering of large collections of texts available on the Internet as a precursor to linguistic analysis of these texts. This method addresses the practicalities of applying clustering operations to a very large set of text documents by using a combination of unsupervised clustering and supervised classification. The method relies on creating a multitude of independent clusterings of a randomized sample selected from the International Corpus of Learner English. Several consensus functions and sophisticated algorithms are applied in two substages to combine these independent clusterings into one final consensus clustering, which is then used to train fast classifiers in order to enable them to perform the profiling of very large collections of text and web data. This approach makes it possible to apply advanced highly accurate and sophisticated clustering techniques by combining them with fast supervised classification algorithms. For the effectiveness of this multistage method it is crucial to determine how well the supervised classification algorithms are going to perform at the final stage, when they are used to process large data sets available on the Internet. This performance may also serve as an indication of the quality of the combined consensus clustering obtained in the preceding stages. The authors' experimental results compare the performance of several classification algorithms incorporated in this multistage scheme and demonstrate that several of these classification algorithms achieve very high precision and recall and can be used in practical implementations of their method.
Approaches for community decision making and collective reasoning: Knowledge technology support
- Authors: Yearwood, John , Stranieri, Andrew
- Date: 2012
- Type: Text , Book
- Relation: Approaches for Community Decision Making and Collective Reasoning: Knowledge Technology Support
- Full Text: false
- Reviewed:
- Description: Technology currently encourages the capture and storage of vast quantities of data and information, so thinkers, reasoners, and decision-makers have large resources available to support their tasks. At the same time, there is a need to engage with an enormous range of complex issues that require reasoning and actionable decisions to address them. Approaches for Community Decision Making and Collective Reasoning: Knowledge Technology Support provides each individual in a group with the broad structural wealth of reasoning. It also acts as an explicit structure onto which technological devices for supporting reasoning within a group can hook. If you are interested in how groups can structure their activities towards making better decisions or in developing technologies for the support of decision-making in groups, then this book is an excellent way to understand the state of the art and possible ways forward.
Detection of CAN by ensemble classifiers based on Ripple Down rules
- Authors: Kelarev, Andrei , Dazeley, Richard , Stranieri, Andrew , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Book chapter
- Relation: Knowledge Management and Acquisition for Intelligent Systems p. 147-159
- Full Text: false
- Reviewed:
- Description: It is well known that classification models produced by the Ripple Down Rules are easier to maintain and update. They are compact and can provide an explanation of their reasoning making them easy to understand for medical practitioners. This article is devoted to an empirical investigation and comparison of several ensemble methods based on Ripple Down Rules in a novel application for the detection of cardiovascular autonomic neuropathy (CAN) from an extensive data set collected by the Diabetes Complications Screening Research Initiative at Charles Sturt University. Our experiments included essential ensemble methods, several more recent state-of-the-art techniques, and a novel consensus function based on graph partitioning. The results show that our novel application of Ripple Down Rules in ensemble classifiers for the detection of CAN achieved better performance parameters compared with the outcomes obtained previously in the literature.
Empirical investigation of consensus clustering for large ECG data sets
- Authors: Kelarev, Andrei , Stranieri, Andrew , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Conference proceedings
- Full Text: false
- Description: This article investigates a novel machine learning approach applying consensus clustering in conjunction with classification for the data mining of very large and highly dimensional ECG data sets. To obtain robust and stable clusterings, consensus functions can be applied for clustering ensembles combining a multitude of independent initial clusterings. Direct applications of consensus functions to highly dimensional ECG data sets remain computationally expensive and impracticable. We introduce a multistage scheme including various procedures for dimensionality reduction, consensus clustering of randomized samples, followed by the use of a fast supervised classification algorithm. Applying the Hybrid Bipartite Graph Formulation combined with rank ordering and SMO we obtained an area under the receiver operating characteristic (ROC) curve of 0.987. The performance of the classification algorithm at the final stage is crucial for the effectiveness of this technique. It can be regarded as an indication of the reliability, quality and stability of the combined consensus clustering. © 2012 IEEE.
Empirical study of decision trees and ensemble classifiers for monitoring of diabetes patients in pervasive healthcare
- Authors: Kelarev, Andrei , Stranieri, Andrew , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Conference proceedings
- Full Text: false
- Description: Diabetes is a condition requiring continuous everyday monitoring of health-related tests. To monitor specific clinical complications one has to find a small set of features to be collected from the sensors and efficient resource-aware algorithms for their processing. This article is concerned with the detection and monitoring of cardiovascular autonomic neuropathy, CAN, in diabetes patients. Using a small set of features identified previously, we carry out an empirical investigation and comparison of several ensemble methods based on decision trees for a novel application: the processing of sensor data from diabetes patients for pervasive health monitoring of CAN. Our experiments relied on an extensive database collected by the Diabetes Complications Screening Research Initiative at Charles Sturt University and concentrated on the particular task of the detection and monitoring of cardiovascular autonomic neuropathy. Most of the features in the database can now be collected using wearable sensors. Our experiments included several essential ensemble methods, a few more advanced and recent techniques, and a novel consensus function. The results show that our novel application of decision trees in ensemble classifiers for the detection and monitoring of CAN in diabetes patients achieved better performance parameters compared with the outcomes obtained previously in the literature. © 2012 IEEE.
Improving classifications for cardiac autonomic neuropathy using multi-level ensemble classifiers and feature selection based on random forest
- Authors: Kelarev, Andrei , Stranieri, Andrew , Abawajy, Jemal , Yearwood, John , Jelinek, Herbert
- Date: 2012
- Type: Text , Conference paper
- Relation: Tenth Australasian Data Mining Conference Vol. 134, p. 93-101
- Full Text: false
- Reviewed:
- Description: This paper is devoted to empirical investigation of novel multi-level ensemble meta classifiers for the detection and monitoring of progression of cardiac autonomic neuropathy, CAN, in diabetes patients. Our experiments relied on an extensive database and concentrated on ensembles of ensembles, or multi-level meta classifiers, for the classification of cardiac autonomic neuropathy progression. First, we carried out a thorough investigation comparing the performance of various base classifiers for several known sets of the most essential features in this database and determined that Random Forest significantly and consistently outperforms all other base classifiers in this new application. Second, we used feature selection and ranking implemented in Random Forest. It was able to identify a new set of features, which has turned out better than all other sets considered for this large and well-known database previously. Random Forest remained the very best classifier for the new set of features too. Third, we investigated meta classifiers and new multi-level meta classifiers based on Random Forest, which have improved its performance. The results obtained show that novel multi-level meta classifiers achieved further improvement and obtained new outcomes that are significantly better compared with the outcomes published in the literature previously for cardiac autonomic neuropathy.
Machine learning algorithms for analysis of DNA data sets
- Authors: Yearwood, John , Bagirov, Adil , Kelarev, Andrei
- Date: 2012
- Type: Text , Book chapter
- Relation: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques p. 47-58
- Relation: http://purl.org/au-research/grants/arc/LP0990908
- Full Text: false
- Reviewed:
- Description: The applications of machine learning algorithms to the analysis of data sets of DNA sequences are very important. The present chapter is devoted to the experimental investigation of applications of several machine learning algorithms for the analysis of a JLA data set consisting of DNA sequences derived from non-coding segments in the junction of the large single copy region and inverted repeat A of the chloroplast genome in Eucalyptus collected by Australian biologists. Data sets of this sort represent a new situation, where sophisticated alignment scores have to be used as a measure of similarity. The alignment scores do not satisfy properties of the Minkowski metric, and new machine learning approaches have to be investigated. The authors' experiments show that machine learning algorithms based on local alignment scores achieve very good agreement with known biological classes for this data set. A new machine learning algorithm based on graph partitioning performed best for clustering of the JLA data set. Our novel k-committees algorithm produced most accurate results for classification. Two new examples of synthetic data sets demonstrate that the authors' k-committees algorithm can outperform both the Nearest Neighbour and k-medoids algorithms simultaneously.