A semantic method to information extraction for decision support systems
- Authors: Ofoghi, Bahadorreza , Yearwood, John , Ghosh, Ranadhir
- Date: 2006
- Type: Text , Conference proceedings
- Full Text: false
- Description: In this paper, we describe a novel schema for a more semantic text mining process, which results in more comprehensive decision making by decision support systems by providing more effective and accurate textual information. Using two semantic lexical resources, FrameNet and WordNet, to extract the required text snippets from unstructured free text yields a better and more accurate information extraction process that delivers more precise information either to a DSS or to a decision maker. We explain how these lexical resources could improve a focused text mining process that could be applied to an information provider system in a decision support paradigm. The preliminary results of an initial experiment show that the hybrid information extraction schema performs well in some semantic failure situations.
- Description: 2003010644
Automatic sleep stage identification: difficulties and possible solutions
- Authors: Sukhorukova, Nadezda , Stranieri, Andrew , Ofoghi, Bahadorreza , Vamplew, Peter , Saleem, Muhammad Saad , Ma, Liping , Ugon, Adrien , Ugon, Julien , Muecke, Nial , Amiel, Hélène , Philippe, Carole , Bani-Mustafa, Ahmed , Huda, Shamsul , Bertoli, Marcello , Levy, P , Ganascia, J.G
- Date: 2010
- Type: Text , Conference proceedings
- Full Text:
- Description: The diagnosis of many sleep disorders is a labour-intensive task that involves the specialised interpretation of numerous signals, including brain wave, breath and heart rate, captured in overnight polysomnogram sessions. The automation of diagnoses is challenging for data mining algorithms because the data sets are extremely large and noisy, the signals are complex and specialists' analyses vary. This work reports on the adaptation of approaches from four fields (neural networks, mathematical optimisation, financial forecasting and frequency domain analysis) to the problem of automatically determining a patient's stage of sleep. Results, though preliminary, are promising and indicate that combined approaches may prove more fruitful than reliance on a single approach.
Rule-based classifiers and meta classifiers for identification of cardiac autonomic neuropathy progression
- Authors: Jelinek, Herbert , Kelarev, Andrei , Stranieri, Andrew , Yearwood, John
- Date: 2012
- Type: Text , Journal article
- Relation: International Journal of Information Science and Computer Mathematics Vol. 5, no. 2 (2012), p. 49-53
- Full Text:
- Reviewed:
- Description: We investigate and compare several rule-based classifiers and meta classifiers in their ability to obtain multi-class classifications of cardiac autonomic neuropathy (CAN) and its progression. The best results obtained in our experiments are significantly better than the outcomes published previously in the literature for analogous CAN identification tasks or simpler binary classification tasks.
Optimization of classifiers for data mining based on combinatorial semigroups
- Authors: Kelarev, Andrei , Yearwood, John , Watters, Paul
- Date: 2011
- Type: Text , Journal article
- Relation: Semigroup Forum Vol. 82, no. 2 (2011), p. 1-10
- Full Text:
- Reviewed:
- Description: The aim of the present article is to obtain a theoretical result essential for applications of combinatorial semigroups for the design of multiple classification systems in data mining. We consider a novel construction of multiple classification systems, or classifiers, combining several binary classifiers. The construction is based on combinatorial Rees matrix semigroups without any restrictions on the sandwich-matrix. Our main theorem gives a complete description of all optimal classifiers in this novel construction. © 2011 Springer Science+Business Media, LLC.
Classification systems based on combinatorial semigroups
- Authors: Abawajy, Jemal , Kelarev, Andrei
- Date: 2013
- Type: Text , Journal article
- Relation: Semigroup Forum Vol. 86, no. 3 (2013), p. 603-612
- Full Text:
- Reviewed:
- Description: The present article continues the investigation of constructions essential for applications of combinatorial semigroups to the design of multiple classification systems in data mining. Our main theorem gives a complete description of all optimal classification systems defined by one-sided ideals in a construction based on combinatorial Rees matrix semigroups. It strengthens and generalizes previous results, which handled the more narrow case of two-sided ideals. © 2012 Springer Science+Business Media New York.
- Description: 2003011021
Efficient piecewise linear classifiers and applications
- Authors: Webb, Dean
- Date: 2011
- Type: Text , Thesis , PhD
- Full Text:
- Description: Supervised learning has become an essential part of data mining for industry, military, science and academia. Classification, a type of supervised learning, allows a machine to learn from data and then predict certain behaviours, variables or outcomes. Classification can be used to solve many problems, including the detection of malignant cancers and potentially bad creditors, and even enabling autonomy in robots. The ability to collect and store large amounts of data has increased significantly over the past few decades. However, the ability of classification techniques to deal with large scale data has not kept pace. Many data transformation and reduction schemes have been tried with mixed success. This problem is further exacerbated when dealing with real time classification in embedded systems. The real time classifier must classify using only limited processing, memory and power resources. Piecewise linear boundaries are known to provide efficient real time classifiers. They have low memory requirements, require little processing effort, are parameterless and classify in real time. Piecewise linear functions are used to approximate non-linear decision boundaries between pattern classes. Finding these piecewise linear boundaries is a difficult optimization problem that can require a long training time. Multiple optimization approaches have been used for real time classification, but can lead to suboptimal piecewise linear boundaries. This thesis develops three real time piecewise linear classifiers that deal with large scale data. Each classifier uses a single optimization algorithm in conjunction with an incremental approach that reduces the number of points as the decision boundaries are built. Two of the classifiers further reduce complexity by augmenting the incremental approach with additional schemes. One scheme uses hyperboxes to identify points inside the so-called “indeterminate” regions.
The other uses a polyhedral conic set to identify data points lying on or close to the boundary. All other points are excluded from the process of building the decision boundaries. The three classifiers are applied to real time data classification problems and the results of numerical experiments on real world data sets are reported. These results demonstrate that the new classifiers require a reasonable training time and their test set accuracy is consistently good on most data sets compared with current state of the art classifiers.
- Description: Doctor of Philosophy
A formula for multiple classifiers in data mining based on Brandt semigroups
- Authors: Kelarev, Andrei , Yearwood, John , Mammadov, Musa
- Date: 2009
- Type: Text , Journal article
- Relation: Semigroup Forum Vol. 78, no. 2 (2009), p. 293-309
- Full Text:
- Reviewed:
- Description: A general approach to designing multiple classifiers represents them as a combination of several binary classifiers in order to enable correction of classification errors and increase reliability. This method is explained, for example, in Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques, 2005, Sect. 7.5). The aim of this paper is to investigate representations of this sort based on Brandt semigroups. We give a formula for the maximum number of errors of binary classifiers, which can be corrected by a multiple classifier of this type. Examples show that our formula does not carry over to larger classes of semigroups. © 2008 Springer Science+Business Media, LLC.
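Note on the record above: the general idea of combining several binary classifiers so that some of their errors can be corrected is often illustrated with error-correcting output codes. The sketch below is only that generic illustration, not the Brandt semigroup construction or the formula from the paper; the codebook and the names `CODEBOOK`, `hamming`, `min_distance` and `decode` are hypothetical. It relies on the standard coding-theory fact that a code with minimum Hamming distance d can correct up to floor((d - 1) / 2) bit errors under nearest-codeword decoding.

```python
from typing import Dict, Sequence

# Hypothetical codebook: each class has one codeword, and each bit is the
# expected output of one binary classifier (5 binary classifiers, 4 classes).
CODEBOOK: Dict[str, Sequence[int]] = {
    "a": (0, 0, 0, 0, 0),
    "b": (1, 1, 1, 0, 0),
    "c": (0, 0, 1, 1, 1),
    "d": (1, 1, 0, 1, 1),
}


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions in which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def min_distance(codebook: Dict[str, Sequence[int]]) -> int:
    """Smallest Hamming distance between any two distinct codewords."""
    words = list(codebook.values())
    return min(
        hamming(u, v) for i, u in enumerate(words) for v in words[i + 1:]
    )


def decode(outputs: Sequence[int], codebook: Dict[str, Sequence[int]]) -> str:
    """Return the class whose codeword is nearest to the classifier outputs."""
    return min(codebook, key=lambda label: hamming(outputs, codebook[label]))


if __name__ == "__main__":
    d = min_distance(CODEBOOK)
    print("minimum distance:", d)
    print("correctable binary-classifier errors:", (d - 1) // 2)
    # Codeword for "b" with its last bit flipped by one faulty classifier.
    print("decoded class:", decode((1, 1, 1, 0, 1), CODEBOOK))
```

With this hypothetical codebook, a single wrong binary classifier still leaves the received vector closest to the correct codeword, which is the sense in which the multiple classifier corrects errors.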
Rees matrix constructions for clustering of data
- Authors: Kelarev, Andrei , Watters, Paul , Yearwood, John
- Date: 2009
- Type: Journal article
- Relation: Journal of the Australian Mathematical Society Vol. 87, no. 3 (2009), p. 377-393
- Relation: http://purl.org/au-research/grants/arc/DP0211866
- Full Text:
- Reviewed:
- Description: This paper continues the investigation of semigroup constructions motivated by applications in data mining. We give a complete description of the error-correcting capabilities of a large family of clusterers based on Rees matrix semigroups well known in semigroup theory. This result strengthens and complements previous formulas recently obtained in the literature. Examples show that our theorems do not generalize to other classes of semigroups.
Visual tools for analysing evolution, emergence, and error in data streams
- Authors: Hart, Sol , Yearwood, John , Bagirov, Adil
- Date: 2007
- Type: Text , Conference paper
- Relation: Paper presented at 6th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2007, Melbourne, Victoria : 11th-13th July 2007 p. 987-992
- Full Text:
- Description: The relatively new field of stream mining has necessitated the development of robust drift-aware algorithms that provide accurate, real time data handling capabilities. Tools are needed to assess and diagnose important trends and investigate drift evolution parameters. In this paper, we present two novel visualisation techniques, Pixie and Luna graphs, which incorporate salient group statistics coupled with intuitive visual representations of multidimensional groupings over time. Through the representations presented here, spatial interactions between temporal divisions can be diagnosed and overall distribution patterns identified. They provide a means of evaluating, in a non-constrained capacity, commonly constrained evolutionary problems.
- Description: 2003005432
Classes and clusters in data analysis
- Authors: Rubinov, Alex , Sukhorukova, Nadezda , Ugon, Julien
- Date: 2006
- Type: Text , Journal article
- Relation: European Journal of Operational Research Vol. 173, no. 3 (Sep 2006), p. 849-865
- Full Text:
- Reviewed:
- Description: We discuss the relation between classes and clusters in datasets with given classes. We examine the distribution of classes within the obtained clusters, using different clustering methods that are based on different techniques. We also study the structure of the obtained clusters. One of the main conclusions of this research is that the notion of purity cannot always be used to evaluate the accuracy of clustering techniques. (c) 2005 Elsevier B.V. All rights reserved.
- Description: C1
- Description: 2003001593
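Note on the record above: the abstract does not define purity, so the sketch below assumes the common textbook definition (the fraction of points that belong to the majority class of their cluster); the paper's own definition may differ. The function name `purity` and the toy data are hypothetical. The second clustering shows one well-known limitation of this measure, namely that putting every point in its own cluster trivially reaches the maximum score; this is offered as an illustration, not as the paper's argument.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], class_labels: Sequence[Hashable]) -> float:
    """Fraction of points that carry the majority class label of their cluster."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)
    # For each cluster, count how many points carry its most frequent class.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(class_labels)


# Toy data: six points, two known classes.
labels = ["x", "x", "x", "y", "y", "y"]
mixed = [0, 0, 0, 0, 1, 1]        # one "y" point falls into the "x" cluster
singletons = [0, 1, 2, 3, 4, 5]   # every point is its own cluster

if __name__ == "__main__":
    print(purity(mixed, labels))
    print(purity(singletons, labels))
```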
Derivative-free hybrid methods in global optimization and their applications
- Authors: Zhang, Jiapu
- Date: 2005
- Type: Text , Thesis , PhD
- Full Text:
- Description: In recent years, large-scale global optimization (GO) problems have drawn considerable attention. These problems have many applications, in particular in data mining and biochemistry. Numerical methods for GO are often very time consuming and cannot be applied to high-dimensional non-convex and/or non-smooth optimization problems. The thesis explores reasons why we need to develop and study new algorithms for solving large-scale GO problems .... The thesis presents several derivative-free hybrid methods for large scale GO problems. These methods do not guarantee the calculation of a global solution; however, results of numerical experiments presented in this thesis demonstrate that they, as a rule, calculate a solution which is a global one or close to it. Their applications to data mining problems and the protein folding problem are demonstrated.
- Description: Doctor of Philosophy
Novel data mining techniques for incompleted clinical data in diabetes management
- Authors: Jelinek, Herbert , Yatsko, Andrew , Stranieri, Andrew , Venkatraman, Sitalakshmi
- Date: 2014
- Type: Text , Journal article
- Relation: British Journal of Applied Science & Technology Vol. 4, no. 33 (2014), p. 4591-4606
- Relation: https://doi.org/10.9734/BJAST/2014/11744
- Full Text:
- Reviewed:
- Description: An important part of health care involves the upkeep and interpretation of medical databases containing patient records for clinical decision making, diagnosis and follow-up treatment. Missing clinical entries make it difficult to apply data mining algorithms for clinical decision support. This study demonstrates that higher predictive accuracy is possible using conventional data mining algorithms if missing values are dealt with appropriately. We propose a novel algorithm using a convolution of sub-problems to stage a super problem, where classes are defined by the Cartesian product of class values of the underlying problems, and Incomplete Information Dismissal and Data Completion techniques are applied for reducing features and imputing missing values. Predictive accuracies using Decision Branch, Nearest Neighborhood and Naïve Bayesian classifiers were compared to predict diabetes, cardiovascular disease and hypertension. Data is derived from the Diabetes Screening Complications Research Initiative (DiScRi) conducted at a regional Australian university, involving more than 2400 patient records with more than one hundred clinical risk factors (attributes). The results show substantial improvements in the accuracy achieved with each classifier for an effective diagnosis of diabetes, cardiovascular disease and hypertension as compared to those achieved without substituting missing values. The gain in improvement is 7% for diabetes, 21% for cardiovascular disease and 24% for hypertension, and our integrated novel approach has resulted in more than 90% accuracy for the diagnosis of any of the three conditions. This work advances data mining research towards achieving an integrated and holistic management of diabetes.
Accurate and efficient clustering algorithms for very large data sets
- Authors: Quddus, Syed
- Date: 2017
- Type: Text , Thesis , PhD
- Full Text:
- Description: The ability to mine and extract useful information from large data sets is a common concern for organizations. Data over the internet is rapidly increasing, and the development of new approaches to collect, store and mine large amounts of data is becoming significantly more important. Clustering is one of the main tasks in data mining. Many clustering algorithms have been proposed, but there are still clustering problems that have not been addressed in depth, especially clustering problems in large data sets. Clustering in large data sets is important in many applications, including network intrusion detection systems, fraud detection in banking systems, air traffic control, web logs, sensor networks, social networks and bioinformatics. Data sets in these applications contain from hundreds of thousands to hundreds of millions of data points, and they may contain hundreds or thousands of attributes. Recent developments in computer hardware allow data sets with hundreds of thousands and even millions of data points to be stored in random access memory and read repeatedly. This makes it possible to use existing clustering algorithms on such data sets. However, these algorithms require a prohibitively large CPU time and fail to produce an accurate solution. Therefore, it is important to develop clustering algorithms which are accurate and can provide real time clustering in such data sets. This is especially important in a big data era. The aim of this PhD study is to develop accurate and real time algorithms for clustering in very large data sets containing hundreds of thousands and millions of data points. Such algorithms are developed based on the combination of heuristic algorithms with the incremental approach. These algorithms also involve a special procedure to identify dense areas in a data set and compute a subset of the most informative representative data points in order to decrease the size of a data set.
It is the aim of this PhD study to develop the center-based clustering algorithms. The success of these algorithms strongly depends on the choice of starting cluster centers. Different procedures are proposed to generate such centers. Special procedures are designed to identify the most promising starting cluster centers and to restrict their number. New clustering algorithms are evaluated using large data sets available in public domains. Their results will be compared with those obtained using several existing center-based clustering algorithms.
- Description: Doctor of Philosophy
Themes in data mining, big data, and crime analytics
- Authors: Oatley, Giles
- Date: 2022
- Type: Text , Journal article , Review
- Relation: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery Vol. 12, no. 2 (2022), p.
- Full Text:
- Reviewed:
- Description: This article examines the impact of new AI-related technologies in data mining and big data on important research questions in crime analytics. Because the field is so broad, the review focuses on a selection of the most important topics. Challenges for information management, and in turn law and society, include: AI-powered predictive policing; big data for legal and adversarial decisions; bias using big data and analytics in profiling and predicting criminality; forecasting crime risk and crime rates; and, regulating AI systems. This article is categorized under: Algorithmic Development > Spatial and Temporal Data Mining Fundamental Concepts of Data and Knowledge > Big Data Mining Technologies > Artificial Intelligence Application Areas > Data Mining Software Tools. © 2021 Wiley Periodicals LLC.
A technique for parallel share-frequent sensor pattern mining from wireless sensor networks
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2014
- Type: Text , Conference paper
- Relation: 14th Annual International Conference on Computational Science, ICCS 2014; Cairns, Australia; 10th-12th June 2014; published in Procedia Computer Science p. 124-133
- Full Text:
- Reviewed:
- Description: WSNs generate huge amounts of data in the form of streams, and mining useful knowledge from these streams is a challenging task. Existing works generate sensor association rules using the occurrence frequency of patterns with binary frequency (either absent or present) or the support of a pattern as a criterion. However, the binary frequency or support of a pattern may not be a sufficient indicator for finding meaningful patterns from WSN data because it only reflects the number of epochs in the sensor data which contain that pattern. The share measure of sensorsets could discover useful knowledge about numerical values associated with sensors in a sensor database. Therefore, in this paper, we propose a new type of behavioral pattern called share-frequent sensor patterns by considering the non-binary frequency values of sensors in epochs. To discover share-frequent sensor patterns from a sensor dataset, we propose a novel parallel technique. In this technique, we develop a novel tree structure, called the parallel share-frequent sensor pattern tree (PShrFSP-tree), that is constructed at each local node independently by capturing the database contents to generate the candidate patterns using a pattern growth technique with a single scan; the locally generated candidate patterns are then merged at the final stage to generate global share-frequent sensor patterns. Comprehensive experimental results show that our proposed model is very efficient for mining share-frequent patterns from WSN data in terms of time and scalability.
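Note on the record above: the abstract contrasts support (how many epochs contain a pattern) with a share measure that takes the non-binary values of sensors into account, but it does not give the formula. The sketch below assumes the share definition commonly used in share-frequent itemset mining (the values contributed by the pattern's sensors in the epochs containing the whole pattern, divided by the total of all values in the database); the paper's exact definition may differ. The data and the names `epochs`, `support` and `share` are hypothetical, and no attempt is made to reproduce the PShrFSP-tree.

```python
from typing import Dict, Iterable, List

# Hypothetical sensor database: one dict per epoch, mapping a sensor id to
# its non-binary frequency value in that epoch.
epochs: List[Dict[str, int]] = [
    {"s1": 5, "s2": 1},
    {"s1": 1, "s3": 2},
    {"s1": 4, "s2": 3, "s3": 1},
    {"s2": 1, "s3": 1},
]


def support(pattern: Iterable[str], db: List[Dict[str, int]]) -> int:
    """Number of epochs that contain every sensor of the pattern."""
    sensors = list(pattern)
    return sum(1 for epoch in db if all(s in epoch for s in sensors))


def share(pattern: Iterable[str], db: List[Dict[str, int]]) -> float:
    """Assumed share measure: value mass of the pattern over total value mass."""
    sensors = list(pattern)
    total = sum(sum(epoch.values()) for epoch in db)
    covered = sum(
        sum(epoch[s] for s in sensors)
        for epoch in db
        if all(s in epoch for s in sensors)
    )
    return covered / total


if __name__ == "__main__":
    for pattern in (("s1", "s2"), ("s2", "s3")):
        print(pattern, support(pattern, epochs), round(share(pattern, epochs), 3))
```

In this toy database both patterns have the same support, yet their shares differ, which is the kind of distinction a purely binary criterion cannot express.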
A Survey on Behavioral Pattern Mining from Sensor Data in Internet of Things
- Authors: Rashid, Md Mamunur , Kamruzzaman, Joarder , Hassan, Mohammad , Shahriar Shafin, Sakib , Bhuiyan, Md Zakirul
- Date: 2020
- Type: Text , Journal article
- Relation: IEEE Access Vol. 8, no. (2020), p. 33318-33341
- Full Text:
- Reviewed:
- Description: The deployment of large-scale wireless sensor networks (WSNs) for the Internet of Things (IoT) applications is increasing day-by-day, especially with the emergence of smart city services. The sensor data streams generated from these applications are largely dynamic, heterogeneous, and often geographically distributed over large areas. For high-value use in business, industry and services, these data streams must be mined to extract insightful knowledge, such as about monitoring (e.g., discovering certain behaviors over a deployed area) or network diagnostics (e.g., predicting faulty sensor nodes). However, due to the inherent constraints of sensor networks and application requirements, traditional data mining techniques cannot be directly used to mine IoT data streams efficiently and accurately in real-time. In the last decade, a number of works have been reported in the literature proposing behavioral pattern mining algorithms for sensor networks. This paper presents the technical challenges that need to be considered for mining sensor data. It then provides a thorough review of the mining techniques proposed in the recent literature to mine behavioral patterns from sensor data in IoT, and their characteristics and differences are highlighted and compared. We also propose a behavioral pattern mining framework for IoT and discuss possible future research directions in this area. © 2013 IEEE.
From convex to nonconvex: A loss function analysis for binary classification
- Authors: Zhao, Lei , Mammadov, Musa , Yearwood, John
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010 p. 1281-1288
- Full Text:
- Reviewed:
- Description: Problems of data classification can be studied in the framework of regularization theory as ill-posed problems. In this framework, loss functions play an important role in the application of regularization theory to classification. In this paper, we review some important convex loss functions, including hinge loss, square loss, modified square loss, exponential loss and logistic regression loss, as well as some non-convex loss functions, such as sigmoid loss, ø-loss, ramp loss, normalized sigmoid loss, and the loss function of a 2-layer neural network. Based on the analysis of these loss functions, we propose a new differentiable non-convex loss function, called the smoothed 0-1 loss function, which is a natural approximation of the 0-1 loss function. To compare the performance of different loss functions, we propose two binary classification algorithms, one for convex loss functions and the other for non-convex loss functions. A set of experiments is run on several binary data sets from the UCI repository. The results show that the proposed smoothed 0-1 loss function is robust, especially for noisy data sets with many outliers. © 2010 IEEE.
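Note on the record above: several of the loss functions named in the abstract have standard textbook forms as functions of the margin m = y * f(x), with y in {-1, +1}. The sketch below uses those common forms, which may differ in scaling from the ones analysed in the paper, and it deliberately omits the proposed smoothed 0-1 loss because the abstract does not define it. All function names are hypothetical. Evaluating at a strongly negative margin, as produced by an outlier, shows why bounded non-convex losses are usually considered less sensitive to such points.

```python
import math


def zero_one(m: float) -> float:
    """0-1 loss: 1 for a non-positive margin, else 0."""
    return 1.0 if m <= 0 else 0.0


def hinge(m: float) -> float:
    """Convex hinge loss, unbounded as the margin goes to minus infinity."""
    return max(0.0, 1.0 - m)


def square(m: float) -> float:
    """Convex square loss on the margin."""
    return (1.0 - m) ** 2


def exponential(m: float) -> float:
    """Convex exponential loss."""
    return math.exp(-m)


def logistic(m: float) -> float:
    """Convex logistic regression loss (natural logarithm)."""
    return math.log(1.0 + math.exp(-m))


def sigmoid(m: float) -> float:
    """Non-convex sigmoid loss, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(m))


def ramp(m: float) -> float:
    """Non-convex ramp loss: the hinge loss clipped at 1."""
    return min(1.0, hinge(m))


if __name__ == "__main__":
    outlier_margin = -10.0  # a badly misclassified point
    for loss in (zero_one, hinge, square, exponential, logistic, sigmoid, ramp):
        print(f"{loss.__name__:12s} {loss(outlier_margin):.4f}")
```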
Structure learning of Bayesian networks using a new unrestricted dependency algorithm
- Authors: Taheri, Sona , Mammadov, Musa
- Date: 2012
- Type: Text , Conference proceedings
- Full Text:
- Description: Bayesian Networks have received extensive attention in data mining due to their efficiency and reasonable predictive accuracy. A Bayesian Network is a directed acyclic graph in which each node represents a variable and each arc a probabilistic dependency between two variables. Constructing a Bayesian Network from data is a learning process that is divided into two steps: structure learning and parameter learning. In many domains, the structure is not known a priori and must be inferred from data. This paper presents an iterative unrestricted dependency algorithm for learning the structure of Bayesian Networks for binary classification problems. Numerical experiments are conducted on several real world data sets, where continuous features are discretized by applying two different methods. The performance of the proposed algorithm is compared with the Naive Bayes, the Tree Augmented Naive Bayes, and the k
Visual grouping of association rules for hypotheses suggestion
- Authors: Ivkovic, Sasha
- Date: 2003
- Type: Text , Thesis , Masters
- Full Text:
- Description: The study describes a KDD method that is being used by non-technical experts with minimal training to discover and interpret patterns that they find useful for their role within their organisations.
- Description: Master of Information Technology
Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine
- Authors: Khraisat, Ansam , Gondal, Iqbal , Vamplew, Peter , Kamruzzaman, Joarder , Alazab, Ammar
- Date: 2020
- Type: Text , Journal article
- Relation: Electronics (Switzerland) Vol. 9, no. 1 (2020), p.
- Full Text:
- Reviewed:
- Description: Cyberattacks are becoming increasingly sophisticated, necessitating efficient intrusion detection mechanisms to monitor computer resources and generate reports on anomalous or suspicious activities. Many Intrusion Detection Systems (IDSs) use a single classifier for identifying intrusions. Single classifier IDSs are unable to achieve high accuracy and low false alarm rates due to polymorphic, metamorphic, and zero-day behaviors of malware. In this paper, a Hybrid IDS (HIDS) is proposed by combining the C5 decision tree classifier and One Class Support Vector Machine (OC-SVM). HIDS combines the strengths of the Signature-based Intrusion Detection System (SIDS) and the Anomaly-based Intrusion Detection System (AIDS). The SIDS was developed based on the C5.0 decision tree classifier and the AIDS was developed based on the one-class Support Vector Machine (SVM). This framework aims to identify both well-known intrusions and zero-day attacks with high detection accuracy and low false-alarm rates. The proposed HIDS is evaluated using benchmark datasets, namely the Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) and Australian Defence Force Academy (ADFA) datasets. Studies show that the performance of HIDS is enhanced compared to SIDS and AIDS in terms of detection rate and false-alarm rate. © 2020 by the authors. Licensee MDPI, Basel, Switzerland.