Themes in data mining, big data, and crime analytics
- Authors: Oatley, Giles
- Date: 2022
- Type: Text , Journal article , Review
- Relation: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery Vol. 12, no. 2 (2022), p.
- Full Text:
- Reviewed:
- Description: This article examines the impact of new AI-related technologies in data mining and big data on important research questions in crime analytics. Because the field is so broad, the review focuses on a selection of the most important topics. Challenges for information management, and in turn law and society, include: AI-powered predictive policing; big data for legal and adversarial decisions; bias using big data and analytics in profiling and predicting criminality; forecasting crime risk and crime rates; and regulating AI systems. This article is categorized under: Algorithmic Development > Spatial and Temporal Data Mining; Fundamental Concepts of Data and Knowledge > Big Data Mining; Technologies > Artificial Intelligence; Application Areas > Data Mining Software Tools. © 2021 Wiley Periodicals LLC.
A Survey on Behavioral Pattern Mining from Sensor Data in Internet of Things
- Rashid, Md Mamunur, Kamruzzaman, Joarder, Hassan, Mohammad, Shahriar Shafin, Sakib, Bhuiyan, Md Zakirul
- Authors: Rashid, Md Mamunur , Kamruzzaman, Joarder , Hassan, Mohammad , Shahriar Shafin, Sakib , Bhuiyan, Md Zakirul
- Date: 2020
- Type: Text , Journal article
- Relation: IEEE Access Vol. 8, no. (2020), p. 33318-33341
- Full Text:
- Reviewed:
- Description: The deployment of large-scale wireless sensor networks (WSNs) for Internet of Things (IoT) applications is increasing day by day, especially with the emergence of smart city services. The sensor data streams generated from these applications are largely dynamic, heterogeneous, and often geographically distributed over large areas. For high-value use in business, industry and services, these data streams must be mined to extract insightful knowledge, such as for monitoring (e.g., discovering certain behaviors over a deployed area) or network diagnostics (e.g., predicting faulty sensor nodes). However, due to the inherent constraints of sensor networks and application requirements, traditional data mining techniques cannot be directly used to mine IoT data streams efficiently and accurately in real time. In the last decade, a number of works in the literature have proposed behavioral pattern mining algorithms for sensor networks. This paper presents the technical challenges that need to be considered for mining sensor data. It then provides a thorough review of the mining techniques proposed in the recent literature to mine behavioral patterns from sensor data in IoT, and highlights and compares their characteristics and differences. We also propose a behavioral pattern mining framework for IoT and discuss possible future research directions in this area. © 2013 IEEE.
Clusterwise support vector linear regression
- Joki, Kaisa, Bagirov, Adil, Karmitsa, Napsu, Mäkelä, Marko, Taheri, Sona
- Authors: Joki, Kaisa , Bagirov, Adil , Karmitsa, Napsu , Mäkelä, Marko , Taheri, Sona
- Date: 2020
- Type: Text , Journal article
- Relation: European Journal of Operational Research Vol. 287, no. 1 (2020), p. 19-35
- Full Text:
- Reviewed:
- Description: In clusterwise linear regression (CLR), the aim is to simultaneously partition data into a given number of clusters and to find regression coefficients for each cluster. In this paper, we propose a novel approach to model and solve the CLR problem. The main idea is to utilize the support vector machine (SVM) approach to model the CLR problem by using the SVM for regression to approximate each cluster. This new formulation of the CLR problem is represented as an unconstrained nonsmooth optimization problem, where we minimize a difference of two convex (DC) functions. To solve this problem, a method based on the combination of the incremental algorithm and the double bundle method for DC optimization is designed. Numerical experiments are performed to validate the reliability of the new formulation for CLR and the efficiency of the proposed method. The results show that the SVM approach is suitable for solving CLR problems, especially, when there are outliers in data. © 2020 Elsevier B.V.
- Description: Funding details: Academy of Finland, 289500, 294002, 319274 Funding details: Turun Yliopisto Funding details: Australian Research Council, ARC, (Project no. DP190100580 ).
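The clusterwise linear regression (CLR) idea in the abstract above — simultaneously partitioning data into clusters and fitting one regression per cluster — can be sketched with a toy alternating scheme. This is an illustrative simplification under our own naming, not the paper's SVM/DC double bundle method:

```python
# Toy clusterwise linear regression: alternate between assigning each point to
# the line that fits it best and refitting each cluster's line by least squares.
def fit_line(points):
    # Ordinary least squares for y = a*x + b on a list of (x, y) pairs.
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / denom if denom else 0.0
    b = (sy - a * sx) / n
    return a, b

def clusterwise_regression(data, k=2, iters=20):
    # Seed the clusters by splitting the data into k contiguous chunks.
    n = len(data)
    chunks = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    lines = [fit_line(c) for c in chunks]
    for _ in range(iters):
        # Assign each point to the line with the smallest squared residual.
        groups = [[] for _ in range(k)]
        for x, y in data:
            j = min(range(k), key=lambda i: (y - lines[i][0] * x - lines[i][1]) ** 2)
            groups[j].append((x, y))
        lines = [fit_line(g) if len(g) >= 2 else lines[i] for i, g in enumerate(groups)]
    return lines
```

Unlike the paper's nonsmooth DC formulation, this naive alternation is sensitive to the initial split and offers no robustness to outliers; it only conveys the partition-and-regress structure of the problem.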
Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine
- Khraisat, Ansam, Gondal, Iqbal, Vamplew, Peter, Kamruzzaman, Joarder, Alazab, Ammar
- Authors: Khraisat, Ansam , Gondal, Iqbal , Vamplew, Peter , Kamruzzaman, Joarder , Alazab, Ammar
- Date: 2020
- Type: Text , Journal article
- Relation: Electronics (Switzerland) Vol. 9, no. 1 (2020), p.
- Full Text:
- Reviewed:
- Description: Cyberattacks are becoming increasingly sophisticated, necessitating efficient intrusion detection mechanisms to monitor computer resources and generate reports on anomalous or suspicious activities. Many Intrusion Detection Systems (IDSs) use a single classifier for identifying intrusions. Single-classifier IDSs are unable to achieve high accuracy and low false alarm rates due to polymorphic, metamorphic, and zero-day behaviors of malware. In this paper, a Hybrid IDS (HIDS) is proposed by combining the C5 decision tree classifier and a One-Class Support Vector Machine (OC-SVM). HIDS combines the strengths of a Signature-based Intrusion Detection System (SIDS) and an Anomaly-based Intrusion Detection System (AIDS). The SIDS was developed based on the C5.0 decision tree classifier, and the AIDS was developed based on the one-class Support Vector Machine (SVM). This framework aims to identify both well-known intrusions and zero-day attacks with high detection accuracy and low false-alarm rates. The proposed HIDS is evaluated using the benchmark datasets, namely, the Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) and Australian Defence Force Academy (ADFA) datasets. Studies show that the performance of HIDS is enhanced compared to SIDS and AIDS in terms of detection rate and low false-alarm rates. © 2020 by the authors. Licensee MDPI, Basel, Switzerland.
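The two-stage hybrid described above — a signature stage for known attacks followed by an anomaly stage for zero-days — can be illustrated with a minimal sketch. This is our own toy stand-in, not the paper's C5/OC-SVM implementation; the signature set, distance-based anomaly score and threshold are all hypothetical:

```python
# Toy hybrid IDS: a signature (SIDS-like) stage flags known-bad patterns first;
# events that pass are screened by an anomaly (AIDS-like) stage that compares
# them against a profile of normal behavior.
KNOWN_SIGNATURES = {"sql_injection", "port_scan"}  # hypothetical signature set

def anomaly_score(features, normal_mean):
    # Euclidean distance from the mean of normal traffic; a crude stand-in
    # for a trained one-class SVM decision function.
    return sum((f - m) ** 2 for f, m in zip(features, normal_mean)) ** 0.5

def hybrid_ids(event, normal_mean, threshold=3.0):
    if event["signature"] in KNOWN_SIGNATURES:
        return "known-intrusion"   # signature stage: exact match on known attacks
    if anomaly_score(event["features"], normal_mean) > threshold:
        return "anomaly"           # anomaly stage: possible zero-day behavior
    return "benign"
```

The design point the paper exploits is that the two stages fail differently: signatures miss novel attacks, while anomaly detectors raise false alarms on unusual-but-benign traffic, so ordering them reduces both error types.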
Performance analysis of different types of machine learning classifiers for non-technical loss detection
- Ghori, Khawaja, Abbasi, Rabeeh, Awais, Muhammad, Imran, Muhammad, Ullah, Ata, Szathmary, Laszlo
- Authors: Ghori, Khawaja , Abbasi, Rabeeh , Awais, Muhammad , Imran, Muhammad , Ullah, Ata , Szathmary, Laszlo
- Date: 2020
- Type: Text , Journal article
- Relation: IEEE Access Vol. 8, no. (2020), p. 16033-16048
- Full Text:
- Reviewed:
- Description: With the ever-growing demand for electric power, it is quite challenging to detect and prevent Non-Technical Loss (NTL) in power industries. NTL is committed by meter bypassing, hooking into main lines, and reversing or tampering with meters. Manual on-site checking and reporting of NTL remains an unattractive strategy due to the required manpower and associated cost. The use of machine learning classifiers has been an attractive option for NTL detection. It enables data-oriented analysis and a high hit ratio, along with lower cost and manpower requirements. However, there is still a need to explore the results across multiple types of classifiers on a real-world dataset. This paper considers a real dataset from a power supply company in Pakistan to identify NTL. We have evaluated 15 existing machine learning classifiers across 9 types, which also include the recently developed CatBoost, LGBoost and XGBoost classifiers. Our work is validated using extensive simulations. Results elucidate that ensemble methods and Artificial Neural Network (ANN) outperform the other types of classifiers for NTL detection on our real dataset. Moreover, we have also derived a procedure to identify the top 14 features out of a total of 71, which contribute 77% in predicting NTL. We conclude that including more features beyond this threshold does not improve performance, and thus limiting the classifiers to the selected feature set reduces their computation time. Last but not least, the paper also analyzes the results of the classifiers with respect to their types, which has opened a new area of research in NTL detection. © 2013 IEEE.
Accurate and efficient clustering algorithms for very large data sets
- Authors: Quddus, Syed
- Date: 2017
- Type: Text , Thesis , PhD
- Full Text:
- Description: The ability to mine and extract useful information from large data sets is a common concern for organizations. Data over the internet is increasing rapidly, and the importance of developing new approaches to collect, store and mine large amounts of data is growing significantly. Clustering is one of the main tasks in data mining. Many clustering algorithms have been proposed, but there are still clustering problems that have not been addressed in depth, especially clustering problems in large data sets. Clustering in large data sets is important in many applications, including network intrusion detection systems, fraud detection in banking systems, air traffic control, web logs, sensor networks, social networks and bioinformatics. Data sets in these applications contain from hundreds of thousands to hundreds of millions of data points, and they may contain hundreds or thousands of attributes. Recent developments in computer hardware allow data sets with hundreds of thousands and even millions of data points to be stored in random access memory and read repeatedly. This makes it possible to apply existing clustering algorithms to such data sets. However, these algorithms require a prohibitively large CPU time and fail to produce an accurate solution. Therefore, it is important to develop clustering algorithms that are accurate and can provide real-time clustering in such data sets. This is especially important in the big data era. The aim of this PhD study is to develop accurate, real-time algorithms for clustering very large data sets containing hundreds of thousands and millions of data points. Such algorithms are developed based on the combination of heuristic algorithms with the incremental approach. These algorithms also involve a special procedure to identify dense areas in a data set and compute a subset of the most informative representative data points in order to decrease the size of a data set.
This PhD study also aims to develop center-based clustering algorithms. The success of these algorithms strongly depends on the choice of starting cluster centers, and different procedures are proposed to generate such centers. Special procedures are designed to identify the most promising starting cluster centers and to restrict their number. The new clustering algorithms are evaluated using large data sets available in the public domain, and their results are compared with those obtained using several existing center-based clustering algorithms.
- Description: Doctor of Philosophy
An efficient data extraction framework for mining wireless sensor networks
- Rashid, Md. Mamunur, Gondal, Iqbal, Kamruzzaman, Joarder
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2016
- Type: Text , Conference paper
- Relation: 23rd International Conference, ICONIP 2016; Kyoto, Japan; 16th-21st October 2016; published in Neural Information Processing, Part III (Lecture Notes in Computer Science series) Vol. 9949, p. 491-498
- Full Text:
- Reviewed:
- Description: Behavioral patterns for sensors have received a great deal of attention recently due to their usefulness in capturing the temporal relations between sensors in wireless sensor networks. To discover these patterns, we need to collect the behavioral data that represents the sensors' activities over time from the sensor database attached to a well-equipped central node, called the sink, for further analysis. However, given the limited resources of sensor nodes, an effective data collection method is required for collecting the behavioral data efficiently. In this paper, we introduce a new framework for behavioral patterns called associated-correlated sensor patterns, and we also propose a new MapReduce-based paradigm for extracting data from the wireless sensor network in a distributed way. An extensive performance study shows that the proposed method can reduce the data size by almost 50% compared to the centralized model.
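The distributed extraction idea in the abstract above — local nodes summarize their own sensor data and only the summaries travel to the sink — follows the usual MapReduce shape. A toy sketch (our own illustration, not the paper's framework; the epoch representation is assumed):

```python
# Toy MapReduce-style sensor data extraction: each local node maps its epochs
# to (sensorset, count) pairs, and a reduce step merges the per-node counts
# into global counts at the sink.
from collections import Counter

def map_node(epochs):
    # An epoch is modeled as the set of sensors that triggered together.
    return Counter(frozenset(e) for e in epochs)

def reduce_counts(local_counts):
    # Merge the candidate counts produced independently by each node.
    total = Counter()
    for c in local_counts:
        total.update(c)
    return total
```

Shipping compact counts instead of raw epochs is what lets such a scheme cut the volume of data sent to the sink, which is the source of the roughly 50% reduction the paper reports against the centralized model.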
Economic resilience of regions under crises : A study of the Australian economy
- Courvisanos, Jerry, Jain, Ameeta, Mardaneh, Karim
- Authors: Courvisanos, Jerry , Jain, Ameeta , Mardaneh, Karim
- Date: 2016
- Type: Text , Journal article
- Relation: Regional Studies Vol. 50, no. 4 (2016), p. 629-643
- Full Text:
- Reviewed:
- Description: Economic resilience of regions under crises: a study of the Australian economy, Regional Studies. Identifying patterns of economic resilience in regions by industry categories is the focus of this paper. Patterns emerge from adaptive capacity in four distinct functional groups of local government regions in Australia, in respect of their resilience from shocks on specific industries. A model of regional adaptive cycles around four sequential phases - reorganization, exploitation, conservation and release - is adopted as the framework for recognizing such patterns. A data-mining method utilizes a k-means algorithm to evaluate the impact of two major shocks - a 13-year drought and the Global Financial Crisis - on four functional groups of regions, using census data from 2001, 2006 and 2011. © 2015 Regional Studies Association.
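The k-means step used in the study above can be illustrated with a minimal Lloyd's-algorithm sketch. For simplicity this version clusters a single numeric attribute; the study itself clusters multi-attribute census data for local government regions, and the seeding rule here is our own choice:

```python
# Minimal 1-D Lloyd's k-means: alternate between assigning each value to its
# nearest center and recomputing each center as its group's mean.
def kmeans_1d(values, k=2, iters=50):
    # Seed centers by taking evenly spaced values from the sorted data.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[j].append(v)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return sorted(centers)
```

In the paper's setting, the resulting cluster memberships across the 2001, 2006 and 2011 censuses are what allow regions to be grouped by how their industry mix responded to the drought and the Global Financial Crisis.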
A complete list of kernels used in support vector machines
- Authors: Zhang, Jiapu
- Date: 2015
- Type: Text , Journal article
- Relation: Biochemistry & Pharmacology Vol. 4, no. 5 (2015), p. 1-2
- Full Text:
- Reviewed:
- Description: In bioinformatics and chemoinformatics, data mining with support vector machines (SVMs) is frequently needed because of the large databases involved. Kernels play an important role in SVMs, so it is useful to list all the kernels of SVMs currently in use.
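The most widely used members of the kind of kernel list this note compiles are the linear, polynomial, RBF (Gaussian) and sigmoid kernels, in their standard textbook forms. A small sketch (parameter defaults are our own illustrative choices):

```python
import math

# Standard SVM kernel functions k(x, y) on numeric feature vectors.
def linear(x, y):
    return sum(a * b for a, b in zip(x, y))          # <x, y>

def polynomial(x, y, degree=3, coef0=1.0):
    return (linear(x, y) + coef0) ** degree          # (<x, y> + c)^d

def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))  # exp(-g||x-y||^2)

def sigmoid(x, y, alpha=0.1, coef0=0.0):
    return math.tanh(alpha * linear(x, y) + coef0)   # tanh(a<x, y> + c)
```

Each of these substitutes for the inner product in the SVM dual problem, which is why the choice of kernel, not the SVM machinery itself, determines what structure the classifier can exploit in large biochemical databases.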
A technique for parallel share-frequent sensor pattern mining from wireless sensor networks
- Rashid, Md. Mamunur, Gondal, Iqbal, Kamruzzaman, Joarder
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2014
- Type: Text , Conference paper
- Relation: 14th Annual International Conference on Computational Science, ICCS 2014; Cairns, Australia; 10th-12th June 2014; published in Procedia Computer Science p. 124-133
- Full Text:
- Reviewed:
- Description: WSNs generate huge amounts of data in the form of streams, and mining useful knowledge from these streams is a challenging task. Existing works generate sensor association rules using the occurrence frequency of patterns, with binary frequency (either absent or present) or the support of a pattern as the criterion. However, the binary frequency or support of a pattern may not be a sufficient indicator for finding meaningful patterns in WSN data, because it only reflects the number of epochs in the sensor data which contain that pattern. The share measure of sensorsets can discover useful knowledge about the numerical values associated with sensors in a sensor database. Therefore, in this paper, we propose a new type of behavioral pattern, called share-frequent sensor patterns, that considers the non-binary frequency values of sensors in epochs. To discover share-frequent sensor patterns from a sensor dataset, we propose a novel parallel technique. In this technique, we develop a novel tree structure, called the parallel share-frequent sensor pattern tree (PShrFSP-tree), which is constructed independently at each local node, capturing the database contents in a single scan to generate candidate patterns using a pattern-growth technique, and then merges the locally generated candidate patterns at the final stage to generate global share-frequent sensor patterns. Comprehensive experimental results show that our proposed model is very efficient for mining share-frequent patterns from WSN data in terms of time and scalability.
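The contrast the abstract draws between binary support and the share measure can be made concrete with a toy computation. This is an illustrative reading of the share idea under our own data model (epochs as sensor-to-value maps), not the paper's exact definition or its PShrFSP-tree:

```python
# Toy share measure: the share of a sensorset is the sum of its sensors'
# numeric trigger values over the epochs containing the whole set, relative to
# the total value in the database -- not a binary present/absent count.
def share(sensorset, epochs):
    total = sum(sum(e.values()) for e in epochs)
    contrib = sum(sum(e[s] for s in sensorset)
                  for e in epochs
                  if all(s in e for s in sensorset))
    return contrib / total if total else 0.0
```

Two sensorsets appearing in the same number of epochs can thus have very different shares if one accounts for much larger trigger values, which is exactly the information a support-only criterion discards.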
Novel data mining techniques for incompleted clinical data in diabetes management
- Jelinek, Herbert, Yatsko, Andrew, Stranieri, Andrew, Venkatraman, Sitalakshmi
- Authors: Jelinek, Herbert , Yatsko, Andrew , Stranieri, Andrew , Venkatraman, Sitalakshmi
- Date: 2014
- Type: Text , Journal article
- Relation: British Journal of Applied Science & Technology Vol. 4, no. 33 (2014), p. 4591-4606
- Relation: https://doi.org/10.9734/BJAST/2014/11744
- Full Text:
- Reviewed:
- Description: An important part of health care involves the upkeep and interpretation of medical databases containing patient records for clinical decision making, diagnosis and follow-up treatment. Missing clinical entries make it difficult to apply data mining algorithms for clinical decision support. This study demonstrates that higher predictive accuracy is possible using conventional data mining algorithms if missing values are dealt with appropriately. We propose a novel algorithm using a convolution of sub-problems to stage a super-problem, where classes are defined by the Cartesian product of the class values of the underlying problems, and Incomplete Information Dismissal and Data Completion techniques are applied for reducing features and imputing missing values. Predictive accuracies using Decision Branch, Nearest Neighborhood and Naïve Bayesian classifiers were compared to predict diabetes, cardiovascular disease and hypertension. Data is derived from the Diabetes Screening Complications Research Initiative (DiScRi) conducted at a regional Australian university, involving more than 2400 patient records with more than one hundred clinical risk factors (attributes). The results show substantial improvements in the accuracy achieved with each classifier for an effective diagnosis of diabetes, cardiovascular disease and hypertension, compared to those achieved without substituting missing values. The gain in improvement is 7% for diabetes, 21% for cardiovascular disease and 24% for hypertension, and our integrated novel approach has resulted in more than 90% accuracy for the diagnosis of any of the three conditions. This work advances data mining research towards achieving an integrated and holistic management of diabetes.
Classification systems based on combinatorial semigroups
- Abawajy, Jemal, Kelarev, Andrei
- Authors: Abawajy, Jemal , Kelarev, Andrei
- Date: 2013
- Type: Text , Journal article
- Relation: Semigroup Forum Vol. 86, no. 3 (2013), p. 603-612
- Full Text:
- Reviewed:
- Description: The present article continues the investigation of constructions essential for applications of combinatorial semigroups to the design of multiple classification systems in data mining. Our main theorem gives a complete description of all optimal classification systems defined by one-sided ideals in a construction based on combinatorial Rees matrix semigroups. It strengthens and generalizes previous results, which handled the narrower case of two-sided ideals. © 2012 Springer Science+Business Media New York.
- Description: 2003011021
Rule-based classifiers and meta classifiers for identification of cardiac autonomic neuropathy progression
- Jelinek, Herbert, Kelarev, Andrei, Stranieri, Andrew, Yearwood, John
- Authors: Jelinek, Herbert , Kelarev, Andrei , Stranieri, Andrew , Yearwood, John
- Date: 2012
- Type: Text , Journal article
- Relation: International Journal of Information Science and Computer Mathematics Vol. 5, no. 2 (2012), p. 49-53
- Full Text:
- Reviewed:
- Description: We investigate and compare several rule-based classifiers and meta classifiers in their ability to obtain multi-class classifications of cardiac autonomic neuropathy (CAN) and its progression. The best results obtained in our experiments are significantly better than the outcomes published previously in the literature for analogous CAN identification tasks or simpler binary classification tasks.
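To illustrate what a rule-based classifier of this kind looks like, here is a minimal "first matching rule wins" sketch. The attribute names, thresholds and class labels are invented for illustration and are not the paper's actual CAN rules:

```python
# Illustrative rule-based multi-class classifier: rules are checked in order
# and the first matching rule assigns the label. Attributes, thresholds and
# labels are invented, not the paper's CAN rules.
RULES = [
    (lambda p: p["hrv"] < 20 and p["age"] > 60, "severe_CAN"),
    (lambda p: p["hrv"] < 40, "early_CAN"),
]

def classify(patient, rules=RULES, default="no_CAN"):
    for condition, label in rules:
        if condition(patient):
            return label
    return default
```

A meta classifier in this setting would combine the outputs of several such base classifiers, for example by majority vote.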
Structure learning of Bayesian networks using a new unrestricted dependency algorithm
- Taheri, Sona, Mammadov, Musa
- Authors: Taheri, Sona , Mammadov, Musa
- Date: 2012
- Type: Text , Conference proceedings
- Full Text:
- Description: Bayesian Networks have received extensive attention in data mining due to their efficiency and reasonable predictive accuracy. A Bayesian Network is a directed acyclic graph in which each node represents a variable and each arc a probabilistic dependency between two variables. Constructing a Bayesian Network from data is a learning process divided into two steps: structure learning and parameter learning. In many domains, the structure is not known a priori and must be inferred from data. This paper presents an iterative unrestricted dependency algorithm for learning the structure of Bayesian Networks for binary classification problems. Numerical experiments are conducted on several real-world data sets, where continuous features are discretized by applying two different methods. The performance of the proposed algorithm is compared with the Naive Bayes, the Tree Augmented Naive Bayes, and the k
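For context, the Naive Bayes baseline this paper compares against corresponds to the simplest network structure: no arcs between features given the class. A minimal sketch for binary features, with invented training data:

```python
# Minimal Naive Bayes for binary features and classes: the no-dependency
# baseline structure that dependency-learning algorithms relax.
from math import log

def train_nb(X, y, alpha=1.0):
    classes = sorted(set(y))
    n_feat = len(X[0])
    prior = {c: y.count(c) / len(y) for c in classes}
    # cond[c][j] = P(x_j = 1 | class c), Laplace-smoothed with alpha
    cond = {c: [(sum(x[j] for x, t in zip(X, y) if t == c) + alpha)
                / (y.count(c) + 2 * alpha) for j in range(n_feat)]
            for c in classes}
    return prior, cond

def predict_nb(model, x):
    prior, cond = model
    def score(c):
        s = log(prior[c])
        for j, v in enumerate(x):
            p = cond[c][j]
            s += log(p if v == 1 else 1 - p)
        return s
    return max(prior, key=score)

X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
model = train_nb(X, y)
```

A structure-learning algorithm would additionally introduce arcs among the features themselves, so each conditional table could depend on parent features as well as the class.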
Efficient piecewise linear classifiers and applications
- Authors: Webb, Dean
- Date: 2011
- Type: Text , Thesis , PhD
- Full Text:
- Description: Supervised learning has become an essential part of data mining for industry, military, science and academia. Classification, a type of supervised learning, allows a machine to learn from data and then predict certain behaviours, variables or outcomes. Classification can be used to solve many problems, including the detection of malignant cancers, potentially bad creditors and even enabling autonomy in robots. The ability to collect and store large amounts of data has increased significantly over the past few decades. However, the ability of classification techniques to deal with large scale data has not been matched. Many data transformation and reduction schemes have been tried with mixed success. The problem is further exacerbated in real time classification in embedded systems, where the classifier must operate using only limited processing, memory and power resources. Piecewise linear boundaries are known to provide efficient real time classifiers. They have low memory requirements, require little processing effort, are parameterless and classify in real time. Piecewise linear functions are used to approximate non-linear decision boundaries between pattern classes. Finding these piecewise linear boundaries is a difficult optimization problem that can require a long training time. Multiple optimization approaches have been used for real time classification, but can lead to suboptimal piecewise linear boundaries. This thesis develops three real time piecewise linear classifiers that deal with large scale data. Each classifier uses a single optimization algorithm in conjunction with an incremental approach that reduces the number of points as the decision boundaries are built. Two of the classifiers further reduce complexity by augmenting the incremental approach with additional schemes. One scheme uses hyperboxes to identify points inside the so-called "indeterminate" regions. The other uses a polyhedral conic set to identify data points lying on or close to the boundary. All other points are excluded from the process of building the decision boundaries. The three classifiers are applied to real time data classification problems and the results of numerical experiments on real world data sets are reported. These results demonstrate that the new classifiers require a reasonable training time and that their test set accuracy is consistently good on most data sets compared with current state-of-the-art classifiers.
- Description: Doctor of Philosophy
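The decision rule behind a piecewise linear classifier can be sketched briefly: each class owns a set of linear pieces, and a point goes to the class whose best piece scores highest, which yields a piecewise linear boundary. The weights below are invented, and this shows only the decision rule, not the thesis's incremental training algorithm:

```python
# Hypothetical sketch of a piecewise linear decision rule: assign x to the
# class whose best (maximum-scoring) linear piece w.x + b is largest.
def pwl_classify(pieces, x):
    """pieces: {label: [(w, b), ...]} where w is a weight tuple, b a bias."""
    def best(fs):
        return max(sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in fs)
    return max(pieces, key=lambda c: best(pieces[c]))

pieces = {
    "A": [((1.0, 0.0), 0.0)],                        # scores x[0]
    "B": [((0.0, 1.0), 0.0), ((-1.0, 0.0), 1.0)],    # max(x[1], 1 - x[0])
}
```

Evaluating such a rule needs only a handful of dot products, which is why the thesis stresses its suitability for embedded, real time settings.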
Optimization of classifiers for data mining based on combinatorial semigroups
- Kelarev, Andrei, Yearwood, John, Watters, Paul
- Authors: Kelarev, Andrei , Yearwood, John , Watters, Paul
- Date: 2011
- Type: Text , Journal article
- Relation: Semigroup Forum Vol. 82, no. 2 (2011), p. 1-10
- Full Text:
- Reviewed:
- Description: The aim of the present article is to obtain a theoretical result essential for applications of combinatorial semigroups for the design of multiple classification systems in data mining. We consider a novel construction of multiple classification systems, or classifiers, combining several binary classifiers. The construction is based on combinatorial Rees matrix semigroups without any restrictions on the sandwich-matrix. Our main theorem gives a complete description of all optimal classifiers in this novel construction. © 2011 Springer Science+Business Media, LLC.
Zero-day malware detection based on supervised learning algorithms of API call signatures
- Alazab, Mamoun, Venkatraman, Sitalakshmi, Watters, Paul, Alazab, Moutaz
- Authors: Alazab, Mamoun , Venkatraman, Sitalakshmi , Watters, Paul , Alazab, Moutaz
- Date: 2011
- Type: Text , Conference proceedings
- Full Text:
- Description: Zero-day or unknown malware is created using code obfuscation techniques that modify the parent code to produce offspring copies which have the same functionality but different signatures. Current techniques reported in the literature lack the capability to detect zero-day malware with the required accuracy and efficiency. In this paper, we propose and evaluate a novel method of employing several data mining techniques to detect and classify zero-day malware with high levels of accuracy and efficiency based on the frequency of Windows API calls. This paper describes the methodology employed for the collection of large data sets to train the classifiers, and analyses the performance results of the various data mining algorithms adopted for the study using a fully automated tool developed in this research to conduct the various experimental investigations and evaluation. Through the performance results of these algorithms from our experimental analysis, we are able to evaluate and discuss the advantages of one data mining algorithm over another for accurately detecting zero-day malware. The data mining framework employed in this research learns by analysing the behavior of existing malicious and benign codes in large datasets. We have employed robust classifiers, namely the Naïve Bayes (NB) algorithm, k-Nearest Neighbor (kNN) algorithm, Sequential Minimal Optimization (SMO) algorithm with four different kernels (SMO - Normalized PolyKernel, SMO - PolyKernel, SMO - Puk, and SMO - Radial Basis Function (RBF)), Backpropagation Neural Networks algorithm, and J48 decision tree, and have evaluated their performance. Overall, the automated data mining system implemented for this study has achieved a high true positive (TP) rate of more than 98.5% and a low false positive (FP) rate of less than 0.025, which has not been achieved in the literature so far. This is much higher than the required commercial acceptance level, indicating that our novel technique is a major leap forward in detecting zero-day malware. This paper also offers future directions for researchers in exploring different aspects of obfuscation affecting the IT world today. © 2011, Australian Computer Society, Inc.
- Description: 2003009506
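The feature-extraction step this paper builds on, turning an API call trace into a frequency vector over a fixed vocabulary, can be sketched as follows; the API names are illustrative examples of Windows calls, not the paper's actual feature set:

```python
# Sketch of the feature step only: map a Windows API call trace to a
# frequency vector over a fixed vocabulary. Vocabulary is illustrative.
from collections import Counter

API_VOCAB = ["CreateFileW", "WriteFile", "RegSetValueExW", "VirtualAlloc"]

def api_frequency_vector(trace, vocab=API_VOCAB):
    counts = Counter(trace)
    return [counts[api] for api in vocab]

trace = ["VirtualAlloc", "WriteFile", "VirtualAlloc", "CreateFileW"]
vec = api_frequency_vector(trace)
```

Vectors of this form are what the classifiers named in the abstract (NB, kNN, SMO, backpropagation networks, J48) would then be trained on.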
Application of optimisation-based data mining techniques to tobacco control dataset
- Dzalilov, Zari, Zhang, J, Bagirov, Adil, Mammadov, Musa
- Authors: Dzalilov, Zari , Zhang, J , Bagirov, Adil , Mammadov, Musa
- Date: 2010
- Type: Text , Journal article
- Relation: International Journal of Lean Thinking Vol. 1, no. 1 (2010), p. 27-41
- Full Text: false
- Reviewed:
- Description: Tobacco smoking is one of the leading causes of death around the world. Consequently, control of tobacco use is an important global public health issue. Tobacco control may be aided by development of theoretical and methodological frameworks for describing and understanding complex tobacco control systems. Linear regression and logistic regression are currently very popular statistical techniques for modeling and analyzing complex data in tobacco control systems. However, in tobacco markets, numerous interrelated factors nontrivially interact with tobacco control policies, such that policies and control outcomes are nonlinearly related.
Automatic sleep stage identification: difficulties and possible solutions
- Sukhorukova, Nadezda, Stranieri, Andrew, Ofoghi, Bahadorreza, Vamplew, Peter, Saleem, Muhammad Saad, Ma, Liping, Ugon, Adrien, Ugon, Julien, Muecke, Nial, Amiel, Hélène, Philippe, Carole, Bani-Mustafa, Ahmed, Huda, Shamsul, Bertoli, Marcello, Levy, P, Ganascia, J.G
- Authors: Sukhorukova, Nadezda , Stranieri, Andrew , Ofoghi, Bahadorreza , Vamplew, Peter , Saleem, Muhammad Saad , Ma, Liping , Ugon, Adrien , Ugon, Julien , Muecke, Nial , Amiel, Hélène , Philippe, Carole , Bani-Mustafa, Ahmed , Huda, Shamsul , Bertoli, Marcello , Levy, P , Ganascia, J.G
- Date: 2010
- Type: Text , Conference proceedings
- Full Text:
- Description: The diagnosis of many sleep disorders is a labour intensive task that involves the specialised interpretation of numerous signals, including brain wave, breath and heart rate, captured in overnight polysomnogram sessions. The automation of diagnoses is challenging for data mining algorithms because the data sets are extremely large and noisy, the signals are complex and specialists' analyses vary. This work reports on the adaptation of approaches from four fields (neural networks, mathematical optimisation, financial forecasting and frequency domain analysis) to the problem of automatically determining a patient's stage of sleep. Results, though preliminary, are promising and indicate that combined approaches may prove more fruitful than reliance on a single approach.
From convex to nonconvex: A loss function analysis for binary classification
- Zhao, Lei, Mammadov, Musa, Yearwood, John
- Authors: Zhao, Lei , Mammadov, Musa , Yearwood, John
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010 p. 1281-1288
- Full Text:
- Reviewed:
- Description: Problems of data classification can be studied in the framework of regularization theory as ill-posed problems. In this framework, loss functions play an important role in the application of regularization theory to classification. In this paper, we review some important convex loss functions, including hinge loss, square loss, modified square loss, exponential loss and logistic regression loss, as well as some non-convex loss functions, such as sigmoid loss, ø-loss, ramp loss, normalized sigmoid loss, and the loss function of a 2-layer neural network. Based on the analysis of these loss functions, we propose a new differentiable non-convex loss function, called the smoothed 0-1 loss function, which is a natural approximation of the 0-1 loss function. To compare the performance of different loss functions, we propose two classification algorithms for binary classification, one for convex loss functions and the other for non-convex loss functions. A set of experiments is conducted on several binary data sets from the UCI repository. The results show that the proposed smoothed 0-1 loss function is robust, especially for noisy data sets with many outliers. © 2010 IEEE.
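One common way to smooth the 0-1 loss (the paper's exact functional form is not reproduced in this abstract, so the sharpness parameter below is an assumption) is a steep logistic in the margin t = y * f(x), shown here next to the hinge loss for contrast:

```python
# A plausible smooth surrogate for the 0-1 loss: a steep logistic in the
# margin t = y * f(x). As k grows it approaches the 0-1 step; the choice
# k = 10 is illustrative, not taken from the paper.
from math import exp

def smoothed_01_loss(t, k=10.0):
    """~1 for strongly misclassified points (t << 0), ~0 for t >> 0."""
    return 1.0 / (1.0 + exp(k * t))

def hinge_loss(t):
    """Convex comparison point: grows without bound as t decreases."""
    return max(0.0, 1.0 - t)
```

Because the smoothed 0-1 loss saturates at 1 for badly misclassified points, a single outlier cannot dominate the objective the way it can under an unbounded convex loss, which is one intuition for the robustness result the abstract reports.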