Categorical features transformation with compact one-hot encoder for fraud detection in distributed environment
- Authors: Ul Haq, Ikram , Gondal, Iqbal , Vamplew, Peter , Brown, Simon
- Date: 2019
- Type: Text , Conference proceedings , Conference paper
- Relation: 2019 16th Australasian Conference on Data Mining, AusDM 2018; Bathurst, NSW; 28 November 2018 through 30 November 2018 Vol. 996, p. 69-80
- Full Text: false
- Reviewed:
- Description: Fraud detection for online banking is an important research area, but one of the challenges is the heterogeneous nature of transactions data i.e. a combination of numeric as well as mixed attributes. Usually, numeric format data gives better performance for classification, regression and clustering algorithms. However, many machine learning problems have categorical, or nominal features, rather than numeric features only. In addition, some machine learning platforms such as Apache Spark accept numeric data only. One-hot Encoding (OHE) is a widely used approach for transforming categorical features to numerical features in traditional data mining tasks. The one-hot approach has some challenges as well: the sparseness of the transformed data and that the distinct values of an attribute are not always known in advance. Other than the model accuracy, compactness of machine learning models is equally important due to growing memory and storage needs. This paper presents an innovative technique to transform categorical features to numeric features by compacting sparse data even if all the distinct values are not known. The transformed data can be used for the development of fraud detection systems. The accuracy of the results has been validated on synthetic and real bank fraud data and a publicly available anomaly detection (KDD-99) dataset on a multi-node data cluster. © Springer Nature Singapore Pte Ltd. 2019.
Sliding window-based regularly frequent patterns mining over sensor data streams
- Authors: Rashid, Md Mamunur , Kamruzzaman, Joarder , Wasimi, Saleh
- Date: 2019
- Type: Text , Conference paper
- Relation: 2019 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2019, Melbourne, 9 December 2019 through 11 December 2019
- Full Text: false
- Reviewed:
- Description: WSNs generate a large amount of data in the form of data stream; and mining these streams with well used support metric-based sensor association rule mechanism can result in extracting interesting knowledge. Support metric-based sensor association use occurrence frequency of pattern as criteria, but the occurrence frequency of a pattern may not be an appropriate criterion for finding significant patterns. However, temporal regularity in occurrence behavior can be considered as another important measure for assessing the importance of patterns in WSNs. A frequent pattern that occurs after regular intervals in WSNs called as regularly frequent sensor patterns. Even though mining regularly frequent sensor patterns from sensor data stream is extremely required in real-time applications, no such algorithm has been proposed yet. Therefore, in this paper we propose a novel tree structure, called regular frequent sensor pattern stream tree (RFSPS-tree) and a new technique, called regularly frequent sensor pattern mining of data stream (RFSPMS), using sliding window-based regularly frequent sensor pattern mining for WSNs. By capturing the useful knowledge of the data stream into an RFSPS-tree, our RFSPMS algorithm can mine associated sensor patterns in the current window with frequent pattern (FP)-growth like pattern-growth method. Extensive experimental analyses show that our technique is very efficient in discovering regularly frequent sensor patterns over sensor data stream. © 2019 IEEE.
An efficient data extraction framework for mining wireless sensor networks
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2016
- Type: Text , Conference paper
- Relation: 23rd International Conference, ICONIP 2016; Kyoto, Japan; 16th-21st October 2016; published in Neural Information Processing, Part III (Lecture Notes in Computer Science series) Vol. 9949, p. 491-498
- Full Text:
- Reviewed:
- Description: Behavioral patterns for sensors have received a great deal of attention recently due to their usefulness in capturing the temporal relations between sensors in wireless sensor networks. To discover these patterns, we need to collect the behavioral data that represents the sensor's activities over time from the sensor database that attached with a well-equipped central node called sink for further analysis. However, given the limited resources of sensor nodes, an effective data collection method is required for collecting the behavioral data efficiently. In this paper, we introduce a new framework for behavioral patterns called associated-correlated sensor patterns and also propose a MapReduce based new paradigm for extract data from the wireless sensor network by distributed away. Extensive performance study shows that the proposed method is capable to reduce the data size almost 50% compared to the centralized model.
A mapreduce based technique for mining behavioral patterns from sensor data
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2015
- Type: Text , Conference paper
- Relation: 22nd International Conference on Neural Information Processing, ICONIP 2015; Istanbul, Turkey; 9th-12th November 2015 Vol. 9492, p. 145-153
- Full Text: false
- Reviewed:
- Description: WSNs generate a large amount of data in the form of streams, and temporal regularity in occurrence behavior is considered as an important measure for assessing the importance of patterns in WSN data. A frequent sensor pattern that occurs after regular intervals in WSNs is called regularly frequent sensor patterns (RFSPs). Existing RFSPs techniques assume that the data structure of the mining task is small enough to fit in the main memory of a processor. However, given the emergence of the Internet of Things (IoT), WSNs in future will generate huge volume of data, which means such an assumption does not hold any longer. To overcome this, a distributed solution using MapReduce model has not yet been explored extensively. Since MapReduce is becoming the de-facto model for computation on large data, an efficient RFSPs mining algorithm on this model is likely to provide a highly effective solution. In this work, we propose a regularly frequent sensor patterns mining algorithm called RFSP-H which uses MapReduce based framework. Extensive performance analyses show that our technique is significantly time efficient in finding regularly frequent sensor patterns. © Springer International Publishing Switzerland 2015.
Mining malware to detect variants
- Authors: Azab, Ahmad , Layton, Robert , Alazab, Mamoun , Oliver, Jonathan
- Date: 2015
- Type: Text , Conference paper
- Relation: 5th Cybercrime and Trustworthy Computing Conference, CTC 2014; Aukland, New Zealand; 24th-25th November 2014 p. 44-53
- Full Text: false
- Reviewed:
- Description: Cybercrime continues to be a growing challenge and malware is one of the most serious security threats on the Internet today which have been in existence from the very early days. Cyber criminals continue to develop and advance their malicious attacks. Unfortunately, existing techniques for detecting malware and analysing code samples are insufficient and have significant limitations. For example, most of malware detection studies focused only on detection and neglected the variants of the code. Investigating malware variants allows antivirus products and governments to more easily detect these new attacks, attribution, predict such or similar attacks in the future, and further analysis. The focus of this paper is performing similarity measures between different malware binaries for the same variant utilizing data mining concepts in conjunction with hashing algorithms. In this paper, we investigate and evaluate using the Trend Locality Sensitive Hashing (TLSH) algorithm to group binaries that belong to the same variant together, utilizing the k-NN algorithm. Two Zeus variants were tested, TSPY-ZBOT and MAL-ZBOT to address the effectiveness of the proposed approach. We compare TLSH to related hashing methods (SSDEEP, SDHASH and NILSIMSA) that are currently used for this purpose. Experimental evaluation demonstrates that our method can effectively detect variants of malware and resilient to common obfuscations used by cyber criminals. Our results show that TLSH and SDHASH provide the highest accuracy results in scoring an F-measure of 0.989 and 0.999 respectively. © 2014 IEEE.
A novel algorithm for mining behavioral patterns from wireless sensor networks
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2014
- Type: Text , Conference paper
- Relation: 2014 International Joint Conference on Neural Networks, IJCNN 2014; Beijing, China; 6th-11th July 2014 p. 1-7
- Full Text: false
- Reviewed:
- Description: Due to recent advances in wireless sensor networks (WSNs) and their ability to generate huge amount of data in the form of streams, knowledge discovery techniques have received a great deal of attention to extract useful knowledge regarding the underlying network. Traditionally sensor association rules measure occurrence frequency of patterns. However, these rules often generate a huge number of rules, most of which are non-informative or fail to reflect the true correlation among data objects. In this paper, we propose a new type of sensor behavioral pattern called associated sensor patterns that captures association-like co-occurrences and the strong temporal correlations implied by such co-occurrences in the sensor data. We also propose a novel tree structure called as associated sensor pattern tree (ASPT) and a mining algorithm, associated sensor pattern (ASP) which facilitates frequent pattern (FP) growth-based technique to generate all associated sensor patterns from WSN data with only one scan over the sensor database. Extensive performance study shows that our algorithm is very efficient in finding associated sensor patterns than the existing significant algorithms.
A technique for parallel share-frequent sensor pattern mining from wireless sensor networks
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2014
- Type: Text , Conference paper
- Relation: 14th Annual International Conference on Computational Science, ICCS 2014; Cairns, Australia; 10th-12th June 2014; published in Procedia Computer Science p. 124-133
- Full Text:
- Reviewed:
- Description: WSNs generate huge amount of data in the form of streams and mining useful knowledge from these streams is a challenging task. Existing works generate sensor association rules using occurrence frequency of patterns with binary frequency (either absent or present) or support of a pattern as a criterion. However, considering the binary frequency or support of a pattern may not be a sufficient indicator for finding meaningful patterns from WSN data because it only reflects the number of epochs in the sensor data which contain that pattern. The share measure of sensorsets could discover useful knowledge about numerical values associated with sensor in a sensor database. Therefore, in this paper, we propose a new type of behavioral pattern called share-frequent sensor patterns by considering the non-binary frequency values of sensors in epochs. To discover share-frequent sensor patterns from sensor dataset, we propose a novel parallel technique. In this technique, we develop a novel tree structure, called parallel share-frequent sensor pattern tree (PShrFSP-tree) that is constructed at each local node independently, by capturing the database contents to generate the candidate patterns using a pattern growth technique with a single scan and then merges the locally generated candidate patterns at the final stage to generate global share-frequent sensor patterns. Comprehensive experimental results show that our proposed model is very efficient for mining share-frequent patterns from WSN data in terms of time and scalability.
Predicting Hot-Spots in distributed cloud databases using association rule mining
- Authors: Kaml, Joarder , Murshed, Manzur , Gaber, Mohamed
- Date: 2014
- Type: Text , Conference paper
- Relation: IEEE/ACM 7th International Conference Utility and Cloud Computing (UCC), 2014; London; 8-11th December, 2014 p. 800-805
- Full Text: false
- Reviewed:
- Description: Data partitioning is a popular technique to horizontally or vertically split table attributes of a Cloud database cluster to evenly distribute increasing workloads. However, hot-spots can be created due to inappropriate partitioning scheme and static partition management without considering the dynamic workload characteristics. In this paper, an automatic database partition management scheme - APM - is proposed which periodically analyses workload logs to predict the formation of any potential hot-spot using association rule mining. A detailed illustration of the proposed scheme is presented with examples along with a cost model following by experimental observations from running a HBase cluster with YCSB workloads in AWS.
ACSP-Tree: A tree structure for mining behavioral patterns from wireless sensor networks
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2013
- Type: Text , Conference paper
- Relation: IEEE Conference on Local Computer Networks (LCN 2013) (21 October 2013 to 24 October 2013) p. 691-694
- Full Text: false
- Reviewed:
- Description: WSNs generates a large amount of data in the form of stream and mining knowledge from the stream of data can be extremely useful. Association rules mining, from the sensor data, has been studied in recent literature. However, sensor association rules mining often produces a huge number of rules, but most of them either are redundant or fail to reflect the true correlation relationship among data objects. In this paper, we address this problem and propose mining of a new type of sensor behavioral pattern called associated-correlated sensor patterns. The proposed behavioral patterns capture not only association-like co-occurrences but also the substantial temporal correlations implied by such co-occurrences in the sensor data. Here, we also use a prefix tree-based structure called associated-correlated sensor pattern-tree (ACSP-tree), which facilitates frequent pattern (FP) growth-based mining technique to generate all associated-correlated patterns from WSN data with only one scan over the sensor database. Extensive performance study shows that our approach is time and memory efficient in finding associated-correlated patterns than the existing most efficient algorithms.
An efficient classification using support vector machines
- Authors: Ruan, Ning , Chen, Yi , Gao, David
- Date: 2013
- Type: Text , Conference paper
- Relation: Proceedings of 2013 Science and Information Conference, SAI 2013 p. 585-589
- Full Text: false
- Reviewed:
- Description: Support vector machine (SVM) is a popular method for classification in data mining. The canonical duality theory provides a unified analytic solution to a wide range of discrete and continuous problems in global optimization. This paper presents a canonical duality approach for solving support vector machine problem. It is shown that by the canonical duality, these nonconvex and integer optimization problems are equivalent to a unified concave maximization problem over a convex set and hence can be solved efficiently by existing optimization techniques. © 2013 The Science and Information Organization.
Malicious Spam Emails Developments and Authorship Attribution
- Authors: Alazab, Mamoun , Layton, Robert , Broadhurst, Roderic , Bouhours, Brigitte
- Date: 2013
- Type: Text , Conference paper
- Relation: Proceedings - 4th Cybercrime and Trustworthy Computing Workshop, CTC 2013 p. 58-68
- Full Text: false
- Reviewed:
- Description: The Internet is a decentralized structure that offers speedy communication, has a global reach and provides anonymity, a characteristic invaluable for committing illegal activities. In parallel with the spread of the Internet, cybercrime has rapidly evolved from a relatively low volume crime to a common high volume crime. A typical example of such a crime is the spreading of spam emails, where the content of the email tries to entice the recipient to click a URL linking to a malicious Web site or downloading a malicious attachment. Analysts attempting to provide intelligence on spam activities quickly find that the volume of spam circulating daily is overwhelming; therefore, any intelligence gathered is representative of only a small sample, not of the global picture. While past studies have looked at automating some of these analyses using topic-based models, i.e. separating email clusters into groups with similar topics, our preliminary research investigates the usefulness of applying authorship-based models for this purpose. In the first phase, we clustered a set of spam emails using an authorship-based clustering algorithm. In the second phase, we analysed those clusters using a set of linguistic, structural and syntactic features. These analyses reveal that emails within each cluster were likely written by the same author, but that it is unlikely we have managed to group together all spam produced by each group. This problem of high purity with low recall, has been faced in past authorship research. While it is also a limitation of our research, the clusters themselves are still useful for the purposes of automating analysis, because they reduce the work needing to be performed. Our second phase revealed useful information on the group that can be utilized in future research for further analysis of such groups, for example, identifying further linkages behind spam campaigns.
Regularly frequent patterns mining from sensor data stream
- Authors: Rashid, Md. Mamunur , Gondal, Iqbal , Kamruzzaman, Joarder
- Date: 2013
- Type: Text , Conference paper
- Relation: International Conference on Neural Information Processing (ICONIP 2013) p. 417-424
- Full Text: false
- Reviewed:
- Description: Mining interesting and useful knowledge from the huge amount of data gathered in wireless sensor networks is a challenging task. Works reported in literature use support metric-based sensor association rule which employs the occurrence frequency of patterns as criteria. Such criteria may not be appropriate for finding significant patterns. Moreover, temporal regularity in occurrence behavior should be considered as another important measure for assessing the importance of patterns in WSNs. Frequent sensor patterns that occur after regular intervals is called regularly frequent sensor patterns. Even though mining regularly frequent sensor patterns from sensor data stream is extremely important in many real-time applications, no such algorithm has been proposed yet. In this paper, we propose a novel tree structure called Regularly Frequent Sensor Pattern-tree (RSP-tree) and an efficient mining approach for finding regularly frequent sensor patterns from WSNs. Extensive performance analyses show that our technique is time and memory efficient in finding regularly frequent sensor patterns.
Application of optimisation-based data mining techniques to medical data sets: A comparative analysis
- Authors: Dzalilov, Zari , Bagirov, Adil , Mammadov, Musa
- Date: 2012
- Type: Text , Conference paper
- Relation: IMMM 2102: The Second International Conference on Advances in Information Mining and Management p. 41-46
- Full Text: false
- Reviewed:
- Description: Abstract - Computational methods have become an important tool in the analysis of medical data sets. In this paper, we apply three optimisation-based data mining methods to the following data sets: (i) a cystic fibrosis data set and (ii) a tobacco control data set. Three algorithms used in the analysis of these data sets include: the modified linear least square fit, an optimization based heuristic algorithm for feature selection and an optimization based clustering algorithm. All these methods explore the relationship between features and classes, with the aim of determining contribution of specific features to the class outcome. However, the three algorithms are based on completely different approaches. We apply these methods to solve feature selection and classification problems. We also present comparative analysis of the algorithms using computational results. Results obtained confirm that these algorithms may be effectively applied to the analysis of other (bio)medical data sets
From convex to nonconvex: A loss function analysis for binary classification
- Authors: Zhao, Lei , Mammadov, Musa , Yearwood, John
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at10th IEEE International Conference on Data Mining Workshops, ICDMW 2010 p. 1281-1288
- Full Text:
- Reviewed:
- Description: Problems of data classification can be studied in the framework of regularization theory as ill-posed problems. In this framework, loss functions play an important role in the application of regularization theory to classification. In this paper, we review some important convex loss functions, including hinge loss, square loss, modified square loss, exponential loss, logistic regression loss, as well as some non-convex loss functions, such as sigmoid loss, ø-loss, ramp loss, normalized sigmoid loss, and the loss function of 2 layer neural network. Based on the analysis of these loss functions, we propose a new differentiable non-convex loss function, called smoothed 0-1 loss function, which is a natural approximation of the 0-1 loss function. To compare the performance of different loss functions, we propose two binary classification algorithms for binary classification, one for convex loss functions, the other for non-convex loss functions. A set of experiments are launched on several binary data sets from the UCI repository. The results show that the proposed smoothed 0-1 loss function is robust, especially for those noisy data sets with many outliers. © 2010 IEEE.
Identifying rootkit infections using data mining
- Authors: Lobo, Desmond , Watters, Paul , Wu, Xinwen
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 2010 International Conference on Information Science and Applications, ICISA 2010, Seoul, Korea : p. 1-7
- Full Text:
- Description: Rootkits refer to software that is used to hide the presence and activity of malware and permit an attacker to take control of a computer system. In our previous work, we focused strictly on identifying rootkits that use inline function hooking techniques to remain hidden. In this paper, we extend our previous work by including rootkits that use other types of hooking techniques, such as those that hook the IATs (Import Address Tables) and SSDTs (System Service Descriptor Tables). Unlike other malware identification techniques, our approach involved conducting dynamic analyses of various rootkits and then determining the family of each rootkit based on the hooks that had been created on the system. We demonstrated the effectiveness of this approach by first using the CLOPE (Clustering with sLOPE) algorithm to cluster a sample of rootkits into several families; next, the ID3 (Iterative Dichotomiser 3) algorithm was utilized to generate a decision tree for identifying the rootkit that had infected a machine. ©2010 IEEE.
RBACS : Rootkit behavioral analysis and classification system
- Authors: Lobo, Desmond , Watters, Paul , Wu, Xinwen
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 3rd International Conference on Knowledge Discovery and Data Mining, WKDD 2010, Phuket : 9th-10th January 2010 p. 75-80
- Full Text:
- Description: In this paper, we focus on rootkits, a special type of malicious software (malware) that operates in an obfuscated and stealthy mode to evade detection. Categorizing these rootkits will help in detecting future attacks against the business community. We first developed a theoretical framework for classifying rootkits. Based on our theoretical framework, we then proposed a new rootkit classification system and tested our system on a sample of rootkits that use inline function hooking. Our experimental results showed that our system could successfully categorize the sample using unsupervised clustering. © 2010 IEEE.
A classification algorithm that derives weighted sum scores for insight into disease
- Authors: Quinn, Anthony , Stranieri, Andrew , Yearwood, John , Hafen, Gaudenz
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at Third Australasian Workshop on Health Informatics and Knowledge Management (HIKM 2009), Wellington, New Zealand : Vol. 97, p. 13-17
- Full Text:
- Description: Data mining is often performed with datasets associated with diseases in order to increase insights that can ultimately lead to improved prevention or treatment. Classification algorithms can achieve high levels of predictive accuracy but have limited application for facilitating the insight that leads to deeper understanding of aspects of the disease. This is because the representation of knowledge that arises from classification algorithms is too opaque, too complex or too sparse to facilitate insight. Clustering, association and visualisation approaches enable greater scope for clinicians to be engaged in a way that leads to insight, however predictive accuracy is compromised or non-existent. This research investigates the practical applications of Automated Weighted Sum, (AWSum), a classification algorithm that provides accuracy comparable to other techniques whilst providing some insight into the data. This is achieved by calculating a weight for each feature value that represents its influence on the class value. Clinicians are very familiar with weighted sum scoring scales so the internal representation is intuitive and easily understood. This paper presents results from the use of the AWSum approach with data from patients suffering from Cystic Fibrosis.
AWSum - applying data mining in a health care scenario
- Authors: Quinn, Anthony , Jelinek, Herbert , Stranieri, Andrew , Yearwood, John
- Date: 2008
- Type: Text , Conference paper
- Relation: Paper presented at International Conference on Intelligent Sensors, Sensor Networks and Information Processing, ISSNIP 2008, Sydney, New South Wales : 15th-18th December 2008 p. 291-296
- Full Text:
- Description: This paper investigates the application of a new data mining algorithm called Automated Weighted Sum, (AWSum), to diabetes screening data to explore its use in providing researchers with new insight into the disease and secondarily to explore the potential the algorithm has for the generation of prognostic models for clinical use. There are many data mining classifiers that produce high levels of predictive accuracy but their application to health research and clinical applications is limited because they are complex, produce results that are difficult to interpret and are difficult to integrate with current knowledge and practises. This is because most focus on accuracy at the expense of informing the user as to the influences that lead to their classification results. By providing this information on influences a researcher can be pointed to new potentially interesting avenues for investigation. AWSum measures influence by calculating a weight for each feature value that represents its influence on a class value relative to other class values. The results produced, although on limited data, indicated the approach has potential uses for research and has some characteristics that may be useful in the future development of prognostic models.
- Description: 2003006660
Experimental investigation of clasification algorithms for ITS dataset
- Authors: Yearwood, John , Kang, Byeongho , Kelarev, Andrei
- Date: 2008
- Type: Text , Conference paper
- Relation: PKAW-08, Pacific Rim Knowledge Acquisition Workshop 2008, as part of PRICAI 2008, Tenth Pacific Rim p. 262-272
- Full Text: false
- Reviewed:
- Description: This article is devoted to experimental investigation of classification algorithms for analysis of ITS dataset. We introduce and consider a novel k-committees alogorithm for classification and compare it with the discrete k- means and nearest neighbour algorithms. The ITS dataset consists of nuclear ribosomal DNA sequences, where rather sophisticated alignment scores have to be used as a measure of distance. These scores do not form Minkowski metric and the sequences cannot be regarded as points in a finite dimensional space. This is why it is necessary to develop novel algorithms and adjust familiar ones. We present the results of experiments comparing the efficiency of three classification methods in their ability to achieve agreement with classes published in the biological literature before. It turns out that our algorithms are efficient and can be used to obtain biologically significant classifications. A simplified version of a synthetic dataset, where the k-committees classifier out performs k-means and Nearest Neighbour classifiers, is also presented.
- Description: E1
Isolation forest
- Authors: Liu, Fei , Ting, Kaiming , Zhou, Zhi-Hua
- Date: 2008
- Type: Text , Conference paper
- Relation: Proceedings of the Eighth IEEE International Conference on Data Mining p. 413-422
- Full Text: false
- Reviewed:
- Description: Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiles normal points. To our best knowledge, the concept of isolation has not been explored in current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest performs favourably to ORCA, a near-linear time complexity distance-based method, LOF and random forests in terms of AUC and processing time, and especially in large data sets. iForest also works well in high dimensional problems which have a large number of irrelevant attributes, and in situations where training set does not contain any anomalies.