A new dimensionality-unbiased score for efficient and effective outlying aspect mining
- Samariya, Durgesh, Ma, Jiangang
- Authors: Samariya, Durgesh , Ma, Jiangang
- Date: 2022
- Type: Text , Journal article
- Relation: Data Science and Engineering Vol. 7, no. 2 (2022), p. 120-135
- Full Text:
- Reviewed:
- Description: The main aim of an outlying aspect mining algorithm is to automatically detect the subspace(s) (a.k.a. aspect(s)) in which a given data point is dramatically different from the rest of the data. To rank the subspaces for a given data point, a scoring measure is required to compute the outlying degree of the given data point in each subspace. In this paper, we introduce a new measure to compute outlying degree, called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which not only detects the outliers but also provides an explanation of why the selected point is an outlier. SiNNE is a dimensionally unbiased measure in its raw form, which means the scores produced by SiNNE can be compared directly across subspaces of different dimensionality. Thus, it does not require any normalization to make the score unbiased. Our experimental results on synthetic and publicly available real-world datasets revealed that (i) SiNNE produces better or at least the same results as existing scores, and (ii) it improves the run time of the existing outlying aspect mining algorithm based on beam search by at least two orders of magnitude. SiNNE allows the existing outlying aspect mining algorithm to run on datasets with hundreds of thousands of instances and thousands of dimensions, which was not possible before. © 2022, The Author(s).
A new effective and efficient measure for outlying aspect mining
- Samariya, Durgesh, Aryal, Sunil, Ting, Kai, Ma, Jiangang
- Authors: Samariya, Durgesh , Aryal, Sunil , Ting, Kai , Ma, Jiangang
- Date: 2020
- Type: Text , Conference paper
- Relation: 21st International Conference on Web Information Systems Engineering, WISE 2020, Amsterdam, 20-24 October 2020, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 12343 LNCS, p. 463-474
- Full Text: false
- Reviewed:
- Description: Outlying Aspect Mining (OAM) aims to find the subspaces (a.k.a. aspects) in which a given query is an outlier with respect to a given data set. Existing OAM algorithms use traditional distance/density-based outlier scores to rank subspaces. Because these distance/density-based scores depend on the dimensionality of subspaces, they cannot be compared directly between subspaces of different dimensionality. Z-score normalisation has been used to make them comparable. It requires computing the outlier scores of all instances in each subspace. This adds significant computational overhead on top of already expensive density estimation, making OAM algorithms infeasible to run on large and/or high-dimensional datasets. We also discover that Z-score normalisation is inappropriate for OAM in some cases. In this paper, we introduce a new score called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which is independent of the dimensionality of subspaces. This enables the scores in subspaces with different dimensionalities to be compared directly without any additional normalisation. Our experimental results revealed that SiNNE produces better or at least the same results as existing scores, and that it significantly improves the runtime of an existing OAM algorithm based on beam search. © 2020, Springer Nature Switzerland AG.
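The Z-score normalisation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the k-nearest-neighbour distance is used here only as a stand-in for a generic distance-based outlier score, and the function names `knn_distance` and `z_score`, as well as the toy data, are hypothetical. The sketch shows why the normalisation is costly: the raw score of every instance must be computed in each subspace before the query's score can be standardised.

```python
import math
import statistics


def knn_distance(data, idx, subspace, k=1):
    """Raw outlier score (assumed stand-in): distance from point `idx` to its
    k-th nearest neighbour, measured only on the feature indices in `subspace`."""
    q = data[idx]
    dists = sorted(
        math.sqrt(sum((q[j] - p[j]) ** 2 for j in subspace))
        for i, p in enumerate(data)
        if i != idx
    )
    return dists[k - 1]


def z_score(data, query_idx, subspace, k=1):
    """Standardise the query's raw score against the raw scores of ALL points
    in the same subspace. This needs n raw scores, each costing O(n) distances."""
    scores = [knn_distance(data, i, subspace, k) for i in range(len(data))]
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return (scores[query_idx] - mean) / std if std > 0 else 0.0


# Hypothetical toy data: the last point is far from the others on feature 0.
data = [(0.0, 0.0), (0.1, 1.0), (0.2, 2.0), (0.3, 3.0), (5.0, 4.0)]
query = 4

raw_1d = knn_distance(data, query, subspace=(0,))
raw_2d = knn_distance(data, query, subspace=(0, 1))
z_1d = z_score(data, query, subspace=(0,))
z_2d = z_score(data, query, subspace=(0, 1))
```

In this toy case the raw distance grows when the second feature is added, while the standardised scores stay on a common scale. The quadratic number of distance computations per subspace is the overhead that a score comparable in its raw form would avoid.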
An efficient framework for mining outlying aspects
- Authors: Samariya, Durgesh
- Date: 2023
- Type: Text , Thesis , PhD
- Full Text:
- Description: In the era of big data, an immense volume of information is being continuously generated. It is common to encounter errors or anomalies within datasets. These anomalies can arise due to system malfunctions or human errors, resulting in data points that deviate from expected patterns or values. Anomaly detection algorithms have been developed to identify such anomalies effectively from the data. However, these algorithms often fall short in providing insights into why a particular data point is considered an anomaly. They cannot explain the specific feature subset(s) in which a data point significantly differs from the majority of the data. To address this limitation, researchers have recently turned their attention to a new research area called outlying aspect mining. This area focuses on discovering feature subset(s), known as aspects or subspaces, in which anomalous data points exhibit significant deviations from the remaining set of data. Outlying aspect mining algorithms aim to provide a more detailed understanding of the characteristics that make a data point anomalous. Outlying aspect mining is an emerging area of research, and only a few studies have been published so far. One of the key challenges in this field is scaling up these algorithms to handle large datasets, characterised by either a large data size or high dimensionality. Many existing outlying aspect mining algorithms are not well-suited for such datasets, as they exhaustively enumerate all possible subspaces and utilise density or distance-based anomaly scores to rank subspaces. As a result, most of these algorithms struggle to handle datasets with more than 20 dimensions. Addressing the scalability issue and developing efficient algorithms for outlying aspect mining in large datasets remains an active area of research.
The ability to identify and understand the specific feature subsets contributing to anomalies in big data holds great potential for various applications, including fraud detection, network intrusion detection, and anomaly-based decision support systems. Existing outlying aspect mining methods suffer from three main problems. Firstly, their scoring measures often rely on distance or density-based calculations, which are biased with respect to the dimensionality of the subspace. As the dimensionality of the subspace increases, the density tends to decrease, making it difficult to assess the outlyingness of data points within specific subspaces accurately. Secondly, distance and density-based measures are computationally expensive because they require computing pairwise distances, especially when dealing with large-scale datasets that contain millions of data points. In addition, existing work uses Z-score normalisation to make density-based scoring measures dimensionally unbiased. However, this adds computational overhead to already computationally expensive measures. Lastly, existing outlying aspect mining algorithms use brute-force methods to search subspaces. It is essential to tackle this efficiency issue because, when the dimensionality of the data is high, the number of candidate subspaces grows exponentially, which is beyond available computational resources. This research project aims to solve these challenges by developing efficient and effective methods for mining outlying aspects in high-dimensional and large datasets. I have explored and designed different scoring measures to find the outlyingness of a given data point in each subspace. The effectiveness and efficiency of these proposed measures have been verified with extensive experiments on synthetic and real-world datasets.
To overcome the first problem, this thesis first identifies and analyses the condition under which Z-score normalisation of a scoring measure fails to find the most outlying aspects, and proposes two different measures called HMass and sGrid++. Both measures are dimensionally unbiased in their raw form, which means they do not require any additional normalisation. sGrid++ is a simpler version of sGrid that is not only efficient and effective but also dimensionally unbiased. It does not require Z-score normalisation. HMass is a simple but effective and efficient histogram-based solution to rank outlying aspects of a given query in each subspace. In addition to detecting anomalies, HMass provides explanations of why the points are anomalous. Neither sGrid++ nor HMass requires pairwise calculations like distance or density-based measures; therefore, both are computationally faster than distance and density-based measures, which solves the second issue of existing work. The effectiveness and efficiency of both sGrid++ and HMass are evaluated using synthetic and real-world datasets. In addition, I present an application of outlying aspect mining in the cybersecurity domain. To tackle the third problem, this thesis proposes an efficient and effective outlying aspect mining framework named OIMiner (for Outlying - Inlying Aspect Miner). It introduces a new scoring measure to compute outlying degree, called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which not only detects the outliers but also provides an explanation of why the selected point is an outlier. SiNNE is a dimensionally unbiased measure in its raw form, which means the scores produced by SiNNE can be compared directly across subspaces of different dimensionality. Thus, it does not require any normalisation to make the score unbiased.
Our experimental results on synthetic and publicly available real-world datasets revealed that (i) SiNNE produces better or at least the same results as existing scores. (ii) It improves the run time of the existing outlying aspect mining algorithm based on beam search by at least two orders of magnitude. SiNNE allows the existing outlying aspect mining algorithm to run in datasets with hundreds of thousands of instances and thousands of dimensions, which was not possible before.
- Description: Doctor of Philosophy
Anomaly detection on health data
- Samariya, Durgesh, Ma, Jiangang
- Authors: Samariya, Durgesh , Ma, Jiangang
- Date: 2022
- Type: Text , Conference paper
- Relation: 11th International Conference on Health Information Science, HIS 2022, Virtual, Online, 28-30 October 2022, Proceedings Vol. 13705 LNCS, p. 34-41
- Full Text: false
- Reviewed:
- Description: The identification of anomalous records in medical data is an important problem with numerous applications, such as detecting anomalous readings and anomalous patient health conditions, health insurance fraud detection, and fault detection in mechanical components. This paper compares the performance of seven state-of-the-art anomaly detection algorithms in detecting anomalies in healthcare data. Our experimental results on six datasets show that the state-of-the-art isolation-based method iForest has better overall performance in terms of AUC and runtime. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Mining outlying aspects on healthcare data
- Samariya, Durgesh, Ma, Jiangang
- Authors: Samariya, Durgesh , Ma, Jiangang
- Date: 2021
- Type: Text , Conference paper
- Relation: 10th International Conference on Health Information Science, HIS 2021, Melbourne, 25-28 October 2021, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 13079 LNCS, p. 160-170
- Full Text: false
- Reviewed:
- Description: Machine learning and artificial intelligence have a wide range of applications in the medical domain, such as detecting anomalous readings and anomalous patient health conditions. Many algorithms have been developed to solve this problem. However, they fail to explain why those entries are considered outliers. This research gap leads to the outlying aspect mining problem, which aims to discover the set of features (a.k.a. subspace) in which a given data point is dramatically different from the others. In this paper, we present an application of outlying aspect mining in the medical domain. The paper aims to effectively and efficiently identify outlying aspects using different outlying aspect mining algorithms and to evaluate their performance on different real-world healthcare datasets. The experimental results show that the latest isolation-based outlying aspect mining measure, SiNNE, performs outstandingly on this task and yields promising results. © 2021, Springer Nature Switzerland AG.