Mp-dissimilarity: A data-dependent dissimilarity measure
- Authors: Aryal, Sunil; Ting, Kaiming; Haffari, Gholamreza; Washio, Takashi
- Date: 2014
- Type: Text, Conference paper
- Relation: 14th IEEE International Conference on Data Mining (2014 ICDM); Shenzhen, China; 14th-17th December 2014, p. 707-712
- Full Text: false
- Reviewed:
- Description: Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high dimensional space is still a challenging task, because the effectiveness of many dissimilarity measures that are based on a geometric model, such as lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as a probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though the two pairs have the same geometric distance. Our empirical results show that the proposed dissimilarity measure indeed provides a reliable nearest neighbour search in high dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task-specific performance than lp-norm and cosine distance in classification and information retrieval tasks.
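The per-dimension probability-mass idea described in the abstract can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation; the function name, the inclusive interval, and the lp-norm-style combination over dimensions are assumptions.

```python
import numpy as np

def mp_dissimilarity(x, y, data, p=2):
    """Sketch of mass-based dissimilarity: in each dimension, the
    dissimilarity is the fraction of data points falling inside the
    interval that encloses x and y; the per-dimension masses are then
    combined in an lp-norm-like fashion."""
    n, d = data.shape
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        # probability mass of the region [lo, hi] in dimension i
        mass = np.sum((data[:, i] >= lo) & (data[:, i] <= hi)) / n
        total += mass ** p
    return (total / d) ** (1.0 / p)
```

On toy data such as `[[0, 0], [0.1, 0.1], [0.2, 0.2], [5, 5]]`, the pair (0, 0)-(0.2, 0.2) in the dense region scores as more dissimilar than the pair (4.8, 4.8)-(5.0, 5.0) in the sparse region, even though both pairs are the same geometric distance apart, which is exactly the behaviour the abstract describes.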
Beyond tf-idf and cosine distance in documents dissimilarity measure
- Authors: Aryal, Sunil; Ting, Kaiming; Haffari, Gholamreza; Washio, Takashi
- Date: 2015
- Type: Text, Conference proceedings
- Relation: Asia Information Retrieval Symposium 2015 - Queensland University of Technology, Brisbane, Australia, 2nd-4th Dec, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 9460, p. 400-406
- Full Text: false
- Reviewed:
Half-space mass: a maximally robust and efficient data depth method
- Authors: Chen, Bo; Ting, Kaiming; Washio, Takashi; Haffari, Gholamreza
- Date: 2015
- Type: Text, Journal article
- Relation: Machine Learning Vol. 100, no. 2-3 (2015), p. 677-699
- Full Text: false
- Reviewed:
- Description: Data depth is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. While there is considerable academic interest, its applications are hampered by the lack of a method which is both robust and efficient. This paper introduces Half-Space Mass, a significantly improved version of half-space data depth. To our knowledge, Half-Space Mass is the only data depth method which is both robust and efficient. We also reveal four theoretical properties of Half-Space Mass: (i) its resultant mass distribution is concave regardless of the underlying density distribution, (ii) its maximum point is unique and can be considered the median, (iii) the median is maximally robust, and (iv) its estimation extends to a higher dimensional space in which the convex hull of the dataset occupies zero volume. We demonstrate the power of Half-Space Mass through its applications in two tasks. In anomaly detection, being a maximally robust location estimator leads directly to a robust anomaly detector that yields better detection accuracy than half-space depth, and it runs orders of magnitude faster than L2 depth, an existing maximally robust location estimator. In clustering, the Half-Space Mass version of K-means overcomes three weaknesses of K-means.
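The center-outward ranking described above can be estimated by a simple randomized scheme: repeatedly draw a random direction and a random split of the projected data, and average the fraction of data lying in the same half-space as the query point. This sketch is an illustrative simplification, not the paper's algorithm; the function name, the direction sampling, and the choice of split range are all assumptions.

```python
import numpy as np

def half_space_mass(data, queries, n_trials=1000, rng=None):
    """Randomized estimate of a half-space-mass-style depth score.
    For each trial: project the data onto a random direction, pick a
    random split point in the projected range, and credit each query
    with the fraction of data on its own side of the split. The score
    is the average over trials (higher = deeper / more central)."""
    rng = np.random.default_rng(rng)
    n, d = data.shape
    scores = np.zeros(len(queries))
    for _ in range(n_trials):
        w = rng.normal(size=d)            # random direction
        proj = data @ w
        s = rng.uniform(proj.min(), proj.max())  # random split point
        left_frac = np.mean(proj < s)     # data mass in the left half-space
        q = queries @ w
        scores += np.where(q < s, left_frac, 1.0 - left_frac)
    return scores / n_trials
```

A central point sits on the heavier side of most random splits, so it accumulates a higher average mass than a far-away outlier, which is the behaviour that makes this a robust location estimator.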
Revisiting attribute independence assumption in probabilistic unsupervised anomaly detection
- Authors: Aryal, Sunil; Ting, Kaiming; Haffari, Gholamreza
- Date: 2016
- Type: Text, Conference proceedings
- Relation: 11th Pacific Asia Workshop on Intelligence and Security Informatics, PAISI 2016 - Auckland, New Zealand, 19th April 2016, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 9650, p. 73-86
- Full Text: false
- Reviewed:
- Description: In this paper, we revisit the simple probabilistic approach to unsupervised anomaly detection that estimates the multivariate probability as a product of univariate probabilities, assuming attributes are generated independently. We show that this simple traditional approach performs competitively with, or better than, five state-of-the-art unsupervised anomaly detection methods across a wide range of data sets from categorical, numeric or mixed domains. It is arguably the fastest anomaly detector: one order of magnitude faster than the fastest state-of-the-art method in high dimensional data sets.
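On categorical data, the product-of-univariate-probabilities idea reduces to a few lines. This is a bare-bones sketch of the idea in the abstract, not the paper's estimator; the function name is an assumption, and the negative log of the product is used so that higher scores mean more anomalous.

```python
import numpy as np
from collections import Counter

def independence_scores(data):
    """Score each instance by the negative log of the product of its
    univariate (per-attribute) probabilities, with each attribute's
    value frequencies estimated from the data itself. Attributes are
    treated as independent, as in the abstract."""
    n, d = data.shape
    scores = np.zeros(n)
    for j in range(d):
        counts = Counter(data[:, j])
        # empirical probability of each instance's value in attribute j
        probs = np.array([counts[v] / n for v in data[:, j]])
        scores -= np.log(probs)
    return scores
```

Because the score is a sum over attributes of per-value frequencies, one pass per attribute suffices, which is what makes this family of detectors so fast relative to distance- or density-based methods.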
Data-dependent dissimilarity measure: An effective alternative to geometric distance measures
- Authors: Aryal, Sunil; Ting, Kaiming; Washio, Takashi; Haffari, Gholamreza
- Date: 2017
- Type: Text, Journal article
- Relation: Knowledge and Information Systems Vol. 53, no. 2 (2017), p. 479-506
- Full Text: false
- Reviewed:
- Description: Nearest neighbor search is a core process in many data mining algorithms. Finding reliable closest matches of a test instance is still a challenging task as the effectiveness of many general-purpose distance measures such as ℓp-norm decreases as the number of dimensions increases. Their performance varies significantly across data distributions. This is mainly because they compute the distance between two instances solely from their geometric positions in the feature space; the data distribution has no influence on the distance measure. This paper presents a simple data-dependent general-purpose dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances as a probability mass in a region that encloses the two instances in every dimension. It deems two instances in a sparse region to be more similar than two instances of equal inter-point geometric distance in a dense region. Our empirical results in k-NN classification and content-based multimedia information retrieval tasks show that the proposed mp-dissimilarity measure produces better task-specific performance than existing widely used general-purpose distance measures such as ℓp-norm and cosine distance across a wide range of moderate- to high-dimensional data sets with continuous only, discrete only, and mixed attributes.
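A point worth making concrete about the k-NN experiments described above: a dissimilarity measure, geometric or data-dependent, enters a k-NN classifier through a single pluggable callable. This generic sketch (names are mine, not from the paper) shows that plug-in point, with Euclidean distance as a stand-in that could be swapped for a mass-based measure.

```python
import numpy as np
from collections import Counter

def knn_predict(query, train_X, train_y, dissim, k=3):
    """Rank training instances by the supplied dissimilarity and take a
    majority vote among the k nearest. Replacing lp-norm with a
    data-dependent measure changes only the `dissim` callable."""
    dists = [dissim(query, x) for x in train_X]
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Euclidean distance as a stand-in dissimilarity
euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
```

Keeping the measure behind a callable like this is also how one would reproduce the paper's comparison: run the same classifier with ℓp-norm, cosine distance, and the data-dependent measure, varying nothing else.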