Cancer classification utilizing voting classifier with ensemble feature selection method and transcriptomic data
- Authors: Khatun, Rabea , Akter, Maksuda , Islam, Md Manowarul , Uddin, Md Ashraf , Talukder, Md Alamin , Kamruzzaman, Joarder , Azad, Akm , Paul, Bikash , Almoyad, Muhammad , Aryal, Sunil , Moni, Mohammad
- Date: 2023
- Type: Text , Journal article
- Relation: Genes Vol. 14, no. 9 (2023), p.
- Full Text:
- Reviewed:
- Description: Biomarker-based cancer identification and classification tools are widely used in bioinformatics and machine learning fields. However, the high dimensionality of microarray gene expression data poses a challenge for identifying important genes in cancer diagnosis. Many feature selection algorithms optimize cancer diagnosis by selecting optimal features. This article proposes an ensemble rank-based feature selection method (EFSM) and an ensemble weighted average voting classifier (VT) to overcome this challenge. The EFSM uses a ranking method that aggregates features from individual selection methods to efficiently discover the most relevant and useful features. The VT combines support vector machine, k-nearest neighbor, and decision tree algorithms to create an ensemble model. The proposed method was tested on three benchmark datasets and compared to existing built-in ensemble models. The results show that our model achieved higher accuracy, with 100% for leukaemia, 94.74% for colon cancer, and 94.34% for the 11-tumor dataset. This study concludes by identifying a subset of the most important cancer-causing genes and demonstrating their significance compared to the original data. The proposed approach surpasses existing strategies in accuracy and stability, significantly impacting the development of ML-based gene analysis. It detects vital genes with higher precision and stability than other existing methods. © 2023 by the authors.
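- Illustration: The pipeline described above, reduced to a minimal sketch: ranks from several univariate feature selectors are aggregated, and the top-ranked features feed a weighted soft-voting ensemble of SVM, k-NN and decision tree. The choice of rankers, the number of retained features, and the voting weights are assumptions for illustration, not the EFSM/VT configuration reported in the paper.
```python
# Illustrative sketch of rank-based ensemble feature selection followed by a
# weighted voting classifier (SVM + k-NN + decision tree). The rankers, the
# number of selected genes and the voting weights are assumptions for the demo.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a high-dimensional gene-expression matrix.
X, y = make_classification(n_samples=200, n_features=500, n_informative=20, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

def ensemble_rank_selection(X, y, n_features=30):
    """Aggregate per-feature ranks from several selectors and keep the best-ranked features."""
    scores = [chi2(X, y)[0], f_classif(X, y)[0], mutual_info_classif(X, y, random_state=0)]
    ranks = [np.argsort(np.argsort(-s)) for s in scores]  # rank per selector (0 = best)
    aggregate = np.sum(ranks, axis=0)                     # sum of ranks across selectors
    return np.argsort(aggregate)[:n_features]

selected = ensemble_rank_selection(X, y)
X_train, X_test, y_train, y_test = train_test_split(X[:, selected], y, test_size=0.3, random_state=0)

voter = VotingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier(n_neighbors=5)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft",
    weights=[2, 1, 1],  # illustrative weighted-average voting
)
voter.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, voter.predict(X_test)))
```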
Detection and explanation of anomalies in healthcare data
- Authors: Samariya, Durgesh , Ma, Jiangang , Aryal, Sunil , Zhao, Xiaohui
- Date: 2023
- Type: Text , Journal article
- Relation: Health Information Science and Systems Vol. 11, no. 1 (2023), p. 20-20
- Full Text: false
- Reviewed:
- Description: The growth of databases in the healthcare domain opens multiple doors for machine learning and artificial intelligence technology. Although many medical devices are available, medical errors remain a severe challenge. Different algorithms have been developed to identify and address such errors, for example by detecting anomalous readings or anomalous health conditions of a patient. However, they fail to answer why those entries are considered anomalies. This research gap leads to the problem of outlying aspect mining, which aims to discover the set of features (a.k.a. subspace) in which a given data point is dramatically different from others. In this paper, we present a framework that detects anomalies in healthcare data and then explains why they are anomalies by detecting their outlying aspects, aiming to do both effectively and efficiently. First, we revisit four anomaly detection techniques and outlying aspect mining algorithms. Then, we evaluate the performance of the anomaly detection techniques and choose the best anomaly detection algorithm. Next, we take the top-k anomalies as queries and detect their outlying aspects. Lastly, we evaluate performance on 16 real-world healthcare datasets. The experimental results show that the latest isolation-based outlying aspect mining measure, SiNNE, has outstanding performance on this task and shows promising results.
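- Illustration: A minimal sketch of the two-stage framework described above: rank points with an anomaly detector, take the top-k as queries, then search low-dimensional subspaces in which each query is most isolated. IsolationForest and the crude density-ratio score below are stand-ins; the paper compares several detectors and uses the SiNNE measure for the outlying-aspect step, which is not reproduced here.
```python
# Stage 1: detect the top-k anomalies. Stage 2: for each one, find the subspace
# in which it deviates most. The detector and the subspace score are stand-ins.
from itertools import combinations
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # stand-in for a healthcare dataset
X[:3, [1, 3]] += 6.0                   # plant anomalies in features 1 and 3

# Stage 1: rank points by an anomaly detector and take the top-k as queries.
scores = -IsolationForest(random_state=0).fit(X).score_samples(X)
queries = np.argsort(scores)[-3:]

# Stage 2: for each query, find the 2-D subspace where its density is lowest
# relative to the rest of the data (a crude outlying-aspect score).
def outlying_aspect(X, q, dim=2):
    best, best_score = None, np.inf
    for subspace in combinations(range(X.shape[1]), dim):
        Xs = X[:, subspace]
        dens = np.exp(KernelDensity(bandwidth=0.5).fit(Xs).score_samples(Xs))
        score = dens[q] / dens.mean()     # low ratio = query is isolated here
        if score < best_score:
            best, best_score = subspace, score
    return best, best_score

for q in queries:
    print(f"anomaly {q}: most outlying subspace {outlying_aspect(X, q)[0]}")
```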
Levels of explainable artificial intelligence for human-aligned conversational explanations
- Authors: Dazeley, Richard , Vamplew, Peter , Foale, Cameron , Young, Cameron , Aryal, Sunil , Cruz, Francisco
- Date: 2021
- Type: Text , Journal article
- Relation: Artificial Intelligence Vol. 299, no. (2021), p.
- Full Text:
- Reviewed:
- Description: Over the last few years there has been rapid research growth into eXplainable Artificial Intelligence (XAI) and the closely aligned Interpretable Machine Learning (IML). Drivers for this growth include recent legislative changes and increased investments by industry and governments, along with increased concern from the general public. People are affected by autonomous decisions every day and the public need to understand the decision-making process to accept the outcomes. However, the vast majority of the applications of XAI/IML are focused on providing low-level ‘narrow’ explanations of how an individual decision was reached based on a particular datum. While important, these explanations rarely provide insights into an agent's: beliefs and motivations; hypotheses of other (human, animal or AI) agents' intentions; interpretation of external cultural expectations; or, processes used to generate its own explanation. Yet all of these factors, we propose, are essential to providing the explanatory depth that people require to accept and trust the AI's decision-making. This paper aims to define levels of explanation and describe how they can be integrated to create a human-aligned conversational explanation system. In so doing, this paper will survey current approaches and discuss the integration of different technologies to achieve these levels with Broad eXplainable Artificial Intelligence (Broad-XAI), and thereby move towards high-level ‘strong’ explanations. © 2021 Elsevier B.V.
Simple supervised dissimilarity measure : bolstering iForest-induced similarity with class information without learning
- Authors: Wells, Jonathan , Aryal, Sunil , Ting, Kai
- Date: 2020
- Type: Text , Journal article
- Relation: Knowledge and Information Systems Vol. 62, no. 8 (2020), p. 3203-3216
- Full Text: false
- Reviewed:
- Description: Existing distance metric learning methods require optimisation to learn a feature space to transform data, which makes them computationally expensive in large datasets. In classification tasks, they make use of class information to learn an appropriate feature space. In this paper, we present a simple supervised dissimilarity measure which does not require learning or optimisation. It uses class information to measure dissimilarity of two data instances in the input space directly. It is a supervised version of an existing data-dependent dissimilarity measure called m_e. Our empirical results in k-NN and LVQ classification tasks show that the proposed simple supervised dissimilarity measure generally produces predictive accuracy better than or at least as good as existing state-of-the-art supervised and unsupervised dissimilarity measures. © 2020, Springer-Verlag London Ltd., part of Springer Nature.
Data-dependent dissimilarity measure : An effective alternative to geometric distance measures
- Authors: Aryal, Sunil , Ting, Kaiming , Washio, Takashi , Haffari, Gholamreza
- Date: 2017
- Type: Text , Journal article
- Relation: Knowledge and Information Systems Vol. 53, no. 2 (2017), p. 479-506
- Full Text: false
- Reviewed:
- Description: Nearest neighbor search is a core process in many data mining algorithms. Finding reliable closest matches of a test instance is still a challenging task as the effectiveness of many general-purpose distance measures such as the ℓp-norm decreases as the number of dimensions increases. Their performances vary significantly in different data distributions. This is mainly because they compute the distance between two instances solely based on their geometric positions in the feature space, and data distribution has no influence on the distance measure. This paper presents a simple data-dependent general-purpose dissimilarity measure called ‘mp-dissimilarity’. Rather than relying on geometric distance, it measures the dissimilarity between two instances as a probability mass in a region that encloses the two instances in every dimension. It deems two instances in a sparse region to be more similar than two instances of equal inter-point geometric distance in a dense region. Our empirical results in k-NN classification and content-based multimedia information retrieval tasks show that the proposed mp-dissimilarity measure produces better task-specific performance than existing widely used general-purpose distance measures such as the ℓp-norm and cosine distance across a wide range of moderate- to high-dimensional data sets with continuous only, discrete only, and mixed attributes.
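- Illustration: A minimal sketch of a mass-based dissimilarity in the spirit of the abstract: per dimension, it counts the fraction of the data lying between the two instances and combines the per-dimension masses with an Lp-style mean. The region definition and normalisation are simplified relative to the paper's exact mp-dissimilarity.
```python
# Mass-based dissimilarity sketch: instead of geometric distance, measure the
# probability mass of the reference data falling between the two instances in
# each dimension, then combine the per-dimension masses. A simplification of
# the paper's exact definition.
import numpy as np

def mp_dissimilarity(x, y, data, p=2.0):
    lo = np.minimum(x, y)                     # per-dimension region [min, max]
    hi = np.maximum(x, y)
    inside = (data >= lo) & (data <= hi)      # instances falling in each region
    mass = inside.mean(axis=0)                # probability mass per dimension
    return (np.mean(mass ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.2, size=(300, 2)),   # dense cluster
                  rng.normal(5, 2.0, size=(50, 2))])   # sparse cluster
a, b = np.array([0.0, 0.0]), np.array([0.5, 0.5])      # pair in the dense region
c, d = np.array([5.0, 5.0]), np.array([5.5, 5.5])      # equally spaced pair, sparse region
print(mp_dissimilarity(a, b, data))  # larger: many points lie between them
print(mp_dissimilarity(c, d, data))  # smaller: sparse region, pair judged more similar
```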
Defying the gravity of learning curve : A characteristic of nearest neighbour anomaly detectors
- Authors: Ting, Kaiming , Washio, Takashi , Wells, Jonathan , Aryal, Sunil
- Date: 2017
- Type: Text , Journal article
- Relation: Machine Learning Vol. 106, no. 1 (2017), p. 55-91
- Full Text: false
- Reviewed:
- Description: Conventional wisdom in machine learning says that all algorithms are expected to follow the trajectory of a learning curve which is often colloquially referred to as ‘more data the better’. We call this ‘the gravity of learning curve’, and it is assumed that no learning algorithms are ‘gravity-defiant’. Contrary to the conventional wisdom, this paper provides the theoretical analysis and the empirical evidence that nearest neighbour anomaly detectors are gravity-defiant algorithms.
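- Illustration: A small sketch of the 'gravity-defiant' behaviour the paper studies: a nearest-neighbour anomaly detector whose model is only a small random subsample can score anomalies about as well as one built on all the data. The detector (distance to the nearest subsample point) and the synthetic data are illustrative choices, not the paper's experimental setup.
```python
# Compare a 1-NN-distance anomaly detector built on subsamples of different
# sizes against one built on the full data; an illustrative setup only.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
normal = rng.normal(size=(10_000, 2))
anomalies = rng.uniform(-6, 6, size=(50, 2))
X = np.vstack([normal, anomalies])
labels = np.r_[np.zeros(len(normal)), np.ones(len(anomalies))]

for sample_size in (16, 256, len(normal)):
    idx = rng.choice(len(normal), size=sample_size, replace=False)
    nn = NearestNeighbors(n_neighbors=1).fit(normal[idx])
    score = nn.kneighbors(X)[0].ravel()       # 1-NN distance as anomaly score
    print(f"subsample {sample_size:>5}: AUC = {roc_auc_score(labels, score):.3f}")
```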
A generic ensemble approach to estimate multidimensional likelihood in Bayesian classifier learning
- Authors: Aryal, Sunil , Ting, Kaiming
- Date: 2016
- Type: Text , Journal article
- Relation: Computational Intelligence Vol. 32, no. 3 (2016), p. 458-479
- Full Text: false
- Reviewed:
- Description: In Bayesian classifier learning, estimating the joint probability distribution P(x, y) or the likelihood P(x|y) directly from training data is considered to be difficult, especially in large multidimensional data sets. To circumvent this difficulty, existing Bayesian classifiers such as Naive Bayes, BayesNet, and AnDE have focused on estimating simplified surrogates of P(x, y) from different forms of one-dimensional likelihoods. Contrary to the perceived difficulty in multidimensional likelihood estimation, we present a simple generic ensemble approach to estimate the multidimensional likelihood directly from data. The idea is to aggregate P(x|y) estimates obtained from random subsamples of the data. This article presents two ways to estimate multidimensional likelihoods using the proposed generic approach and introduces two new Bayesian classifiers, called ENNBayes and MassBayes, that estimate P(x|y) using a nearest-neighbor density estimation and a probability estimation through feature space partitioning, respectively. Unlike the existing Bayesian classifiers, ENNBayes and MassBayes have constant training time and space complexities and they scale better than existing Bayesian classifiers in very large data sets. Our empirical evaluation shows that ENNBayes and MassBayes yield better predictive accuracy than the existing Bayesian classifiers in benchmark data sets.
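- Illustration: A rough sketch of the ensemble idea described above: the class-conditional likelihood P(x|y) is estimated directly by averaging simple k-NN density estimates computed on random subsamples of each class, and classification follows Bayes' rule. The subsample size, k, and the density estimator are illustrative assumptions, not the exact ENNBayes or MassBayes constructions.
```python
# Ensemble estimation of P(x | y): average k-NN density estimates from several
# random subsamples of each class, then classify by maximising P(x | y) * P(y).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def knn_density(train, query, k=3):
    """Crude k-NN density estimate, proportional to k / (n * r_k^d)."""
    r = NearestNeighbors(n_neighbors=k).fit(train).kneighbors(query)[0][:, -1]
    d = train.shape[1]
    return k / (len(train) * np.maximum(r, 1e-12) ** d)

def predict(X_train, y_train, X_test, n_models=10, sample_size=30, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    priors = np.array([np.mean(y_train == c) for c in classes])
    likelihood = np.zeros((len(X_test), len(classes)))
    for j, c in enumerate(classes):
        Xc = X_train[y_train == c]
        for _ in range(n_models):                 # ensemble of subsample estimates
            idx = rng.choice(len(Xc), size=min(sample_size, len(Xc)), replace=False)
            likelihood[:, j] += knn_density(Xc[idx], X_test) / n_models
    return classes[np.argmax(likelihood * priors, axis=1)]

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("accuracy:", np.mean(predict(X_train, y_train, X_test) == y_test))
```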
DEMass: a new density estimator for big data
- Authors: Ting, Kaiming , Washio, Takashi , Wells, Jonathan , Liu, Fei , Aryal, Sunil
- Date: 2013
- Type: Text , Journal article
- Relation: Knowledge and Information Systems Vol. 35, no. 3 (2013), p. 493-524
- Full Text: false
- Reviewed:
- Description: Density estimation is the ubiquitous base modelling mechanism employed for many tasks including clustering, classification, anomaly detection and information retrieval. Commonly used density estimation methods such as the kernel density estimator and the k-nearest neighbour density estimator have high time and space complexities which render them inapplicable in problems with big data. This weakness sets the fundamental limit in existing algorithms for all these tasks. We propose the first density estimation method, having average-case sub-linear time complexity and constant space complexity in the number of instances, that stretches this fundamental limit to the extent that millions of data instances can now be handled easily and quickly. We provide an asymptotic analysis of the new density estimator and verify the generality of the method by replacing existing density estimators with the new one in three current density-based algorithms, namely DBSCAN, LOF and Bayesian classifiers, representing three different data mining tasks of clustering, anomaly detection and classification. Our empirical evaluation results show that the new density estimation method significantly improves their time and space complexities, while maintaining or improving their task-specific performances in clustering, anomaly detection and classification. The new method empowers these algorithms, currently limited to small data sizes only, to process big data, setting a new benchmark for what density-based algorithms can achieve.
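- Illustration: A toy sketch of mass-based density estimation in the spirit of DEMass: an ensemble of shallow trees randomly bisects the data space, and the density at a point is estimated as the mass of its leaf divided by (n × volume of the leaf), averaged over trees. The split rule, depth, and ensemble size are illustrative choices, not the construction analysed in the paper.
```python
# Toy mass-based density estimator: random axis-aligned space partitioning,
# density = leaf mass / (n * leaf volume), averaged over an ensemble of trees.
import numpy as np

def build_tree(lo, hi, depth, rng):
    """Recursively split a random dimension at a random point within the box."""
    if depth == 0:
        return {"lo": lo, "hi": hi}
    dim = rng.integers(len(lo))
    split = rng.uniform(lo[dim], hi[dim])
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[dim], right_lo[dim] = split, split
    return {"dim": dim, "split": split,
            "left": build_tree(lo, left_hi, depth - 1, rng),
            "right": build_tree(right_lo, hi, depth - 1, rng)}

def leaf_of(tree, x):
    while "dim" in tree:
        tree = tree["left"] if x[tree["dim"]] <= tree["split"] else tree["right"]
    return tree

def mass_based_density(X_train, x, n_trees=25, depth=6, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    est = 0.0
    for _ in range(n_trees):
        leaf = leaf_of(build_tree(lo.copy(), hi.copy(), depth, rng), x)
        inside = np.all((X_train >= leaf["lo"]) & (X_train <= leaf["hi"]), axis=1)
        volume = np.prod(leaf["hi"] - leaf["lo"])
        est += inside.sum() / (len(X_train) * volume) / n_trees
    return est

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
print("density near the mode:", mass_based_density(X, np.array([0.0, 0.0])))
print("density in the tail  :", mass_based_density(X, np.array([3.0, 3.0])))
```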