Rees matrix constructions for clustering of data
- Authors: Kelarev, Andrei , Watters, Paul , Yearwood, John
- Date: 2009
- Type: Journal article
- Relation: Journal of the Australian Mathematical Society Vol. 87, no. 3 (2009), p. 377-393
- Relation: http://purl.org/au-research/grants/arc/DP0211866
- Full Text:
- Reviewed:
- Description: This paper continues the investigation of semigroup constructions motivated by applications in data mining. We give a complete description of the error-correcting capabilities of a large family of clusterers based on Rees matrix semigroups, which are well known in semigroup theory. This result strengthens and complements previous formulas recently obtained in the literature. Examples show that our theorems do not generalize to other classes of semigroups.
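For context, the standard definition of a Rees matrix semigroup (textbook material, not a result of the paper) is:

```latex
% Given a group G, index sets I and \Lambda, and a sandwich matrix
% P = (p_{\lambda i}) with entries in G, the Rees matrix semigroup
% M(G; I, \Lambda; P) is the set I \times G \times \Lambda under
(i, g, \lambda)\,(j, h, \mu) = \bigl(i,\; g\,p_{\lambda j}\,h,\; \mu\bigr)
```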
Establishing phishing provenance using orthographic features
- Authors: Liping, Ma , Yearwood, John , Watters, Paul
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at 2009 eCrime Researchers Summit, eCRIME '09, Tacoma, Washington : 20th-21st October 2009
- Full Text:
- Description: After phishing message detection, determining the provenance of phishing messages and websites is the second step to tracing cybercriminals. In this paper, we present a novel method to cluster phishing emails automatically using orthographic features. In particular, we develop an algorithm to cluster documents and remove redundant features at the same time. After collecting all the possible features based on observation, we apply the modified global k-means method repeatedly, and generate the objective function values over a range of tolerance values across different subsets of features. Finally, we identify the appropriate clusters based on studying the distribution of the objective function values. Experimental evaluation of a large number of computations demonstrates that our clustering and feature selection techniques are highly effective and achieve reliable results.
- Description: 2003007842
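As a minimal sketch of the kind of objective-over-feature-subsets computation the abstract describes (illustrative only; the feature columns and subset choices are hypothetical, not the authors'):

```python
# Illustrative sketch only, not the authors' implementation: run k-means
# over candidate subsets of orthographic features and record the
# clustering objective for each subset.
import numpy as np
from sklearn.cluster import KMeans

def objective_over_subsets(X, k, feature_subsets):
    """Cluster on each candidate feature subset and record the objective
    (sum of squared distances of samples to their nearest centre)."""
    scores = {}
    for subset in feature_subsets:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[:, subset])
        scores[tuple(subset)] = km.inertia_
    return scores

# Toy data: rows = emails, columns = hypothetical orthographic features
rng = np.random.default_rng(0)
X = rng.random((200, 6))
subsets = [[0, 1, 2, 3, 4, 5], [0, 1, 2], [3, 4, 5]]
print(objective_over_subsets(X, k=3, feature_subsets=subsets))
```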
The choice of a similarity measure with respect to its sensitivity to outliers
- Authors: Rubinov, Alex , Sukhorukova, Nadezda , Ugon, Julien
- Date: 2010
- Type: Text , Journal article
- Relation: Dynamics of Continuous, Discrete and Impulsive Systems Series B: Applications and Algorithms Vol. 17, no. 5 (2010), p. 709-721
- Full Text:
- Reviewed:
- Description: This paper examines differences in the choice of similarity measures with respect to their sensitivity to outliers in clustering problems, formulated as mathematical programming problems. Namely, we focus on the study of norms (norm-based similarity measures) and convex functions of norms (function-norm-based similarity measures). The study consists of two parts: the study of theoretical models and numerical experiments. The main result of this study is a criterion for outlier sensitivity with respect to the corresponding similarity measure. In particular, the obtained results show that the norm-based similarity measures are not sensitive to outliers, whilst the very widely used square of the Euclidean norm similarity measure (least squares) is sensitive to outliers. Copyright © 2010 Watam Press.
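A one-dimensional example makes the paper's conclusion concrete (our illustration, not the paper's):

```python
# The least-squares centre (the mean) chases an outlier, while the
# norm-based centre (the median in 1-D) is largely unaffected.
import numpy as np

data = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
with_outlier = np.append(data, 100.0)

print(np.mean(data), np.mean(with_outlier))      # ~1.04 -> ~17.5
print(np.median(data), np.median(with_outlier))  # 1.0   -> 1.05
```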
Tourism clusters : Uncovering destination value chains
- Authors: Hollick, Mary , Braun, Patrice
- Date: 2006
- Type: Text , Conference paper
- Relation: Paper presented at CAUTHE 2006 conference - to the city and beyond, Melbourne, Victoria : 6th February, 2006 p. 476-485
- Full Text:
- Reviewed:
- Description: This paper discusses the role of tourism networks, clustering and destination value chains for micro and small and medium size tourism enterprises (SMEs) in freely assembled destinations. In discussing destination benefits and barriers surrounding SME clustering, SME positioning and performance are highlighted. It is proposed in this paper that SME clustering and value are not always naturally established. Successful destination clusters may be created by upgrading SME performance, analysing local value chains and matching both tangible and intangible sources of value, such as systems, leadership, relationships and brands with demand-side value segmentation.
- Description: E1
- Description: 2003001808
Modified global k-means algorithm for clustering in gene expression data sets
- Authors: Bagirov, Adil , Mardaneh, Karim
- Date: 2006
- Type: Text , Conference paper
- Relation: Paper presented at Intelligent Systems for Bioinformatics 2006, proceedings of the AI 2006 Workshop on Intelligent Systems of Bioinformatics, Hobart, Tasmania : 4th December, 2006
- Full Text:
- Reviewed:
- Description: Clustering in gene expression data sets is a challenging problem. Different algorithms for clustering of genes have been proposed. However, due to the large number of genes, only a few algorithms can be applied for the clustering of samples. The k-means algorithm and its different variations are among those algorithms. But these algorithms in general can converge only to local minima, and these local minima are significantly different from global solutions as the number of clusters increases. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm and its performance on large data sets. One of them is the global k-means algorithm. In this paper we develop a new version of the global k-means algorithm: the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. We present preliminary computational results using gene expression data sets which demonstrate that the modified global k-means algorithm improves, sometimes significantly, on the results obtained by the k-means and global k-means algorithms.
- Description: E1
- Description: 2003001713
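For illustration, a simplified sketch of the incremental scheme behind the global k-means family (our paraphrase in code; the authors' modified algorithm differs in how the starting point for each new centre is chosen):

```python
# Centres are added one at a time; each new centre starts from the point
# that is currently worst served (largest squared distance to its
# nearest existing centre). Not the authors' exact algorithm.
import numpy as np
from sklearn.cluster import KMeans

def incremental_kmeans(X, k_max):
    centres = X.mean(axis=0, keepdims=True)            # k = 1: the centroid
    for k in range(2, k_max + 1):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1).min(1)
        init = np.vstack([centres, X[np.argmax(d2)]])  # candidate new centre
        centres = KMeans(n_clusters=k, init=init, n_init=1).fit(X).cluster_centers_
    return centres
```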
An experiment in task decomposition and ensembling for a modular artificial neural network
- Authors: Ferguson, Brent , Ghosh, Ranadhir , Yearwood, John
- Date: 2004
- Type: Text , Conference paper
- Relation: Paper presented at Innovations in Applied Artificial Intelligence: 17th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Ottawa, Canada : 17th May, 2004
- Full Text:
- Reviewed:
- Description: Modular neural networks offer the possibility of overcoming common scalability and interference problems experienced by fully connected neural networks when applied to large databases. In this paper we trial an approach to constructing modular ANNs for a very large problem from CEDAR for the classification of handwritten characters. In our approach, we apply progressive task decomposition methods based upon clustering and regression techniques to find modules. We then test methods for combining the modules into ensembles and compare their structural characteristics and classification performance with those of an ANN having a fully connected topology. The results reveal improvements to classification rates as well as network topologies for this problem.
- Description: E1
- Description: 2003000852
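As a small illustration of the ensembling step (not the paper's code; the combination rule here, simple probability averaging, is an assumption):

```python
# Each trained module outputs class probabilities; the ensemble averages
# them and takes the most probable class.
import numpy as np

def ensemble_predict(module_probs):
    """module_probs: list of (n_samples, n_classes) arrays, one per module."""
    avg = np.mean(np.stack(module_probs), axis=0)
    return np.argmax(avg, axis=1)
```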
Unsupervised and supervised data classification via nonsmooth and global optimisation
- Authors: Bagirov, Adil , Rubinov, Alex , Sukhorukova, Nadezda , Yearwood, John
- Date: 2003
- Type: Text , Journal article
- Relation: Top Vol. 11, no. 1 (2003), p. 1-92
- Full Text:
- Reviewed:
- Description: We examine various methods for data clustering and data classification that are based on the minimization of the so-called cluster function and its modifications. These functions are nonsmooth and nonconvex. We use Discrete Gradient methods for their local minimization. We also consider a combination of this method with the cutting angle method for global minimization. We present and discuss results of numerical experiments.
- Description: C1
- Description: 2003000421
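For reference, the cluster function has the following standard form in this line of work (our transcription, not copied from the paper), where a^1, …, a^m are the data points and x^1, …, x^k the candidate cluster centres:

```latex
f_k(x^1, \dots, x^k) \;=\; \frac{1}{m} \sum_{i=1}^{m}
  \min_{s = 1, \dots, k} \left\| x^s - a^i \right\|
% The inner minimum makes f_k nonsmooth and, for k > 1, nonconvex.
```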
Small-to-medium enterprises and economic growth : A comparative study of clustering techniques
- Authors: Mardaneh, Karim
- Date: 2012
- Type: Text , Journal article
- Relation: Journal of Modern Applied Statistical Methods Vol. 11, no. 2 (2012), p. 469-478
- Full Text:
- Reviewed:
- Description: Small-to-medium enterprises (SMEs) in regional (non-metropolitan) areas are considered when economic planning may require large data sets and sophisticated clustering techniques. The economic growth of regional areas was investigated using four clustering algorithms. Empirical analysis demonstrated that the modified global k-means algorithm outperformed other algorithms. © 2012 JMASM, Inc.
- Description: 2003010429
Application of rank correlation, clustering and classification in information security
- Authors: Beliakov, Gleb , Yearwood, John , Kelarev, Andrei
- Date: 2012
- Type: Text , Journal article
- Relation: Journal of Networks Vol. 7, no. 6 (2012), p. 935-945
- Full Text:
- Reviewed:
- Description: This article is devoted to experimental investigation of a novel application of a clustering technique introduced by the authors recently in order to use robust and stable consensus functions in information security, where it is often necessary to process large data sets and monitor outcomes in real time, as it is required, for example, for intrusion detection. Here we concentrate on a particular case of application to profiling of phishing websites. First, we apply several independent clustering algorithms to a randomized sample of data to obtain independent initial clusterings. Silhouette index is used to determine the number of clusters. Second, rank correlation is used to select a subset of features for dimensionality reduction. We investigate the effectiveness of the Pearson Linear Correlation Coefficient, the Spearman Rank Correlation Coefficient and the Goodman-Kruskal Correlation Coefficient in this application. Third, we use a consensus function to combine independent initial clusterings into one consensus clustering. Fourth, we train fast supervised classification algorithms on the resulting consensus clustering in order to enable them to process the whole large data set as well as new data. The precision and recall of classifiers at the final stage of this scheme are critical for effectiveness of the whole procedure. We investigated various combinations of several correlation coefficients, consensus functions, and a variety of supervised classification algorithms. © 2012 Academy Publisher.
- Description: 2003010277
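For illustration, minimal fragments of two of the four steps, assuming scikit-learn/scipy implementations (not the authors' code):

```python
# Step 1: choose the number of clusters with the silhouette index.
# Step 2: rank features by Spearman rank correlation with the labels.
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_range):
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)

def rank_features(X, labels):
    """Absolute Spearman correlation of each feature with cluster labels."""
    return [abs(spearmanr(X[:, j], labels)[0]) for j in range(X.shape[1])]
```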
Applications of functional data analysis : A systematic review
- Authors: Ullah, Shahid , Finch, Caroline
- Date: 2013
- Type: Text , Journal article
- Relation: BMC Medical Research Methodology Vol. 13, no. 43 (2013), p.1-12
- Relation: http://purl.org/au-research/grants/nhmrc/565900
- Full Text:
- Reviewed:
- Description: Background: Functional data analysis (FDA) is increasingly being used to better analyze, model and predict time series data. Key aspects of FDA include the choice of smoothing technique, data reduction, adjustment for clustering, functional linear modeling and forecasting methods. Methods: A systematic review using 11 electronic databases was conducted to identify FDA application studies published in the peer-reviewed literature during 1995–2010. Papers reporting methodological considerations only were excluded, as were non-English articles. Results: In total, 84 FDA application articles were identified; 75.0% of the reviewed articles have been published since 2005. Application of FDA has appeared in a large number of publications across various fields of science; the majority relate to biomedical applications (21.4%). Overall, 72 studies (85.7%) provided information about the type of smoothing techniques used, with B-spline smoothing (29.8%) being the most popular. Functional principal component analysis (FPCA) for extracting information from functional data was reported in 51 (60.7%) studies. One-quarter (25.0%) of the published studies used functional linear models to describe relationships between explanatory and outcome variables, and only 8.3% used FDA for forecasting time series data. Conclusions: Despite its clear benefits for analyzing time series data, full appreciation of the key features and value of FDA has been limited to date, though the applications show its relevance to many public health and biomedical problems. Wider application of FDA to all studies involving correlated measurements should allow better modeling of, and predictions from, such data in the future, especially as FDA makes no a priori assumptions about age and time effects.
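For readers unfamiliar with the review's most commonly reported technique, a minimal scipy-based example of spline smoothing (illustrative only, not drawn from any reviewed study):

```python
# Smooth a noisy series with a cubic smoothing spline.
import numpy as np
from scipy.interpolate import UnivariateSpline

t = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * t) + 0.2 * np.random.default_rng(0).normal(size=t.size)

spline = UnivariateSpline(t, y, k=3, s=1.0)  # cubic spline, smoothing factor s
y_smooth = spline(t)
```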
A new perceptual dissimilarity measure for image retrieval and clustering
- Authors: Shojanazeri, Hamid
- Date: 2018
- Type: Text , Thesis , PhD
- Full Text:
- Description: Image retrieval and clustering are two important tools for analysing and organising images. A dissimilarity measure is central to both image retrieval and clustering, and the performance of image retrieval and clustering algorithms depends on its effectiveness. ‘Minkowski’ distance, or more specifically, ‘Euclidean’ distance, is the most widely used dissimilarity measure in image retrieval and clustering. Euclidean distance depends only on the geometric position of two data instances in the feature space and completely ignores the data distribution. However, data distribution has an effect on human perception: psychologists have argued that two data instances in a dense area are perceived as more dissimilar than the same two instances in a sparser area. Based on this idea, a dissimilarity measure called ‘mp’ has been proposed to address Euclidean distance’s limitation of ignoring the data distribution. mp relies on the data distribution to calculate the dissimilarity between two instances: higher data mass between two data instances implies higher dissimilarity, and vice versa. mp relies only on the data distribution and completely ignores the geometric distance in its calculations. In aggregating the dissimilarities between two instances over all the dimensions of the feature space, both Euclidean distance and mp give the same priority to all dimensions. This may result in the final dissimilarity between two data instances being determined by a few dimensions of the feature vectors with relatively much higher values, so the derived dissimilarity may not align well with human perception. The need to address the limitations of Minkowski distance measures, along with the importance of a dissimilarity measure that considers both geometric distance and the perceptual effect of data distribution, motivated this thesis. It studies the performance of mp for image retrieval, investigates a new dissimilarity measure that combines both Euclidean distance and data distribution, and studies the performance of such a measure for image retrieval and clustering. Our performance study of mp for image retrieval shows that relying only on the data distribution to measure dissimilarity results in situations where mp’s measurement is contrary to human perception. This thesis introduces a new dissimilarity measure called the perceptual dissimilarity measure (PDM). PDM considers the perceptual effect of data distribution in combination with Euclidean distance. PDM has two variants, PDM1 and PDM2. PDM1 focuses on improving mp by weighting it with Euclidean distance in situations where mp may not retrieve accurate results. PDM2 considers the effect of data distribution on the perceived dissimilarity measured by Euclidean distance, proposing a weighting system for Euclidean distance based on a logarithmic transform of data mass. The proposed PDM variants have been used as alternatives to Euclidean distance and mp to improve the accuracy of image retrieval. Our results show that PDM2 has consistently performed the best compared to Euclidean distance, mp and PDM1. PDM1’s performance was not consistent: although it performed better than mp in all the experiments, it could not outperform Euclidean distance in some cases.
Following the promising results of PDM2 in image retrieval, we have studied its performance for image clustering. k-means is the most widely used clustering algorithm in scientific and industrial applications, and k-medoids is the closest clustering algorithm to k-means. Unlike k-means, which works only with Euclidean distance, k-medoids gives the option to choose an arbitrary dissimilarity measure. We have used Euclidean distance, mp and PDM2 as the dissimilarity measure in k-medoids and compared the results with k-means. Our clustering results show that PDM2 has performed the best overall. This confirms our retrieval results and identifies PDM2 as a suitable dissimilarity measure for image retrieval and clustering.
- Description: Doctor of Philosophy
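A rough sketch of the ingredients of PDM2 as summarised above; the mass region (axis-aligned box spanned by the two points) and the exact weighting form below are our assumptions for illustration, not the thesis formula:

```python
# Euclidean distance weighted by a logarithmic transform of data mass.
import numpy as np

def data_mass(X, a, b):
    """Number of points in the axis-aligned box spanned by a and b."""
    lo, hi = np.minimum(a, b), np.maximum(a, b)
    return np.sum(np.all((X >= lo) & (X <= hi), axis=1))

def pdm2(X, a, b):
    euc = np.linalg.norm(a - b)
    mass = data_mass(X, a, b)
    return euc * np.log1p(mass)   # assumed log-weighting of the distance
```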
REPLOT : REtrieving Profile Links on Twitter for malicious campaign discovery
- Authors: Perez, Charles , Birregah, Babiga , Layton, Robert , Lemercier, Marc , Watters, Paul
- Date: 2015
- Type: Text , Journal article
- Relation: AI Communications Vol. 29, no. 1 (2015), p. 107-122
- Full Text:
- Reviewed:
- Description: Social networking sites are increasingly subject to malicious activities such as self-propagating worms, confidence scams and drive-by-download malwares. The high number of users associated with the presence of sensitive data, such as personal or professional information, is certainly an unprecedented opportunity for attackers. These attackers are moving away from previous platforms of attack, such as emails, towards social networking websites. In this paper, we present a full stack methodology for the identification of campaigns of malicious profiles on social networking sites, composed of maliciousness classification, campaign discovery and attack profiling. The methodology named REPLOT, for REtrieving Profile Links On Twitter, contains three major phases. First, profiles are analysed to determine whether they are more likely to be malicious or benign. Second, connections between suspected malicious profiles are retrieved using a late data fusion approach consisting of temporal and authorship analysis based models to discover campaigns. Third, the analysis of the discovered campaigns is performed to investigate the attacks. In this paper, we apply this methodology to a real world dataset, with a view to understanding the links between malicious profiles, their attack methods and their connections. Our analysis identifies a cluster of linked profiles focusing on propagating malicious links, as well as profiling two other major clusters of attacking campaigns. © 2016 - IOS Press and the authors. All rights reserved.
Industry type and business size on economic growth: Comparing Australia's Regional and Metropolitan areas
- Authors: Mardaneh, Karim
- Date: 2011
- Type: Text , Conference proceedings
- Relation: 56th Annual ICSB World Conference; Back to the Future - Changes in Perspectives of Global Entrepreneurship and Innovation,Stockholm, Sweden, 15-18 June, 2011
- Full Text:
- Reviewed:
- Description: While the main body of literature regarding small-to-medium enterprises is focused on formation and growth, there is insufficient research about the role of both (a) firm size and (b) location on economic growth. The role of firm size and industrial structure on economic growth has been examined by some researchers. Pagano (2003) and Pagano and Schivardi (2000) identified a positive association between average firm size and growth, and Carree and Thurik (1999) found evidence that the low number of large firms in an industry could lead to higher value-added growth. The current study attempts to investigate the impact of industry structure and businesses operating within these industries on economic growth. This paper uses the k-means clustering algorithm to cluster Statistical Local Areas. Regression analysis is utilised to identify drivers of economic growth. Preliminary results suggest that size of business may act as a driver of economic growth but the impact could vary based on location.
Small businesses, Institutions, and Regional Incomes
- Authors: Mardaneh, Karim , O'Malley, Tony
- Date: 2014
- Type: Text , Conference proceedings
- Relation: 59th ISCB World Conference, Entrepreneurship and sustainability, Dublin, 11th June, 2014
- Full Text:
- Reviewed:
- Description: Regional small businesses may rely on customers who earn income in local and global markets. Small business must transact with suppliers of knowledge and resources, transform those resources into innovative and saleable products or services, and transact with customers. Transformation, transaction and social activities, and the institutions which support them, are necessary for successful small businesses. Regional income and small businesses depend on innovation and trade provided by social and transaction institutions. In this paper we demonstrate this proposition empirically using a model and by investigating the relationship between regional income, transaction institutions, transformation institutions, and social institutions for 140 functional economic regions (FERs) in Australia. The model suggests that social institutions create a macro-environment in which transaction institutions and the transformation and trading activities of businesses can thrive, and help to generate regional income and prosperity. We follow others (Cooke et al., 2007) in arguing that strong transaction institutions are a necessary condition for regional innovation. Social institutions complement transaction institutions by providing education and training, arts and recreation, health care and social services. In the studies reported in this paper the capacity for search and intermediation of exchanges of all kinds (goods, services, knowledge etc.) is measured by the share of transaction institutions in regional employment. The capacity of social institutions is measured by the share of employment in social institutions. We argue that the market failures which cause regional failures to thrive may be made solvable by mobilising market making services to extend and provide governance for regional transactions with faraway markets.
An overview of geospatial methods used in unintentional injury epidemiology
- Authors: Singh, Himalaya , Fortington, Lauren , Thompson, Helen , Finch, Caroline
- Date: 2016
- Type: Text , Journal article
- Relation: Injury Epidemiology Vol. 3, no. 32 (2016), p. 1-12
- Relation: http://purl.org/au-research/grants/nhmrc/1058737
- Full Text:
- Reviewed:
- Description: BACKGROUND: Injuries are a leading cause of death and disability around the world. Injury incidence is often associated with socio-economic and physical environmental factors. The application of geospatial methods has been recognised as important to gain greater understanding of the complex nature of injury and the associated diverse range of geographically-diverse risk factors. Therefore, the aim of this paper is to provide an overview of geospatial methods applied in unintentional injury epidemiological studies. METHODS: Nine electronic databases were searched for papers published in 2000-2015, inclusive. Included were papers reporting unintentional injuries using geospatial methods for one or more categories of spatial epidemiological methods (mapping; clustering/cluster detection; and ecological analysis). Results describe the included injury cause categories, types of data and details relating to the applied geospatial methods. RESULTS: From over 6,000 articles, 67 studies met all inclusion criteria. The major categories of injury data reported with geospatial methods were road traffic (n = 36), falls (n = 11), burns (n = 9), drowning (n = 4), and others (n = 7). Grouped by categories, mapping was the most frequently used method, with 62 (93%) studies applying this approach independently or in conjunction with other geospatial methods. Clustering/cluster detection methods were less common, applied in 27 (40%) studies. Three studies (4%) applied spatial regression methods (one study using a conditional autoregressive model and two studies using geographically weighted regression) to examine the relationship between injury incidence (drowning, road deaths) with aggregated data in relation to explanatory factors (socio-economic and environmental). CONCLUSION: The number of studies using geospatial methods to investigate unintentional injuries has increased over recent years. While the majority of studies have focused on road traffic injuries, other injury cause categories, particularly falls and burns, have also demonstrated the application of these methods. Geospatial investigations of injury have largely been limited to mapping of data to visualise spatial structures. Use of more sophisticated approaches will help to understand a broader range of spatial risk factors, which remain under-explored when using traditional epidemiological approaches.
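For reference, the geographically weighted regression model mentioned in the results has the standard form (textbook notation, not reproduced from the review), where (u_i, v_i) are the coordinates of observation i and the coefficients vary over space:

```latex
y_i \;=\; \beta_0(u_i, v_i) + \sum_{k} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i
```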
Imbalanced data classification and its application in cyber security
- Authors: Moniruzzaman, Md
- Date: 2020
- Type: Text , Thesis , PhD
- Full Text:
- Description: Cyber security, also known as information technology security or simply as information security, aims to protect government organizations, companies and individuals by defending their computers, servers, electronic systems, networks, and data from malicious attacks. With the advancement of client-side, on-the-fly web content generation techniques, it becomes easier for attackers to modify the content of a website dynamically and gain access to valuable information. The impact of cybercrime on the global economy is greater now than ever, and it is growing day by day. Among various types of cybercrime, financial attacks are widespread and the financial sector is among the most targeted. Both corporations and individuals are losing a huge amount of money each year. The majority of financial attacks are carried out by banking malware and web-based attacks. End users are not always skilled enough to differentiate between injected content and the actual contents of a webpage. Designing a real-time security system for ensuring a safe browsing experience is a challenging task. Some of the existing solutions are designed for the client side, and all users have to install them on their systems, which is very difficult to implement. In addition, various platforms and tools are used by organizations and individuals; therefore, different solutions need to be designed. Existing server-side solutions often focus on sanitizing and filtering the inputs and will fail to detect obfuscated and hidden scripts. Such a system must operate in real time, and any significant delay will hamper the user experience. Therefore, finding the most optimized and efficient solution is very important. Ensuring easy installation and integration of any solution with the existing system is also a critical factor to consider: if a solution is efficient but difficult to integrate, it may not be feasible for practical use. Unsupervised and supervised data classification techniques have been widely applied to design algorithms for solving cyber security problems. The performance of these algorithms varies depending on the type of cyber security problem and the size of the data set. To date, existing algorithms do not achieve high accuracy in detecting malware activities. Data sets in cyber security, especially those from the financial sector, are predominantly imbalanced, as the number of malware activities is significantly smaller than the number of normal activities. This means that classifiers for imbalanced data sets can be used to develop supervised data classification algorithms to detect malware activities. The development of classifiers for imbalanced data sets has been a subject of research over the last decade. Most of these classifiers are based on oversampling and undersampling techniques and are not efficient in many situations, as such techniques are applied globally. In this thesis, we develop two new algorithms for solving supervised data classification problems on imbalanced data sets and then apply them to solve malware detection problems. The first algorithm is designed using piecewise linear classifiers, by formulating the problem as an optimization problem and applying the penalty function method. More specifically, we add more penalty to the objective function for misclassified points from minority classes. The second method is based on the combination of supervised and unsupervised (clustering) algorithms.
Such an approach allows one to identify areas in the input space where minority classes are located and to apply local oversampling or undersampling. This approach leads to the design of more efficient and accurate classifiers. The proposed algorithms are tested using real-world datasets. Results clearly demonstrate superiority of newly introduced algorithms. Then we apply these algorithms to design classifiers to detect malwares.
- Description: Doctor of Philosophy
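A minimal sketch of the penalty idea on a plain linear (logistic) model; the thesis itself uses piecewise linear classifiers, so this illustrates only the minority-class weighting:

```python
# Misclassification loss on the minority class (y == 1) is scaled up by
# a penalty factor in the training objective.
import numpy as np

def weighted_logistic_fit(X, y, minority_weight=10.0, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    b = 0.0
    sample_w = np.where(y == 1, minority_weight, 1.0)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = sample_w * (p - y)                  # penalty-weighted gradient
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b
```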
- Authors: Moniruzzaman, Md
- Date: 2020
- Type: Text , Thesis , PhD
- Full Text:
- Description: Cyber security, also known as information technology security or simply information security, aims to protect government organizations, companies and individuals by defending their computers, servers, electronic systems, networks, and data from malicious attacks. With the advancement of client-side, on-the-fly web content generation techniques, it has become easier for attackers to modify the content of a website dynamically and gain access to valuable information. The impact of cybercrime on the global economy is greater than ever and continues to grow. Among the various types of cybercrime, financial attacks are widespread and the financial sector is among the most heavily targeted; both corporations and individuals lose huge amounts of money each year. The majority of financial attacks are carried out by banking malware and web-based attacks, and end users are not always skilled enough to differentiate between injected content and the actual content of a webpage. Designing a real-time security system that ensures a safe browsing experience is a challenging task. Some existing solutions are designed for the client side and require every user to install them, which makes deployment difficult; moreover, because organizations and individuals use a variety of platforms and tools, different solutions must be designed for each. Existing server-side solutions often focus on sanitizing and filtering inputs and therefore fail to detect obfuscated and hidden scripts. Because such a system must operate in real time, any significant delay hampers the user experience, so finding the most efficient solution is very important. Ease of installation and of integration with existing systems is also critical: a solution that is efficient but difficult to integrate may not be feasible in practice. Unsupervised and supervised data classification techniques have been widely applied to design algorithms for solving cyber security problems. The performance of these algorithms varies with the type of cyber security problem and the size of the dataset, and to date existing algorithms do not achieve high accuracy in detecting malware activities. Datasets in cyber security, especially those from the financial sector, are predominantly imbalanced, as the number of malware activities is significantly smaller than the number of normal activities. This means that classifiers for imbalanced datasets can be used to develop supervised data classification algorithms that detect malware activities. The development of classifiers for imbalanced datasets has been a subject of research over the last decade; most of these classifiers are based on oversampling and undersampling techniques and are inefficient in many situations because such techniques are applied globally. In this thesis, we develop two new algorithms for solving supervised data classification problems on imbalanced datasets and then apply them to malware detection. The first algorithm uses piecewise linear classifiers, formulating the problem as an optimization problem and applying the penalty function method: we add a larger penalty to the objective function for misclassified points from minority classes. The second method combines supervised and unsupervised (clustering) algorithms: clustering identifies areas of the input space where minority classes are located, so that oversampling or undersampling can be applied locally. This approach leads to more efficient and accurate classifiers. The proposed algorithms are tested on real-world datasets, and the results clearly demonstrate the superiority of the newly introduced algorithms. We then apply these algorithms to design classifiers that detect malware. (A minimal sketch of the penalty idea follows this record.)
- Description: Doctor of Philosophy
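The penalty idea in the first algorithm lends itself to a brief illustration. The following is a minimal sketch, not the thesis implementation: it trains a single linear classifier (the thesis uses piecewise linear classifiers) with a class-weighted hinge loss, so that misclassified minority-class points contribute a larger penalty to the objective. All names and values (penalty, lr, n_epochs, lam) are illustrative assumptions.

```python
# Minimal sketch: subgradient descent on a class-weighted hinge loss,
# penalizing minority-class errors more heavily (an assumption-laden
# stand-in for the thesis's penalty function method).
import numpy as np

def train_weighted_linear(X, y, penalty=10.0, lr=0.01, n_epochs=200, lam=1e-3):
    """X : (n, d) features; y : labels in {-1, +1}, with +1 the minority class.

    `penalty` is the extra weight on hinge losses of minority-class points.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    weights = np.where(y == 1, penalty, 1.0)  # heavier penalty for minority errors
    for _ in range(n_epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                # inside the margin or misclassified
        grad_w = lam * w - (weights[active] * y[active]) @ X[active] / n
        grad_b = -np.sum(weights[active] * y[active]) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage: 95% majority (-1), 5% minority (+1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (190, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([-1] * 190 + [1] * 10)
w, b = train_weighted_linear(X, y)
pred = np.sign(X @ w + b)
print("minority recall:", np.mean(pred[y == 1] == 1))
```

Increasing `penalty` shifts the decision boundary toward the majority class, trading majority precision for minority recall, which mirrors the motivation for penalizing minority-class misclassifications more heavily.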
Partial undersampling of imbalanced data for cyber threats detection
- Moniruzzaman, Md, Bagirov, Adil, Gondal, Iqbal
- Authors: Moniruzzaman, Md , Bagirov, Adil , Gondal, Iqbal
- Date: 2020
- Type: Text , Conference proceedings , Conference paper
- Relation: 2020 Australasian Computer Science Week Multiconference, ACSW 2020
- Full Text:
- Reviewed:
- Description: Real-time detection of cyber threats is a challenging task in cyber security. With the advancement of technology and ease of access to the internet, more and more individuals and organizations are becoming targets of various cyber attacks such as malware, ransomware, and spyware. The goal of these attacks is to steal money or valuable information from the victims. Signature-based detection methods fail to keep up with constantly evolving new threats. Machine-learning-based detection has drawn more attention from researchers because of its ability to detect new and modified attacks based on the behaviour of previous attacks. The number of malicious activities in a given domain is significantly lower than the number of normal activities, so cyber threat detection datasets are imbalanced. In this paper, we propose a partial undersampling method to deal with imbalanced data when detecting cyber threats. © 2020 ACM. (A minimal illustrative sketch follows this record.)
- Description: E1
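As a rough illustration of the idea named in the title, the sketch below randomly subsamples the majority class down to a fixed multiple of the minority class size, i.e., it undersamples only partially rather than fully balancing the classes. The paper's actual selection strategy is not reproduced here; the `ratio` parameter and the uniform random choice are assumptions for illustration only.

```python
# Minimal sketch of partial undersampling: keep every minority point,
# keep only ratio * n_minority majority points (illustrative assumption).
import numpy as np

def partial_undersample(X, y, minority_label=1, ratio=3.0, seed=0):
    """Return a partially rebalanced copy of (X, y)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Usage: a 1000-sample set with 2% threats is reduced to a 3:1 class ratio.
X = np.random.default_rng(1).normal(size=(1000, 5))
y = np.array([1] * 20 + [0] * 980)
X_res, y_res = partial_undersample(X, y)
print(np.bincount(y_res))  # -> [60 20]
```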
Impact of node deployment and routing for protection of critical infrastructures
- Subhan, Fazli, Noreen, Madiha, Imran, Muhammad, Tariq, Moeenuddin, Khan, Asfandyar, Shoaib, Muhammad
- Authors: Subhan, Fazli , Noreen, Madiha , Imran, Muhammad , Tariq, Moeenuddin , Khan, Asfandyar , Shoaib, Muhammad
- Date: 2019
- Type: Text , Journal article
- Relation: IEEE Access Vol. 7, no. (2019), p. 11502-11514
- Full Text:
- Reviewed:
- Description: Recently, linear wireless sensor networks (LWSNs) have been attracting increasing attention because of their suitability for applications such as the protection of critical infrastructures. Most of these applications require an LWSN to remain operational for a long period. However, the limited, non-replenishable battery power of sensor nodes does not allow them to meet this expectation, and a short network lifetime is therefore one of the most prominent barriers to large-scale deployment of LWSNs. Unlike most existing studies, in this paper we analyze the impact of node placement and clustering on LWSN lifetime. First, we categorize and classify existing node placement and clustering schemes for LWSNs and introduce various topologies for disparate applications. Then, we highlight the peculiarities of LWSN applications, discuss their unique characteristics, and describe several application domains. We present three node placement strategies (linear sequential, linear parallel, and grid) and various deployment methods such as random, uniform, decreasing-distance, and triangular. Extensive simulation experiments analyze the performance of three state-of-the-art routing protocols in the context of these deployment strategies and methods. The experimental results demonstrate that the node deployment strategies and methods significantly affect LWSN lifetime. © 2013 IEEE. (A minimal sketch of the three placement strategies follows this record.)
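For concreteness, the sketch below generates 2D coordinates for the three placement strategies named in the abstract (linear sequential, linear parallel, and grid). The spacing, row counts, and the distance-to-sink comparison are illustrative assumptions, not the paper's simulation setup.

```python
# Minimal sketch of three LWSN node placement strategies as 2D coordinates
# along a linear infrastructure of a given length (units are illustrative).
import numpy as np

def linear_sequential(n, length):
    """n nodes in a single line along the infrastructure."""
    return np.column_stack([np.linspace(0, length, n), np.zeros(n)])

def linear_parallel(n, length, rows=2, spacing=10.0):
    """n nodes split over `rows` parallel lines offset by `spacing` metres."""
    per_row = n // rows
    pts = [np.column_stack([np.linspace(0, length, per_row),
                            np.full(per_row, r * spacing)]) for r in range(rows)]
    return np.vstack(pts)

def grid(n_cols, n_rows, length, spacing=10.0):
    """Regular n_cols x n_rows grid along the infrastructure."""
    xs, ys = np.meshgrid(np.linspace(0, length, n_cols),
                         np.arange(n_rows) * spacing)
    return np.column_stack([xs.ravel(), ys.ravel()])

# Usage: compare mean distance to a sink at the origin for each layout.
for name, pts in [("sequential", linear_sequential(20, 1000.0)),
                  ("parallel", linear_parallel(20, 1000.0)),
                  ("grid", grid(10, 2, 1000.0))]:
    print(name, "mean distance to sink:", np.linalg.norm(pts, axis=1).mean().round(1))
```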
Subgraph adaptive structure-aware graph contrastive learning
- Chen, Zhikui, Peng, Yin, Yu, Shuo, Cao, Chen, Xia, Feng
- Authors: Chen, Zhikui , Peng, Yin , Yu, Shuo , Cao, Chen , Xia, Feng
- Date: 2022
- Type: Text , Journal article
- Relation: Mathematics (Basel) Vol. 10, no. 17 (2022), p. 3047
- Full Text:
- Reviewed:
- Description: Graph contrastive learning (GCL) has attracted increasing attention and has been widely applied to numerous graph learning tasks such as node classification and link prediction. Although it has achieved great success, and in some tasks even outperforms supervised methods, most GCL methods depend on node-level comparison and ignore the rich semantic information contained in graph topology, especially for social networks. A higher-level comparison, however, requires subgraph construction and encoding, which remain unsolved problems. To address this, we propose a subgraph adaptive structure-aware graph contrastive learning method (PASCAL), a subgraph-level GCL method. In PASCAL, we construct subgraphs by merging all motifs that contain the target node, and we encode them on the basis of their motif number distribution to capture the rich information hidden in subgraphs. By incorporating motif information, PASCAL captures richer semantic information hidden in local structures than other GCL methods. Extensive experiments on six benchmark datasets show that PASCAL outperforms state-of-the-art graph contrastive learning and supervised methods in most cases. (A minimal illustrative sketch follows this record.)
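The subgraph construction step can be illustrated with a small sketch: for a target node, merge all triangle motifs containing it into one subgraph and summarise it with simple motif counts. PASCAL uses a richer motif set and a contrastive training objective, neither of which is reproduced here; the networkx fragment below only illustrates the merge-motifs idea.

```python
# Minimal sketch of motif-based subgraph construction: merge all triangles
# that contain the target node, then extract a tiny count-based feature vector.
import itertools
import networkx as nx

def motif_subgraph(G, target):
    """Union of all triangle motifs of G that contain `target`."""
    nodes = {target}
    for u, v in itertools.combinations(G.neighbors(target), 2):
        if G.has_edge(u, v):           # (target, u, v) forms a triangle motif
            nodes.update((u, v))
    return G.subgraph(nodes).copy()

def motif_features(sub):
    """Tiny feature vector: node count, edge count, triangle count."""
    n_triangles = sum(nx.triangles(sub).values()) // 3
    return [sub.number_of_nodes(), sub.number_of_edges(), n_triangles]

# Usage on a small benchmark graph.
G = nx.karate_club_graph()
sub = motif_subgraph(G, 0)
print(motif_features(sub))
```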