Partial undersampling of imbalanced data for cyber threats detection
- Authors: Moniruzzaman, Md , Bagirov, Adil , Gondal, Iqbal
- Date: 2020
- Type: Text , Conference proceedings , Conference paper
- Relation: 2020 Australasian Computer Science Week Multiconference, ACSW 2020
- Full Text:
- Reviewed:
- Description: Real-time detection of cyber threats is a challenging task in cyber security. With the advancement of technology and ease of access to the internet, more and more individuals and organizations are becoming targets of various cyber attacks such as malware, ransomware and spyware. The aim of these attacks is to steal money or valuable information from the victims. Signature-based detection methods fail to keep up with constantly evolving new threats. Machine learning based detection has drawn more attention from researchers due to its capability of detecting new and modified attacks based on the behaviour of previous attacks. The number of malicious activities in a given domain is significantly lower than the number of normal activities. Therefore, cyber threat detection data sets are imbalanced. In this paper, we propose a partial undersampling method to deal with imbalanced data for detecting cyber threats. © 2020 ACM.
- Description: E1
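The abstract above does not say how the partial undersampling method selects samples, so the following is only a minimal sketch of the general idea: keep every minority-class sample and a fraction of the majority class. It uses plain random undersampling as a stand-in; the function name `partial_undersample` and the `keep_fraction` parameter are hypothetical and not taken from the paper.

```python
import random


def partial_undersample(samples, labels, majority_label, keep_fraction, seed=0):
    """Keep all minority samples and a random fraction of the majority class.

    Assumption: random selection stands in for the paper's selection rule,
    which the abstract does not describe.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    # Number of majority samples to retain, never more than are available.
    n_keep = min(len(majority_idx), max(1, round(keep_fraction * len(majority_idx))))
    kept = sorted(rng.sample(majority_idx, n_keep) + minority_idx)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Usage: 90 normal events and 10 attacks, keeping half of the normal events.
samples = list(range(100))
labels = ["normal"] * 90 + ["attack"] * 10
X, y = partial_undersample(samples, labels, "normal", keep_fraction=0.5)
```

Keeping only part of the majority class, rather than shrinking it all the way down to the minority size, trades some residual imbalance for less information loss.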
Neighbourhood contrast : A better means to detect clusters than density
- Authors: Chen, Bo , Ting, Kaiming
- Date: 2018
- Type: Text , Conference paper
- Relation: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018; Melbourne, Australia; 3rd-6th June 2018; published in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 10939 LNAI, p. 401-412
- Full Text: false
- Reviewed:
- Description: Most density-based clustering algorithms suffer from large density variations among clusters. This paper proposes a new measure called Neighbourhood Contrast (NC) as a better alternative to density in detecting clusters. The proposed NC admits all local density maxima, regardless of their densities, to have similar NC values. Due to this unique property, NC is a better means to detect clusters in a dataset with large density variations among clusters. We provide two applications of NC. First, replacing density with NC in the current state-of-the-art clustering procedure DP leads to significantly improved clustering performance. Second, we devise a new clustering algorithm called Neighbourhood Contrast Clustering (NCC) which does not require density or distance calculations, and therefore has a linear time complexity in terms of dataset size. Our empirical evaluation shows that both NC-based methods outperform density-based methods including the current state-of-the-art.
A framework for clustering and dynamic maintenance of xml documents
- Authors: Al-Shammari, Ahmed , Liu, Chengfei , Naseriparsa, Mehdi , Vo, Bao , Anwar, Tarique , Zhou, Rui
- Date: 2017
- Type: Text , Conference paper
- Relation: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017 Vol. 10604 LNAI, p. 399-412
- Full Text: false
- Reviewed:
- Description: Web data clustering has been widely studied in the data mining communities. However, dynamic maintenance of web data clusters is still a challenging task. In this paper, we propose a novel framework called XClusterMaint which serves for both clustering and maintenance of XML documents. For clustering, we take both structure and content into account and propose an efficient solution for grouping the documents based on the combination of structure and content similarity. For maintenance, we propose an incremental approach for maintaining the existing clusters dynamically when we receive new incoming XML documents. Since the dynamic maintenance of the clusters is computationally expensive, we also propose an improved approach which uses a lazy maintenance scheme to improve the performance of cluster maintenance. The experimental results on real datasets verify the efficiency of the proposed clustering and maintenance model. © Springer International Publishing AG 2017.
Frequency decomposition based gene clustering
- Authors: Rahman, Md Abdur , Chetty, Madhu , Bulach, Dieter , Wangikar, Pramod
- Date: 2015
- Type: Text , Conference paper
- Relation: 22nd International Conference on Neural Information Processing, ICONIP 2015; Istanbul, Turkey; 9th-12th November 2015 Vol. 9490, p. 170-181
- Full Text: false
- Reviewed:
- Description: Gene expressions have been commonly applied to understand the inherent underlying mechanism of known biological processes. Although microarray gene expressions usually appear aperiodic, with proper signal processing techniques their periodic components can be easily obtained. Thus, if expressions of interconnected (regulatory and regulated) genes are decomposed, at least one common frequency component will appear in these genes. Exploiting this novel concept, we propose a frequency decomposition approach for gene clustering to better understand the gene interconnection topology. This method, based on the Hilbert-Huang Transform (HHT), enables us to segregate every periodic component of the gene expressions. Next, a multilevel clustering is performed based on these frequency components. Unlike existing clustering algorithms, the proposed method assimilates meaningful knowledge of the gene interaction topology. The information related to underlying gene interactions is vital and can prove useful in many existing evolutionary optimisation algorithms for genetic network reconstruction. We validate the entire approach by its application to a 15-gene synthetic network. © Springer International Publishing Switzerland 2015.
A new modification of Kohonen neural network for VQ and clustering problems
- Authors: Mohebi, Ehsan , Bagirov, Adil
- Date: 2013
- Type: Text , Conference paper
- Relation: Proceedings of the 11-th Australasian Data Mining Conference (AusDM'13) Vol. 146, p. 81-88
- Full Text: false
- Reviewed:
- Description: Vector Quantization (VQ) and Clustering are significantly important in the field of data mining and pattern recognition. The Self-Organizing Map (SOM) has been widely used for clustering and topology visualization. The topology of the SOM and its initial neurons play an important role in the convergence of the Kohonen neural network. In this paper, in order to improve the convergence of the SOM, we introduce an algorithm based on the splitting and merging of clusters to initialize neurons. We also introduce a topology based on this initialization to optimize the vector quantization error. Such an approach allows one to find a global or near-global solution to the vector quantization problem and, consequently, the clustering problem. Numerical results on 4 small to large real-world data sets are reported to demonstrate the performance of the proposed algorithm.
REPLOT: REtrieving profile links on Twitter for suspicious networks detection
- Authors: Perez, Charles , Birregah, Babiga , Layton, Robert , Lemercier, Marc , Watters, Paul
- Date: 2013
- Type: Text , Conference paper
- Relation: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013 p. 1307-1314
- Full Text: false
- Reviewed:
- Description: In the last few decades social networking sites have encountered their first large-scale security issues. The high number of users associated with the presence of sensitive data (personal or professional) is certainly an unprecedented opportunity for malicious activities. As a result, one observes that malicious users are progressively turning their attention from traditional e-mail to online social networks to carry out their attacks. Moreover, it is now observed that attacks are not only performed by individual profiles, but that on a larger scale, a set of profiles can act in coordination in making such attacks. The latter are referred to as malicious social campaigns. In this paper, we present a novel approach that combines authorship attribution techniques with a behavioural analysis for detecting and characterizing social campaigns. The proposed approach is performed in three steps: first, suspicious profiles are identified from a behavioural analysis; second, connections between suspicious profiles are retrieved using a combination of authorship attribution and temporal similarity; third, a clustering algorithm is performed to identify and characterise the suspicious campaigns obtained. We provide a real-life application of the methodology on a sample of 1,000 suspicious Twitter profiles tracked over a period of forty days. Our results show that a large set of suspicious profiles behaves in coordination (70%) and propagates mainly, but not only, trustworthy URLs on the online social network. Among the three largest detected campaigns, we have highlighted that one represents an important security issue for the platform by promoting a significant set of malicious URLs. Copyright 2013 ACM.
The impact of global and local features on multiple sequence alignment clustering-based near-duplicate video retrieval
- Authors: Wang, Yandan , Lu, Guojun , Belkhatir, Mohammed , Messom, Christopher
- Date: 2013
- Type: Text , Conference paper
- Relation: 14th Pacific-Rim Conference on Multimedia p. 669-677
- Full Text: false
- Reviewed:
- Description: Traditionally, the performance of Near-Duplicate Video Retrieval (NDVR) is enhanced through different video features, matching schemes and indexing methods. Video features have been intensively investigated, and it has been shown that local features outperform global features in terms of accuracy. However, local features are computationally expensive. Indexing structures are therefore introduced to assist in scaling up, although accuracy drops, slightly or dramatically, in most cases when indexing approaches are used. Recent progress shows that NDVR based on clustering can reduce the search space while maintaining retrieval accuracy equivalent to that of non-clustering-based NDVR. In this paper, we continue to evaluate clustering-based NDVR, but using popular global and local features. Before conducting NDVR, the dataset is pre-processed offline into groups using a clustering algorithm, so that near-duplicate videos (NDVs) are assembled in the same cluster. Each cluster is represented by a member video or the centroid. The query video is then compared to the representative videos instead of all videos in the database (as in the non-clustering-based approach). Our experiment shows that clustering-based NDVR using global and local features outperforms non-clustering-based NDVR in terms of both retrieval accuracy and speed.
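The two-stage lookup described in this abstract (compare the query with one representative per cluster, then search only inside the chosen cluster) can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: videos are reduced to short numeric feature vectors, Euclidean distance is assumed, and the names `nearest` and `clustered_search` are hypothetical.

```python
import math


def nearest(query, candidates):
    """Return the index of the candidate vector closest to `query` (Euclidean)."""
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))


def clustered_search(query, clusters, representatives):
    """Pick the closest cluster representative, then search only that cluster."""
    c = nearest(query, representatives)
    members = clusters[c]
    return c, members[nearest(query, members)]


# Usage: two offline-built clusters, each represented by its centroid.
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
representatives = [(0.5, 0.0), (10.5, 10.0)]
cluster_id, match = clustered_search((10.9, 10.0), clusters, representatives)
```

With this layout a query is compared against one representative per cluster plus the members of a single cluster, instead of against every item in the database.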
SMEs and the economic growth: A comparative study of clustering techniques in SMEs data analysis
- Authors: Mardaneh, Karim
- Date: 2012
- Type: Text , Conference paper
- Relation: Conference Proceedings: 57th ICSB World Conference
- Full Text: false
- Reviewed:
- Description: Regional economic planning of small-to-medium enterprises (SMEs) requires a thorough understanding of the industry structure and the size of business. The main body of the literature regarding SMEs is focused on formation and growth, as well as success and failure (Dejardin & Fritsch, 2010). Some studies have considered clustering regional areas based on functional specialisation but only a few studies have considered industry structure and the size of business (Okamuro, 2006). This area of study may require large data sets and sophisticated clustering techniques, which have not been used in SMEs research. Using the Australian data and a large data set for regional (non-metropolitan) areas, this current study attempts to investigate the relationship between the economic growth of geographical areas with the industry structure and size of the businesses within those areas. For this the study uses Ward’s, the k-means, global k-means, and the modified global k-means clustering algorithms to cluster the Statistical Local Areas (SLA), and compares the function of these algorithms to identify the algorithm that performs the clustering task of the SMEs data more efficiently. Resulting analysis of this comparative study demonstrates that the modified global k-means algorithm outperforms the other algorithms examined.
- Description: E1
An application of novel clustering technique for information security
- Authors: Beliakov, Gleb , Yearwood, John , Kelarev, Andrei
- Date: 2011
- Type: Text , Conference paper
- Relation: Applications and Techniques in Information Security Workshop p. 5-11
- Full Text: false
- Reviewed:
- Description: This article presents experimental results devoted to a new application of the novel clustering technique introduced by the authors recently. Our aim is to facilitate the application of robust and stable consensus functions in information security, where it is often necessary to process large data sets and monitor outcomes in real time, as is required, for example, for intrusion detection. Here we concentrate on the particular case of application to profiling of phishing websites. First, we apply several independent clustering algorithms to a randomized sample of data to obtain independent initial clusterings. The Silhouette index is used to determine the number of clusters. Second, we use a consensus function to combine these independent clusterings into one consensus clustering. Feature ranking is used to select a subset of features for the consensus function. Third, we train fast supervised classification algorithms on the resulting consensus clustering in order to enable them to process the whole large data set as well as new data. The precision and recall of classifiers at the final stage of this scheme are critical for the effectiveness of the whole procedure. We investigated various combinations of three consensus functions, Cluster-Based Graph Formulation (CBGF), Hybrid Bipartite Graph Formulation (HBGF), and Instance-Based Graph Formulation (IBGF), and a variety of supervised classification algorithms. The best precision and recall have been obtained by the combination of the HBGF consensus function and the SMO classifier with the polynomial kernel.
- Description: 2003009195
Dynamic Bayesian network modeling of cyanobacterial biological processes via gene clustering
- Authors: Nguyen, Vinh , Chetty, Madhu , Coppel, Ross , Wangikar, Pramod
- Date: 2011
- Type: Text , Conference paper
- Relation: 18th International Conference on Neural Information Processing, ICONIP 2011; Shanghai, China; 13th-17th November 2011; published in (Lecture Notes in Computer Science series) Vol. 7062 (1), p. 97-106
- Full Text: false
- Reviewed:
- Description: Cyanobacteria are photosynthetic organisms that are credited with both the creation and replenishment of the oxygen-rich atmosphere, and are also responsible for more than half of the primary production on earth. Despite their crucial evolutionary and environmental roles, the study of these organisms has lagged behind other model organisms. This paper presents preliminary results on our ongoing research to unravel the biological interactions occurring within cyanobacteria. We develop an analysis framework that leverages recently developed bioinformatics and machine learning tools, such as genome-wide sequence matching based annotation, gene ontology analysis, cluster analysis and dynamic Bayesian network. Together, these tools allow us to overcome the lack of knowledge of less well-studied organisms, and reveal interesting relationships among their biological processes. Experiments on the Cyanothece bacterium demonstrate the practicability and usefulness of our approach. © 2011 Springer-Verlag.
- Description: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2011, Vol.7062 (1), pp.97-106
A new modified global k-means algorithm for clustering large data sets
- Authors: Bagirov, Adil , Ugon, Julien , Webb, Dean
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at XIIIth International Conference : Applied Stochastic Models and Data Analysis, ASMDA 2009, Vilnius, Lithuania : 30th June - 3rd July 2009 p. 1-5
- Full Text: false
- Description: The k-means algorithm and its variations are known to be fast clustering algorithms. However, they are sensitive to the choice of starting points and inefficient for solving clustering problems in large data sets. Recently, in order to resolve difficulties with the choice of starting points, incremental approaches have been developed. The modified global k-means algorithm is based on such an approach. It iteratively adds one cluster center at a time. Numerical experiments show that this algorithm considerably improves on the k-means algorithm. However, this algorithm is not suitable for clustering very large data sets. In this paper, a new version of the modified global k-means algorithm is proposed. We introduce an auxiliary cluster function to generate a set of starting points spanning different parts of the data set. We exploit information gathered in previous iterations of the incremental algorithm to reduce its complexity.
- Description: 2003007558
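The incremental idea mentioned in this abstract (add one cluster center at a time, then refine with k-means) can be illustrated with the simplified sketch below. It is not the modified global k-means algorithm itself: the auxiliary cluster function is not specified in the abstract, so this sketch simply tries each data point as the new center and keeps the one that lowers the squared-error cost the most. All function names are hypothetical.

```python
import math


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def kmeans(points, centers, iters=100):
    """Standard k-means (Lloyd) refinement starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else old)
        if updated == centers:
            break
        centers = updated
    return centers


def incremental_kmeans(points, k):
    """Grow the solution from 1 to k centers, refining after each addition."""
    centers = [tuple(sum(x) / len(points) for x in zip(*points))]  # centroid
    while len(centers) < k:
        # Assumption: candidate new centers are the data points themselves.
        best = min(points, key=lambda p: cost(points, centers + [tuple(p)]))
        centers = kmeans(points, centers + [tuple(best)])
    return centers


# Usage: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(incremental_kmeans(points, 2))
```

Trying every data point as a candidate costs a full pass over the data per candidate, which is the kind of expense that makes such incremental schemes slow on very large data sets.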
Establishing phishing provenance using orthographic features
- Authors: Liping, Ma , Yearwood, John , Watters, Paul
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at 2009 eCrime Researchers Summit, eCRIME '09, Tacoma, Washington : 20th-21st October 2009
- Full Text:
- Description: After phishing message detection, determining the provenance of phishing messages and websites is the second step to tracing cybercriminals. In this paper, we present a novel method to cluster phishing emails automatically using orthographic features. In particular, we develop an algorithm to cluster documents and remove redundant features at the same time. After collecting all the possible features based on observation, we apply the modified global k-means method repeatedly, and generate the objective function values over a range of tolerance values across different subsets of features. Finally, we identify the appropriate clusters based on studying the distribution of the objective function values. Experimental evaluation of a large number of computations demonstrates that our clustering and feature selection techniques are highly effective and achieve reliable results.
- Description: 2003007842
Modified global k-means algorithm for clustering in gene expression data sets
- Authors: Bagirov, Adil , Mardaneh, Karim
- Date: 2006
- Type: Text , Conference paper
- Relation: Paper presented at Intelligent Systems for Bioinformatics 2006, proceedings of the AI 2006 Workshop on Intelligent Systems of Bioinformatics, Hobart, Tasmania : 4th December, 2006
- Full Text:
- Reviewed:
- Description: Clustering in gene expression data sets is a challenging problem. Different algorithms for clustering of genes have been proposed. However, due to the large number of genes, only a few algorithms can be applied for the clustering of samples. The k-means algorithm and its different variations are among those algorithms. But these algorithms in general can converge only to local minima, and these local minima are significantly different from global solutions as the number of clusters increases. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm and its performance on large data sets. One of them is the global k-means algorithm. In this paper we develop a new version of the global k-means algorithm: the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. We present preliminary computational results using gene expression data sets which demonstrate that the modified global k-means algorithm improves, sometimes significantly, on the results obtained by the k-means and global k-means algorithms.
- Description: E1
- Description: 2003001713
Tourism clusters : Uncovering destination value chains
- Authors: Hollick, Mary , Braun, Patrice
- Date: 2006
- Type: Text , Conference paper
- Relation: Paper presented at CAUTHE 2006 conference - to the city and beyond, Melbourne, Victoria : 6th February, 2006 p. 476-485
- Full Text:
- Reviewed:
- Description: This paper discusses the role of tourism networks, clustering and destination value chains for micro and small and medium size tourism enterprises (SMEs) in freely assembled destinations. In discussing destination benefits and barriers surrounding SME clustering, SME positioning and performance are highlighted. It is proposed in this paper that SME clustering and value are not always naturally established. Successful destination clusters may be created by upgrading SME performance, analysing local value chains and matching both tangible and intangible sources of value, such as systems, leadership, relationships and brands with demand-side value segmentation.
- Description: E1
- Description: 2003001808
A CAD system using clustering and novel feature extraction technique
- Authors: Ghosh, Ranadhir , Ghosh, Moumita , Yearwood, John
- Date: 2005
- Type: Text , Conference paper
- Relation: Paper presented at CISTM 2005, Gurgaon, India : 24th - 26th July, 2005
- Full Text: false
- Reviewed:
- Description: Previous efforts have used many different approaches to recognition in breast cancer detection, based on various ANN classifier-modelling techniques. Most of the previous work concentrated on the classification of the damaged areas with the help of the doctor’s suggestion. Doctors mark the suspicious areas in the mammogram, and the classifier only extracts those marked areas and tries to classify them. An intelligent automatic diagnosis system can be very helpful for radiologists in diagnosing breast cancer. In this research we apply a local search gradient-free clustering algorithm to find the suspicious / damaged area. We compare our results with the doctor’s marking. It has also been observed that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance. Moreover, the choice of features to represent the patterns affects several aspects of pattern recognition problems, such as accuracy, required learning time and the necessary number of samples. A common problem with multi-category feature classification is the conflict between the categories. None of the feasible solutions allows a simultaneously optimal solution for all categories. In order to find an optimal solution, the search space can be divided based on an individual category in each sub-region, with the results finally merged through a decision support system. Combining feature selection with the classifier has been a major challenge for researchers. A similar technique employed at both levels often worsens their performance. Some preliminary studies have revealed that while a traditional canonical GA is a good choice for the feature selection module, it underperforms in the classifier level module. An evolutionary algorithm at the classifier level provides a much better solution for this purpose.
In this paper we propose a hybrid canonical based feature extraction technique with a combination of evolutionary algorithm based classifier using a feed forward MLP model.
- Description: E1
- Description: 2003001369
A hybrid clustering algorithm using two level of abstraction
- Authors: Ghosh, Ranadhir , Mammadov, Musa , Ghosh, Moumita , Yearwood, John
- Date: 2005
- Type: Text , Conference paper
- Relation: Paper presented at Fuzzy Logic, Soft Computing, and Computational Intelligence, 11th International Fuzzy Systems Association World Congress, Beijing, China : 28th - 31st July, 2005
- Full Text: false
- Reviewed:
- Description: E1
- Description: 2003001360
An experiment in task decomposition and ensembling for a modular artificial neural network
- Authors: Ferguson, Brent , Ghosh, Ranadhir , Yearwood, John
- Date: 2004
- Type: Text , Conference paper
- Relation: Paper presented at Innovations in Applied Artificial Intelligence: 17th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Ottawa, Canada : 17th May, 2004
- Full Text:
- Reviewed:
- Description: Modular neural networks have the possibility of overcoming common scalability and interference problems experienced by fully connected neural networks when applied to large databases. In this paper we trial an approach to constructing modular ANNs for a very large problem from CEDAR for the classification of handwritten characters. In our approach, we apply progressive task decomposition methods based upon clustering and regression techniques to find modules. We then test methods for combining the modules into ensembles and compare their structural characteristics and classification performance with those of an ANN having a fully connected topology. The results reveal improvements to classification rates as well as network topologies for this problem.
- Description: E1
- Description: 2003000852
Two level clustering using SOM and dynamical systems
- Authors: Ghosh, Ranadhir , Mammadov, Musa , Ghosh, Moumita , Yearwood, John
- Date: 2004
- Type: Text , Conference paper
- Relation: Paper presented at ICOTA6: 6th International Conference on Optimization - Techniques and Applications, Ballarat, Victoria : 9th December, 2004
- Full Text: false
- Reviewed:
- Description: E1
- Description: 2003000871