Accurate and efficient clustering algorithms for very large data sets

Quddus, Syed

Title: Accurate and efficient clustering algorithms for very large data sets
Creator: Quddus, Syed
Date: 2017
Type: Text; Thesis; PhD
Identifier: http://researchonline.federation.edu.au/vital/access/HandleResolver/1959.17/162586
Identifier: vital:12701
Identifier: https://library.federation.edu.au/record=b2746800
Abstract: The ability to mine and extract useful information from large data sets is a common concern for organizations. The volume of data on the internet is increasing rapidly, and so is the importance of developing new approaches to collect, store and mine large amounts of data. Clustering is one of the main tasks in data mining. Many clustering algorithms have been proposed, but some clustering problems have still not been addressed in depth, especially those arising in large data sets. Clustering in large data sets is important in many applications, including network intrusion detection systems, fraud detection in banking systems, air traffic control, web logs, sensor networks, social networks and bioinformatics. Data sets in these applications contain from hundreds of thousands to hundreds of millions of data points, and they may contain hundreds or thousands of attributes. Recent developments in computer hardware make it possible to store data sets with hundreds of thousands and even millions of data points in random access memory and to read them repeatedly. This makes it possible to apply existing clustering algorithms to such data sets. However, these algorithms require prohibitively large CPU time and fail to produce an accurate solution. Therefore, it is important to develop clustering algorithms that are accurate and can provide real-time clustering in such data sets. This is especially important in the big data era. The aim of this PhD study is to develop accurate, real-time algorithms for clustering in very large data sets containing hundreds of thousands to millions of data points. These algorithms are based on combining heuristic algorithms with the incremental approach. They also involve a special procedure that identifies dense areas in a data set and computes a subset of the most informative representative data points in order to decrease the size of the data set. 
The study focuses on center-based clustering algorithms. The success of these algorithms depends strongly on the choice of starting cluster centers. Different procedures are proposed to generate such centers. Special procedures are designed to identify the most promising starting cluster centers and to restrict their number. The new clustering algorithms are evaluated using large data sets available in the public domain. Their results will be compared with those obtained using several existing center-based clustering algorithms.; Doctor of Philosophy
Publisher: Federation University Australia
Rights: Copyright Syed Abdul Quddus
Rights: Open Access
Rights: This metadata is freely available under a CC0 license
Subject: Clustering algorithms; Very large data sets; Data mining
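
Illustrative note: the abstract describes center-based clustering built up incrementally, with the outcome depending strongly on the starting cluster centers. The sketch below is not the thesis's algorithm; it is a minimal, generic example of that idea under stated assumptions. Centers are added one at a time, each new starting center is chosen by a hypothetical farthest-point rule, and standard k-means (Lloyd) iterations refine all centers after each addition. The function names (`lloyd`, `incremental_kmeans`) and the starting-point rule are assumptions made for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def lloyd(points, centers, max_iter=100):
    """Standard k-means refinement starting from the given centers."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Move each center to the mean of its group; keep empty ones in place.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def incremental_kmeans(points, k):
    """Grow the solution from 1 to k centers, refining after each addition."""
    centers = [mean(points)]  # the optimal single center is the data mean
    while len(centers) < k:
        # Assumed rule (not from the thesis): start the new center at the
        # point that is farthest from its nearest existing center.
        start = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers = lloyd(points, centers + [start])
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    print(sorted(incremental_kmeans(data, 2)))
```

Each assignment pass scans every point against every center, which is the cost that becomes prohibitive on very large data sets and which the abstract's data-reduction step (keeping only representative points from dense areas) is meant to reduce.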