List of Titles

Machine learning algorithms for analysis of DNA data sets

Authors: Yearwood, John; Bagirov, Adil; Kelarev, Andrei
Date: 2012
Type: Text, Book chapter
Relation: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques p. 47-58
Relation: http://purl.org/au-research/grants/arc/LP0990908
Full Text: false
Description: The applications of machine learning algorithms to the analysis of data sets of DNA sequences are very important. The present chapter is devoted to the experimental investigation of applications of several machine learning algorithms for the analysis of a JLA data set consisting of DNA sequences derived from non-coding segments in the junction of the large single copy region and inverted repeat A of the chloroplast genome in Eucalyptus, collected by Australian biologists. Data sets of this sort represent a new situation, where sophisticated alignment scores have to be used as a measure of similarity. The alignment scores do not satisfy the properties of a Minkowski metric, and new machine learning approaches have to be investigated. The authors' experiments show that machine learning algorithms based on local alignment scores achieve very good agreement with known biological classes for this data set. A new machine learning algorithm based on graph partitioning performed best for clustering of the JLA data set. The authors' novel k-committees algorithm produced the most accurate results for classification. Two new examples of synthetic data sets demonstrate that the k-committees algorithm can outperform both the Nearest Neighbour and k-medoids algorithms simultaneously.

Optimization methods and the k-committees algorithm for clustering of sequence data

Authors: Yearwood, John; Bagirov, Adil; Kelarev, Andrei
Date: 2009
Type: Text, Journal article
Relation: Applied and Computational Mathematics Vol. 8, no. 1 (2009), p. 92-101
Relation: http://purl.org/au-research/grants/arc/DP0211866
Relation: http://purl.org/au-research/grants/arc/DP0666061
Full Text: false
Description: The present paper is devoted to new algorithms for unsupervised clustering based on the optimization approaches due to [2], [3] and [4]. We consider a novel situation, where the datasets consist of nucleotide or protein sequences and rather sophisticated, biologically significant alignment scores have to be used as a measure of distance. Sequences of this kind cannot be regarded as points in a finite-dimensional space. Moreover, the alignment scores do not satisfy the properties of Minkowski metrics. Nevertheless, the optimization approaches have made it possible to introduce a new k-committees algorithm and compare its performance with previous algorithms on two datasets. Our experimental results show that the k-committees algorithm achieves intermediate accuracy for a dataset of ITS sequences, and that it can perform better than the discrete k-means and Nearest Neighbour algorithms for certain datasets. All three algorithms achieve good agreement with clusters previously published in the biological literature and can be used to obtain biologically significant clusterings.
